Thursday 6 December 2012

Wellcome Trust Hack: Data, Tools, and Statistics

These are links to data sources (some of them huge), with some notes about each.

The UK Biobank
Finding your way around the Data
Data Included
The Avon Longitudinal Study of Parents and Children (ALSPAC)
The Economic and Social Data Service

  • ESDS Government: large-scale government surveys, such as the Labour Force Survey and the General Household Survey
  • ESDS International: multi-nation aggregate databanks, such as World Bank data, and survey data, such as the Eurobarometers and World Values Surveys
  • ESDS Longitudinal: major UK surveys following individuals over time, such as the British Household Panel Survey
  • ESDS Qualidata: a range of multimedia qualitative data sources
  • historical data from the History Data Service
  • environmental data available from the Relu knowledge portal
  • confidential and potentially disclosive data available from the Secure Data Service
The East of England Public Health Observatory
  • Data in XLS format, no signup.
  • Enormous amounts of data "comprising inpatient, accident and emergency, outpatient and patient reported outcome measures".
  • Various formats, requires mandatory signup.
  • Leads to various places, some superseded by this site.
  • Some data requires signup, some does not.
  • PDFs detailing responses to a questionnaire about lifestyle.
  • XLSs containing "record level data about NHS services delivered to over a million people with severe and enduring mental health problems each year between 2003 and 2008".
  • Reports in PDF format.
  • No data
  • A really colossal number of health and social care indicators.
Statistical Parametric Mapping
Wellcome Trust Sanger Institute
NHS Statistics & Data Collections
US National Library of Medicine – Data, Tools, and Statistics

The UK Biobank is a project that collected detailed information on 500,000 people from the UK, aged 40 to 69 and recruited between 2006 and 2010. More information on the study itself can be found in the invitation PDF. The user guide explains how to use the biobank, but it is quite long, so you might want to use this document for reference instead.
The biobank has essentially three modes for accessing the data:
Browse is a simple tree of data that is ordered by method of capture.
Search looks for whole-word matches of what you submit. An asterisk * acts as a wildcard that must match at least one character (so sc*ar does not match scar), and phrases can be searched by wrapping them in quotation marks.
By default it only returns results matching all of the input words; this can be changed with the boolean operators | (or) and & (and).
Catalogue allows you to explore the data by type and also lists all documents.
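The Search rules described above can be sketched in a few lines of Python. This is only an illustration of the matching rules as I understand them, not the biobank's actual implementation: the function names are made up, and both the case-insensitive matching and the choice to evaluate | after & are assumptions.

```python
import re


def term_to_regex(term):
    """Turn one search term (word, wildcard word, or phrase) into a regex.

    An asterisk stands for one or more word characters, so 'sc*ar'
    matches 'scholar' but not 'scar'. Word boundaries on both sides
    enforce whole-word matching.
    """
    parts = [re.escape(part) for part in term.split("*")]
    return re.compile(r"\b" + r"\w+".join(parts) + r"\b", re.IGNORECASE)


def matches(query, text):
    """Return True if text satisfies query.

    Words are ANDed by default, '&' is an explicit AND, and '|' separates
    alternatives. Splitting on '|' first is an assumption of this sketch.
    """
    for alternative in query.split("|"):
        # Double-quoted phrases become a single term; '&' is just a separator.
        found = re.findall(r'"([^"]+)"|([^\s"]+)', alternative.replace("&", " "))
        terms = [phrase or word for phrase, word in found]
        if terms and all(term_to_regex(t).search(text) for t in terms):
            return True
    return False


print(matches("sc*ar", "a visiting scholar"))     # wildcard matches "hol"
print(matches("sc*ar", "a small scar"))           # wildcard needs a character
print(matches("glucose | pressure", "blood pressure"))
```

Note that a trailing wildcard such as heart* would match hearts but not heart under these rules, because the asterisk has to consume at least one character.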
When you reach a data field, you are presented with a graph and some information. Clicking on a particular piece of information explains what it means. On the left-hand side is information on the distribution of the data.
[from the user guide, 2.1]
[...] Information on a participant’s health and lifestyle, hearing and cognitive function, collected through a touchscreen questionnaire and brief verbal interview. A range of physical measurements were also performed, and which included: blood pressure; arterial stiffness; eye measures (visual acuity, refractometry, intraocular pressure, optical coherence tomography); body composition measures (including impedance); hand-grip strength; ultrasound bone densitometry; spirometry; and an exercise/fitness test with ECG. Samples of blood, urine and saliva were also collected.
ALSPAC is a study of 14,000 pregnant women whose date of delivery fell between April 1991 and December 1992. The women and their children have been followed up since, and detailed data has been collected throughout childhood.
The data comprises a series of postal questionnaires of five types:
  • Carers (usually mothers)
  • Partners
  • Child based (filled out by carer)
  • Child answered
  • School
There were also face-to-face clinical assessments for a random 10% sample of the cohort children. These children, known as the Children In Focus (CIF) group, were invited to attend clinics from 7 years old to undergo tests.
The feature of the ESDS website you will find most useful is the Data Catalogue, which the ESDS describes as follows:
The Data Catalogue is an integrated catalogue which contains information on over 5,000 datasets covering an extensive range of key economic and social data, both quantitative and qualitative, spanning many disciplines and themes. The data include those supported by the specialist services:
The catalogue also contains:
To use this data (assuming you have not been issued an account by an institute of higher education) you must fill in this form at the UK Data Archive website.
The ERPHO releases these datasets on public health (click 'view' to download):
The Wellcome Trust has provided some model files for training a specialised kind of learning algorithm for analysing functional anatomy and disease-related changes in fMRI scans, which as far as I can tell express brain activity in terms of a mass of 3D vectors of depolarisation over time. From the introductory notes:
Statistical parametric mapping is generally used to identify functionally specialized brain responses and is the most prevalent approach to characterizing functional anatomy and disease-related changes.  The alternative perspective, namely that provided by functional integration, requires a different set of [multivariate] approaches that examine the relationship among changes in activity in one brain area others.  Statistical parametric mapping is a voxel-based approach, employing classical inference, to make some comment about regionally specific responses to experimental factors.  In order to assign an observed response to a particular brain structure, or cortical area, the data must conform to a known anatomical space.  Before considering statistical modeling, this chapter deals briefly with how a time-series of images are realigned and mapped into some standard anatomical space (e.g. a stereotactic space).  The general ideas behind statistical parametric mapping are then described and illustrated with attention to the different sorts of inferences that can be made with different experimental designs.
More information on this, including a manual and a video course, can be found here.
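To give a feel for what "a voxel-based approach, employing classical inference" means in the notes quoted above, here is a toy sketch that computes one t statistic per voxel, comparing scans taken during a task with scans taken at rest. It is a much simpler stand-in for what the SPM software does, the function names are my own, and the numbers are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(a, b):
    """Welch's t statistic for two independent samples."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def t_map(scans, is_task):
    """Compute one t statistic per voxel (task scans versus rest scans).

    scans:   list of scans, each a flat list of voxel intensities
    is_task: list of booleans, True where the scan was taken during the task
    """
    statistics_per_voxel = []
    for voxel in range(len(scans[0])):
        task = [scan[voxel] for scan, flag in zip(scans, is_task) if flag]
        rest = [scan[voxel] for scan, flag in zip(scans, is_task) if not flag]
        statistics_per_voxel.append(two_sample_t(task, rest))
    return statistics_per_voxel


# Invented data: six scans of a two-voxel "brain".
# Voxel 0 is brighter during the task, voxel 1 is unaffected.
scans = [
    [10.0, 5.0], [10.2, 5.1], [9.8, 4.9],   # rest
    [12.0, 5.0], [12.2, 4.9], [11.8, 5.1],  # task
]
is_task = [False, False, False, True, True, True]

print(t_map(scans, is_task))
```

A large t value for a voxel suggests it responds to the task. The sketch ignores everything that makes the real problem hard, such as realigning the images, mapping them into a standard anatomical space, and accounting for the enormous number of voxels being tested at once.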
The sequenced genomes of various species can be downloaded from here.
A variety of medical data and statistics in CSV and XLS format, with reports in PDF.
An enormous list of data in different formats.
