5.1. Available Data - neurohub/neurohub_documentation GitHub Wiki
This section provides a summary of the data available through the Digital Research Alliance of Canada (the Alliance).
All data is stored on the Beluga cluster under the rpp-aevans-ab allocation in the following directory:
/lustre03/project/6008063/neurohub/UKB
Tabular
The tabular files contain data that can be summarized by a few entries (e.g. age, blood pressure). The available data fields are summarized in the Data Dictionary (right-click -> save -> open .html file in your browser), and the most recent version of the data is stored in Tabular/. The Unique Data Identifier (UDI) for each piece of data consists of three parts: [Datafield]-[Instance Index].[Array Index]:
- Datafield refers to the type of data: 2207 is the datafield for whether the subject wears glasses or contact lenses.
- Instance Index refers to the instance of data acquisition. 0-1 are the initial and repeat assessments. 2-3 are the imaging and repeat imaging visits.
- Array Index refers to the index within an array. Some datafields have multiple values (e.g. 3060-0.0, 3060-0.1, 3060-0.2), and these are stored separately.
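The three-part UDI layout described above can be taken apart with plain shell parameter expansion. The following is a minimal sketch; the function name parse_udi is hypothetical and not part of any NeuroHub tooling.

```shell
# Split a UDI of the form [Datafield]-[Instance Index].[Array Index]
# into its three parts using POSIX parameter expansion.
# parse_udi is a hypothetical helper name used for illustration only.
parse_udi() {
    udi=$1
    datafield=${udi%%-*}     # text before the first '-'
    rest=${udi#*-}           # text after the first '-'
    instance=${rest%%.*}     # text before the first '.'
    array=${rest#*.}         # text after the first '.'
    printf '%s %s %s\n' "$datafield" "$instance" "$array"
}

parse_udi 3060-0.1   # prints: 3060 0 1
```

The output order is datafield, instance index, array index, which makes it easy to feed into read or awk.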
A number of formats are available: csv, sas, stata, and r.
- .csv: Comma-separated values. General-purpose format. Each row describes a subject; each column describes a datafield.
- .sas/.sd2: Format for the SAS statistical analysis package.
- .stata: Format for the Stata statistical analysis package.
- .r/.tab: Format for use with R.
- .txt: Tab-delimited values. Similar to .csv.
- .bulk: List of bulk fields per participant.
- .html: Documentation about the field data dictionary and encodings.
Detailed information on how to explore the fields and columns is available in the README document in the Tabular folder:
/lustre03/project/6008063/neurohub/UKB/Tabular/README
On Beluga, you can open the HTML file (the data dictionary) with a tool like w3m, where you can search the metadata (including field ID and column number):
w3m current.html
To explore a TSV-formatted version of this data, you can use a command-line tool like awk to print the contents of a particular column, for example the 2000th:
awk -F'\t' '{ print $2000 }' current.tab | more
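The awk command above selects a column by position, so the position usually has to be looked up first. The following is a minimal sketch of doing that lookup by name. It assumes the first row of the tab-delimited file is a header row of column names. The helper names col_index and print_col, and the column name f.2207.0.0, are hypothetical; check the README and the data dictionary for the actual names.

```shell
# Print the 1-based position of a named column in a tab-delimited file.
# Assumes the first row is a header; prints nothing if the name is absent.
col_index() {
    awk -F'\t' -v name="$2" 'NR == 1 {
        for (i = 1; i <= NF; i++) if ($i == name) { print i; exit }
        exit
    }' "$1"
}

# Print one named column (header included) from a tab-delimited file.
print_col() {
    idx=$(col_index "$1" "$2")
    [ -n "$idx" ] || { echo "column not found: $2" >&2; return 1; }
    awk -F'\t' -v c="$idx" '{ print $c }' "$1"
}

# Example with a hypothetical column name:
# print_col current.tab f.2207.0.0 | more
```

Only the header row is read to find the position, so the lookup stays cheap even on a very wide table.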
UK Biobank Tabular Preprocessing tool
Because the UK Biobank tabular data is a massive table of over 20 thousand columns for 0.5 million participants, the Computational Brain Anatomy Laboratory at the Douglas Institute has developed a tool to manage it effectively. The tool helps you query data by transforming the raw tabular data supplied by the UKBB into a format more suitable for analysis in Python/R/MATLAB.
The tool also complements the NeuroHub LORIS DQT.
More information and scripts on how to handle the vast tabular data can be found here.
Versions
The data is periodically updated. Old versions are stored in tabular/archive/, with the directories identified by the code for the basket. When subjects withdraw, old versions are also purged, and researchers are expected to remove withdrawn subjects from any local subsets.
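Removing withdrawn subjects from a local subset can be done in a single awk pass over two files. The following is a minimal sketch. It assumes the withdrawal list has one participant ID per line and that the local subset is a CSV with a header row and the participant ID in its first column. The function name drop_withdrawn and the file names in the usage line are hypothetical.

```shell
# Drop rows whose first-column ID appears in a withdrawal list.
# Usage: drop_withdrawn withdrawn_ids.txt subset.csv > subset_clean.csv
# Assumes: one ID per line in the list; CSV has a header row and the
# participant ID in column 1 (both are assumptions, adjust as needed).
drop_withdrawn() {
    awk -F',' '
        FILENAME == ARGV[1] { gone[$1] = 1; next }  # first file: remember withdrawn IDs
        FNR == 1            { print; next }         # second file: always keep the header
        { id = $1; gsub(/"/, "", id) }              # tolerate quoted IDs
        !(id in gone)                               # print rows that were not withdrawn
    ' "$1" "$2"
}
```

Comparing FILENAME against ARGV[1], rather than the common NR == FNR idiom, keeps the filter correct when the withdrawal list is empty.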
Latest data fields and category addition
The latest data available on Beluga have been downloaded from the UK Biobank’s cloud-based Research Analysis Platform (UKB-RAP).
You can access them in the following directory:
/lustre03/project/6008063/neurohub/UKB/Tabular/RAP
Category | Datafield | Instance | Description |
---|---|---|---|
2406 | 131022 | NA | Date G20 first reported (parkinson's disease) |
2406 | 131023 | NA | Source of report of G20 (parkinson's disease) |
100 | 24419 | 2-3 | Measure of head motion in T1 structural image |
1839 | 30900 | 0-2-3 | Number of proteins measured |
110 | 26500 | 2-3 | T2-FLAIR used (in addition to T1) to run FreeSurfer |
154 | 120090 | NA | Scale of how much relief pain treatments or medications have provided in the last 24 hours |
100094 | 22189 | NA | Townsend deprivation index at recruitment |
301 | 26206 | NA | Standard PRS for alzheimer's disease (AD) |
302 | 26207 | NA | Enhanced PRS for alzheimer's disease (AD) |
999 | 41000 | NA | Case-control status for COVID19 imaging repeat |
999 | 41001 | NA | Source of positive COVID test result |
157 | 24100 | 2-3 | LV end diastolic volume |
157 | 24101 | 2-3 | LV end systolic volume |
157 | 24102 | 2-3 | LV stroke volume |
157 | 24103 | 2-3 | LV ejection fraction |
157 | 24105 | 2-3 | LV myocardial mass |
157 | 24110 | 2-3 | LA maximum volume |
157 | 24111 | 2-3 | LA minimum volume |
157 | 24112 | 2-3 | LA stroke volume |
157 | 24113 | 2-3 | LA ejection fraction |
157 | 24118 | 2-3 | Ascending aorta maximum area |
157 | 24119 | 2-3 | Ascending aorta minimum area |
157 | 24120 | 2-3 | Ascending aorta distensibility |
157 | 24121 | 2-3 | Descending aorta maximum area |
157 | 24122 | 2-3 | Descending aorta minimum area |
157 | 24123 | 2-3 | Descending aorta distensibility |
112 | 24485 | 2-3 | Total volume of peri-ventricular white matter hyperintensities |
112 | 24486 | 2-3 | Total volume of deep white matter hyperintensities |
162 | 31060 | NA | Left ventricular ejection fraction |
162 | 31061 | NA | Left ventricular end diastolic volume |
162 | 31062 | NA | Left ventricular end systolic volume |
162 | 31063 | NA | Left ventricular mass |
162 | 31064 | NA | Left ventricular stroke volume |
162 | 31068 | NA | Right ventricular ejection fraction |
2003 | 41286 | NA | Mother age on date of delivery |
Category | Description |
---|---|
1020 | Derived accelerometry |
100079 | Derived OCT (optical coherence tomography) measures |
301 | Standard PRS |
302 | Enhanced PRS |
107 | Diffusion brain MRI |
119 | Arterial spin labelling brain MRI |
220 | NMR metabolomics |
1039 | Food (and other) preferences |
506 | Paired associate learning |
109 | Susceptibility weighted brain MRI |
100094 | Baseline characteristics |
100117 | Estimated food nutrients yesterday |
100118 | Total weight by food group yesterday |
100016 | Retinal optical coherence tomography |
Note for category 220:
- 168 fields are available in /lustre03/project/6008063/neurohub/UKB/Tabular/672635
- 82 are available in /lustre03/project/6008063/neurohub/UKB/Tabular/RAP
Working with CSV Files
awk is a good, general-purpose tool for slicing and dicing csv files. Simpler and faster, though, is using XSV. To load XSV on Alliance resources:
module load rust
cargo install xsv
export PATH=~/.cargo/bin:$PATH
Imaging
Multiple modalities are available from the UK Biobank and the data can be found on Beluga in the following directory:
/lustre03/project/6008063/neurohub/UKB/Bulk
The following table summarizes the current status on Alliance:
Category | Datafield | Instance | Description | Status |
---|---|---|---|---|
106 | 20217 | 2-3 | MR - Task functional brain MRI - DICOM | Available (raw) |
107 | 20218 | 2-3 | MR - Diffusion brain MRI - DICOM | Available (raw) |
109 | 20219 | 2-3 | MR - Susceptibility weighted brain images - DICOM | Available (raw) |
108 | 20224 | 2-3 | MR - Phoenix - DICOM | Available (raw) |
111 | 20225 | 2-3 | MR - Functional brain images - resting - DICOM | Available (raw) |
111 | 20227 | 2-3 | MR - Resting-state fMRI | Available (raw) |
106 | 20249 | 2-3 | MR - Task fMRI | Available (raw) |
107 | 20250 | 2-3 | MR - Diffusion | Available (raw) |
109 | 20251 | 2-3 | MR - SWI | Available (raw) |
110 | 20252 | 2-3 | MR - T1-weighted | Available (raw) |
112 | 20253 | 2-3 | MR - FLAIR | Available (raw) |
119 | 20266 | 2-3 | MR - Arterial spin labelling brain images - DICOM | Available (raw) |
111 | 25750 | 2-3 | MR - Resting functional MRI full correlation matrix, dimension 25 | Available (raw) |
111 | 25751 | 2-3 | MR - Resting functional MRI full correlation matrix, dimension 100 | Available (raw) |
111 | 25752 | 2-3 | MR - Resting partial correlation matrix, dimension 25 | Available (raw) |
111 | 25753 | 2-3 | MR - Resting partial correlation matrix, dimension 100 | Available (raw) |
111 | 25754 | 2-3 | MR - Resting component amplitudes, dimension 25 | Available (raw) |
111 | 25755 | 2-3 | MR - Resting component amplitudes, dimension 100 | Available (raw) |
The latest data field addition
Category | Datafield | Instance | Description | Status |
---|---|---|---|---|
109 | 20219 | 2-3 | Susceptibility weighted brain images - DICOM | Available (raw) |
119 | 20266 | 2-3 | Arterial spin labelling brain images - DICOM | Available (raw) |
Physical measures
This category contains information from physical measurements done at the Assessment Centre. Currently, the following data field is available on Beluga:
/lustre03/project/6008063/neurohub/UKB/Bulk/20205
Category | Datafield | Instance | Description | Status |
---|---|---|---|---|
104 | 20205 | 2-3 | ECG at rest | Available (raw) |
Physical activity
This category provides measurements recorded via a wrist-worn accelerometer. The main data collection (for 100,000 participants) was between June 2013 and January 2016. In 2018, a subset of participants was asked to repeat the exercise up to four times each on a quarterly basis to examine the influence of seasonal effects on the measurements. These seasonal repeats are currently ongoing. The following data fields are currently available on Beluga:
/lustre03/project/6008063/neurohub/UKB/Bulk/9000x
Category | Datafield | Instance | Description | Status |
---|---|---|---|---|
727 | 90001 | 0-1-2-3-4 | Acceleration data (cwa, raw format) | Available (raw) |
727 | 90004 | NA | Acceleration intensity time-series (Epoch) | Available (raw) |
Genetics
Multiple types of genetics data were acquired from UKB subjects.
You can find the data in the Genotype_Results/ and Imputation/ directories in:
/lustre03/project/6008063/neurohub/UKB/Bulk
The following sections summarize the current status on Alliance:
Genotype
Category | Datafield | Description | Status |
---|---|---|---|
100313 | 22002 | Genotyping process and sample QC - CEL Files | Available |
100315 | 22418 | Calls | Available |
100315 | 22419 | Genotype confidences | Available |
100315 | 22437 | Copy number variants B-allele frequencies | Available |
100315 | 22431 | Copy number variants, log2ratios | Available |
100315 | 22430 | Intensities | Available |
100319 | 22438 | Haplotypes | Available |
100319 | 22828 | Imputation from genotype (WTCHG) | Available |
Category | Description | Status |
---|---|---|
263 | Genotypes | Available |
The category above is available at the following path:
/lustre03/project/6008063/neurohub/UKB/Bulk/488127/ukb_rel_a45551_s488127.dat
Exome
You can find the Exome data in the genetics/ directory:
/lustre03/project/6008063/neurohub/ukbb/genetics/exome
Category | Datafield | Description | Status |
---|---|---|---|
171 | 23151 | Variant call files | Available |
171 | 23152 | Variant call files indices | Available |
171 | 23153 | CRAM files | Available |
171 | 23154 | CRAM indices | Available |
171 | 23155 | Population-level variants (PLINK) | Available |
171 | 23156 | Population-level variants (pVCF) | Available |
Preprocessed data
NeuroHub users have access to the following types of preprocessed data:
1. Diffusion-weighted imaging
The data are processed with Tractoflow and available on Beluga at the following path:
/lustre03/project/rpp-aevans-ab/neurohub/UKB/Derived/tractoflow_out
Documentation on the TractoFlow UKBiobank process is available for your reference.
2. CIVET output
The CIVET outputs were created from the MINC files of the roughly 43,000 UK Biobank subjects, corresponding to anatomical T1s.
Documentation on the CIVET output processing of the UK Biobank data through CBRAIN and the LORIS DQT is available for your reference.
3. CIVET manual QC (Quality Control)
Thanks to the work of the Computational Brain Anatomy Laboratory at the Douglas Institute on manual quality control (QC) of CIVET outputs for 40,000 UK Biobank participants, the CIVET QC data are now available through CBRAIN and will soon be available on the LORIS DQT. Documentation on CIVET manual QC access is available for your reference.
4. UKBiobank fMRIprep outputs
All UKBiobank fMRIprep outputs are now available in CBRAIN and on Beluga. More information about the UKBiobank fMRIprep outputs is available for your reference.
5. T1w and FLAIR scans processed by NIST-MNI minipipeline
The outputs are anatomical MRI scans from the UK Biobank imaging data, processed with in-house tools based on minc-toolkit-v2 version 1.9.17.
The files are available in Beluga at the following path:
/lustre03/project/6008063/neurohub/UKB/Derived/vfonov/out
Documentation about the NIST-MNI-minipipeline is available for your reference.
Food Preference Questionnaire and Paired Associate Learning
Category 1039
The Food Preference Questionnaire fields are available as a separate CSV table. The questionnaire includes 150 items, which comprise food items that reflect both sensory preferences (bitter, sweet etc.) and foodstuff preferences (fruit, vegetables, meat, etc.). More information can be found here.
Category 506
In addition, the Paired Associate Learning fields are available as a separate CSV table. In the test, participants are shown 12 pairs of words and then, after an interval, are presented with the first word of 10 of these pairs and asked to select the matching second word from a choice of 4 alternatives. More information can be found here.
Both categories are available at the following path on Beluga:
/lustre03/project/6008063/neurohub/UKB/Tabular/RAP
Status Codes
Status | Meaning |
---|---|
Available | Data is accessible to NeuroHub users |
Available (raw) | The raw data is available, but derivatives may not be. |
Deploying | Data will soon be available. |
Fetching | Data is currently being transferred to Beluga |
Processing | Data is undergoing processing (e.g. format, structure). |
Not on Beluga | Data has not been downloaded. |
Requests
Are there new datafields or return datasets that you'd like to be made available? Send us an email @ [email protected] with a text file with the following information in a single column:
- Datafields, prepended with 'F' (e.g.: "F20252")
- Return datasets, prepended with 'R' (e.g.: "R123")
- SNP IDs prepended with 'S' (e.g. "S456").
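A request file in the single-column layout described above can be generated with a few lines of shell. The following is a minimal sketch; the function name make_request is hypothetical, and the IDs are the ones used as examples in the list above.

```shell
# Write a single-column request list: datafields get an 'F' prefix,
# return datasets an 'R' prefix, and SNP IDs an 'S' prefix.
# Usage: make_request "<datafields>" "<return datasets>" "<SNP IDs>"
# Each argument is a space-separated list and may be empty.
make_request() {
    for id in $1; do printf 'F%s\n' "$id"; done
    for id in $2; do printf 'R%s\n' "$id"; done
    for id in $3; do printf 'S%s\n' "$id"; done
}

# Prints F20252, R123 and S456, one per line.
make_request "20252" "123" "456"
```

Redirect the output to a text file of your choice and attach that file to the email.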
Is there bulk data that is already authorized but unavailable? Let us know @ [email protected]; we prioritize data with community interest.