5.4. Processing Data directly on Beluga - neurohub/neurohub_documentation GitHub Wiki
This is a kind reminder that NeuroHub users are bound by the collaborator agreement regarding the UK Biobank data.
During a transition stage, the UK Biobank allows researchers to analyze data that has already been downloaded within their own secure environments, under the terms of their existing Material Transfer Agreement. Thanks to the agreement between NeuroHub and UK Biobank, NeuroHub users have direct access to the data through three channels:
- The CBRAIN portal, where you can visualize neuroimaging data and process them directly on the platform using integrated tools.
- The UK Biobank LORIS DQT, where you can run queries by category and data field of interest and also process the data directly in CBRAIN.
- The Beluga cluster of the Digital Research Alliance of Canada.
The currently available data can be found in our GitHub wiki.
Please be advised that the UK Biobank’s data will not be updated until Q4 2024.
More information can be found on the UK Biobank community support.
To avoid unnecessary multiple downloads, we strongly encourage users to access and process the data with NeuroHub via CBRAIN, LORIS, or on Beluga, and NOT to DOWNLOAD the data to their local computer. Working this way saves you time and disk space, avoids the hassle of heavy downloads, and makes processing faster, with your analyses directly available on the cluster.
Recommendations on how you can process and analyze UK Biobank data on Beluga without the need to download the data
- To help you navigate and process data on a cluster, see the following command-line resources and tutorials:
  - Digital Research Alliance of Canada training calendar
  - Alliance wiki page
  - Tools available on Beluga
- Here is an example of how to process Exome (category 171) data directly on Beluga using the plink tool with the option --show-tags all:
- Go to the following path
[xmpham@beluga3 ~]$ cd /lustre03/project/6008063/neurohub/ukbb
[xmpham@beluga3 ukbb]$ ls
bulk example_apptainer.sif imaging README ukbm
covid19 example_singularity.sif imaging_data_transfer_instructions.txt scripts withdrawals
derivatives genetics new tabular
[xmpham@beluga3 ukbb]$ cd genetics/
[xmpham@beluga3 genetics]$ ls
cal exome hap imp int l2r new_genotype_results new_imputation
[xmpham@beluga3 genetics]$ cd exome/
[xmpham@beluga3 exome]$ ls
cram pop_variants_plink pop_variants_pvcf README.txt vcf
For the purpose of this example, we will go to the pop_variants_plink folder.
[xmpham@beluga3 exome]$ cd pop_variants_plink/
[xmpham@beluga3 pop_variants_plink]$ ls
ukb23155_c10_b0_v1.bed ukb23155_c21_b0_v1.bed UKBexomeOQFE_chr10.bim
Xxxxxxx xxxxxxxx
ukb23155_c20_b0_v1.bed ukb23155_cY_b0_v1.bed UKBexomeOQFE_chrX.bim
ukb23155_c20_b0_v1_s200632.fam ukb23155_cY_b0_v1_s200632.fam UKBexomeOQFE_chrY.bim
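Since plink needs a matching .bed, .bim and .fam file for each chromosome, it can help to confirm that all three are present before running anything. The following is a minimal sketch, not an official NeuroHub script. It assumes that every chromosome follows the file naming visible in the listing above, and `has_plink_trio` is a hypothetical helper name introduced only for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: check that the .bed/.bim/.fam trio exists for a chromosome.
# Assumption: file names follow the pattern seen in the directory listing.

# has_plink_trio DIR CHR  (hypothetical helper, not part of plink)
# Returns 0 only if all three files are present and readable.
has_plink_trio() {
    local dir="$1" chr="$2"
    [ -r "${dir}/ukb23155_c${chr}_b0_v1.bed" ] &&
    [ -r "${dir}/UKBexomeOQFE_chr${chr}.bim" ] &&
    [ -r "${dir}/ukb23155_c${chr}_b0_v1_s200632.fam" ]
}

# Check the chromosomes that appear in the listing, from the current directory.
for chr in 10 20 21 X Y; do
    if has_plink_trio "." "$chr"; then
        echo "chr${chr}: complete"
    else
        echo "chr${chr}: incomplete"
    fi
done
```

Run it from inside the pop_variants_plink folder. It only reads file metadata, so it does not copy or modify any data.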
- Run the plink command
[xmpham@beluga3 pop_variants_plink]$ plink --bed ukb23155_cY_b0_v1.bed --fam ukb23155_cY_b0_v1_s200632.fam --maf 0.05 --bim UKBexomeOQFE_chrY.bim --show-tags all --out $HOME/plink.out
[mii] loading StdEnv/2020 plink/1.9b_6.21-x86_64 ...
Due to MODULEPATH changes, the following have been reloaded:
1) mii/1.1.2
The following have been reloaded with a version change:
1) StdEnv/2023 => StdEnv/2020 5) libfabric/1.18.0 => libfabric/1.10.1
2) gcccore/.12.3 => gcccore/.9.3.0 6) openmpi/4.1.5 => openmpi/4.0.3
3) gentoo/2023 => gentoo/2020 7) ucx/1.14.1 => ucx/1.8.0
4) imkl/2023.2.0 => imkl/2020.1.217
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /home/xmpham/plink.out.log.
Options in effect:
--bed ukb23155_cY_b0_v1.bed
--bim UKBexomeOQFE_chrY.bim
--fam ukb23155_cY_b0_v1_s200632.fam
--maf 0.05
--out /home/xmpham/plink.out
--show-tags all
192035 MB RAM detected; reserving 96017 MB for main workspace.
Allocated 7209 MB successfully, after larger attempt(s) failed.
6661 variants loaded from .bim file.
200643 people (90020 males, 110438 females, 185 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /home/xmpham/plink.out.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 200643 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 338721 het. haploid genotypes present (see /home/xmpham/plink.out.hh
); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.992495.
6637 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
24 variants and 200643 people pass filters and QC.
Note: No phenotypes present.
--show-tags all: Report written to /home/xmpham/plink.out.tags.list .
[xmpham@beluga3 pop_variants_plink]$
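The log that plink writes next to the report repeats the summary line shown in the transcript ("... variants and ... people pass filters and QC."). As a rough sketch, the two counts can be pulled out of the log with `sed`. The helper name `plink_pass_counts` is hypothetical, and the log path assumes the same `--out $HOME/plink.out` prefix used in the example above.

```shell
#!/usr/bin/env bash
# Sketch: extract "variants" and "people" counts from a plink 1.9 log.
# Assumption: the log contains the summary line seen in the transcript.

# plink_pass_counts LOGFILE  (hypothetical helper)
# Prints "<variants> <people>" for the line reporting what passed filters.
plink_pass_counts() {
    sed -n 's/^\([0-9][0-9]*\) variants and \([0-9][0-9]*\) people pass filters and QC\.$/\1 \2/p' "$1"
}

# Only read the log if it exists (path assumes --out $HOME/plink.out).
log="${HOME}/plink.out.log"
if [ -f "$log" ]; then
    plink_pass_counts "$log"
fi
```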
The report (plink.out.tags.list) will be available in your home directory.
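To repeat the same analysis for another chromosome, only the three input file names and the output prefix change. The sketch below builds the command line without running it, so you can review it first. It assumes that all chromosomes follow the file naming seen for chromosome Y, and both `build_plink_cmd` and the default output prefix `$HOME/plink_chr<CHR>` are hypothetical choices made for this illustration. The plink options are the ones used in the example above.

```shell
#!/usr/bin/env bash
# Sketch: build (but do not run) the plink command for one chromosome.
# Assumption: every chromosome uses the naming pattern seen for chrY.

EXOME_DIR="/lustre03/project/6008063/neurohub/ukbb/genetics/exome/pop_variants_plink"

# build_plink_cmd CHR [OUT_PREFIX]  (hypothetical helper, not part of plink)
# Output goes to your home directory by default, never to the shared data folder.
build_plink_cmd() {
    local chr="$1"
    local out="${2:-$HOME/plink_chr${chr}}"
    printf 'plink --bed %s --bim %s --fam %s --maf 0.05 --show-tags all --out %s\n' \
        "${EXOME_DIR}/ukb23155_c${chr}_b0_v1.bed" \
        "${EXOME_DIR}/UKBexomeOQFE_chr${chr}.bim" \
        "${EXOME_DIR}/ukb23155_c${chr}_b0_v1_s200632.fam" \
        "$out"
}

# Print the command for chromosome Y; review it, then run it yourself.
build_plink_cmd Y
```

Using absolute input paths means the command can be launched from any directory, while the results still land under your own home directory.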
However, if you have downloaded data from Beluga, please do not leave it on any device that is publicly accessible. Please make sure to delete it locally after you've analyzed it.