5.4. Processing Data Directly on Beluga

This is a kind reminder that NeuroHub users are bound by the collaborator agreement regarding the UK Biobank data.

During the current transition stage, UK Biobank allows researchers to analyze data that has already been downloaded within their own secure environments, under the terms of their existing Material Transfer Agreement. Thanks to the agreement between NeuroHub and UK Biobank, NeuroHub users can access the data directly through three channels:

  1. The CBRAIN portal, where you can visualize neuroimaging data and process it directly on the platform using integrated tools.
  2. The UK Biobank LORIS DQT, where you can run queries by category and data field of interest and also process the data directly in CBRAIN.
  3. The Beluga cluster of the Alliance, where you can process the data directly on the cluster.

The currently available data are listed in our GitHub wiki.

Please be advised that the UK Biobank’s data will not be updated until Q4 2024.

More information can be found on the UK Biobank community support site.

To avoid unnecessary repeated downloads, we strongly encourage users to access and process the data through NeuroHub via CBRAIN, LORIS, or Beluga, and NOT to DOWNLOAD the data to a local computer. This saves you time and disk space, spares you a heavy download, and lets you process the data faster, with your analyses available directly on the cluster.

How to process and analyze UK Biobank data on Beluga without downloading it

  1. Go to the UK Biobank data directory and browse to the dataset you need:
[xmpham@beluga3 ~]$ cd /lustre03/project/6008063/neurohub/ukbb
[xmpham@beluga3 ukbb]$ ls
bulk         example_apptainer.sif    imaging                                 README   ukbm
covid19      example_singularity.sif  imaging_data_transfer_instructions.txt  scripts  withdrawals
derivatives  genetics                 new                                     tabular
[xmpham@beluga3 ukbb]$ cd genetics/
[xmpham@beluga3 genetics]$ ls
cal  exome  hap  imp  int  l2r  new_genotype_results  new_imputation
[xmpham@beluga3 genetics]$ cd exome/
[xmpham@beluga3 exome]$ ls
cram  pop_variants_plink  pop_variants_pvcf  README.txt  vcf 

For this example, we will use the pop_variants_plink folder.

[xmpham@beluga3 exome]$ cd pop_variants_plink/
[xmpham@beluga3 pop_variants_plink]$ ls
ukb23155_c10_b0_v1.bed          ukb23155_c21_b0_v1.bed          UKBexomeOQFE_chr10.bim
Xxxxxxx xxxxxxxx
ukb23155_c20_b0_v1.bed          ukb23155_cY_b0_v1.bed           UKBexomeOQFE_chrX.bim
ukb23155_c20_b0_v1_s200632.fam  ukb23155_cY_b0_v1_s200632.fam   UKBexomeOQFE_chrY.bim

  2. Run the plink command, writing the output to your home directory:

[xmpham@beluga3 pop_variants_plink]$ plink --bed ukb23155_cY_b0_v1.bed --fam ukb23155_cY_b0_v1_s200632.fam --maf 0.05 --bim UKBexomeOQFE_chrY.bim   --show-tags all   --out $HOME/plink.out
[mii] loading StdEnv/2020 plink/1.9b_6.21-x86_64 ...

Due to MODULEPATH changes, the following have been reloaded:
  1) mii/1.1.2

The following have been reloaded with a version change:
  1) StdEnv/2023 => StdEnv/2020           5) libfabric/1.18.0 => libfabric/1.10.1
  2) gcccore/.12.3 => gcccore/.9.3.0      6) openmpi/4.1.5 => openmpi/4.0.3
  3) gentoo/2023 => gentoo/2020           7) ucx/1.14.1 => ucx/1.8.0
  4) imkl/2023.2.0 => imkl/2020.1.217

PLINK v1.90b6.21 64-bit (19 Oct 2020)          www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to /home/xmpham/plink.out.log.
Options in effect:
  --bed ukb23155_cY_b0_v1.bed
  --bim UKBexomeOQFE_chrY.bim
  --fam ukb23155_cY_b0_v1_s200632.fam
  --maf 0.05
  --out /home/xmpham/plink.out
  --show-tags all

192035 MB RAM detected; reserving 96017 MB for main workspace.
Allocated 7209 MB successfully, after larger attempt(s) failed.
6661 variants loaded from .bim file.
200643 people (90020 males, 110438 females, 185 ambiguous) loaded from .fam.
Ambiguous sex IDs written to /home/xmpham/plink.out.nosex .
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 200643 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Warning: 338721 het. haploid genotypes present (see /home/xmpham/plink.out.hh
); many commands treat these as missing.
Warning: Nonmissing nonmale Y chromosome genotype(s) present; many commands
treat these as missing.
Total genotyping rate is 0.992495.
6637 variants removed due to minor allele threshold(s)
(--maf/--max-maf/--mac/--max-mac).
24 variants and 200643 people pass filters and QC.
Note: No phenotypes present.
--show-tags all: Report written to /home/xmpham/plink.out.tags.list .
[xmpham@beluga3 pop_variants_plink]$ 
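
The session above runs PLINK on chromosome Y only. As a rough sketch, the same filter could be repeated for other chromosomes with a small loop. The file-name pattern is inferred from the chrY files in the listing and is an assumption for any chromosome not shown there. OUT_DIR, run_plink and the RUN switch are hypothetical names introduced for illustration. By default the script only prints the commands so that you can check them first.

```shell
#!/usr/bin/env bash
# Sketch: repeat the chrY PLINK filter for several chromosomes in place,
# reading the shared copy on Beluga instead of downloading anything.
# Assumption: all chromosomes follow the naming pattern seen for chrY.

DATA_DIR="${DATA_DIR:-/lustre03/project/6008063/neurohub/ukbb/genetics/exome/pop_variants_plink}"
OUT_DIR="${OUT_DIR:-$HOME/plink_results}"   # hypothetical output folder

# Build the PLINK arguments for one chromosome label (e.g. 20 or Y).
# Prints the command unless RUN=1, in which case it executes it.
run_plink() {
    label="$1"
    set -- \
        --bed "$DATA_DIR/ukb23155_c${label}_b0_v1.bed" \
        --bim "$DATA_DIR/UKBexomeOQFE_chr${label}.bim" \
        --fam "$DATA_DIR/ukb23155_c${label}_b0_v1_s200632.fam" \
        --maf 0.05 --show-tags all \
        --out "$OUT_DIR/chr${label}"
    if [ "${RUN:-0}" = "1" ]; then
        mkdir -p "$OUT_DIR" && plink "$@"
    else
        echo "plink $*"
    fi
}

# Chromosome labels to process; adjust to the files you actually need.
for chr in 20 Y; do
    run_plink "$chr"
done
```

Once the printed commands look right, rerun the script with RUN=1. This assumes plink is already on your PATH.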

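The report is written to your home directory. In this run it is /home/xmpham/plink.out.tags.list, next to the .log, .nosex and .hh files named in the log.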
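
To check what a run actually produced, a short helper along these lines could list each expected output file with its line count. It is only a sketch: report_outputs is a hypothetical name, and the four extensions are simply the ones that appear in the log of the example run.

```shell
#!/usr/bin/env bash
# Sketch: summarize the files PLINK wrote for one --out prefix.
# The extensions below are those named in the example log
# (.log, .nosex, .hh, .tags.list); other runs may write different files.

report_outputs() {
    prefix="$1"
    for ext in log nosex hh tags.list; do
        f="${prefix}.${ext}"
        if [ -f "$f" ]; then
            # tr strips the padding some wc implementations add
            echo "$f: $(wc -l < "$f" | tr -d ' ') lines"
        else
            echo "$f: not found"
        fi
    done
}

# Prefix used in the example run (--out $HOME/plink.out)
report_outputs "$HOME/plink.out"
```

Pass a different prefix to report_outputs if you changed the --out value.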

If you have nevertheless downloaded data from Beluga, do not leave it on any publicly accessible device, and delete your local copy once you have finished analyzing it.