Run PCA by using loadings
MODULE STATUS
Developers: Haotian Tang and Josine Min
Scripts status: In development
Prerequisite scripts: 00-setup_folders.sh, 01-check_data.sh
Data upload method: email
Background
Principal Component Analysis (PCA) is performed in script 02a, using only cohort-specific data. Some cohorts may exhibit a V-shaped or poorly clustered distribution of samples when plotting PC1 vs. PC2. To investigate this further, this module uses PCA loadings derived from unrelated, high-quality samples in gnomAD v3.1.2. The SNPs used for the loadings are based on the Global Biobank PCA loadings, with stringent LD pruning (r² = 0.1 and window = 10,000,000) and exclusion of high-LD regions. This module requires the Hail Python package.
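To illustrate what "running PCA by using loadings" means, here is a minimal pure-Python sketch of projecting one sample onto precomputed loadings. It assumes the common normalisation in which each genotype is centred by twice the reference allele frequency and scaled by the binomial standard deviation; the function name `project_sample` and its inputs are hypothetical, and the actual computation in 01b-pca_loadings.sh may differ (for example by a constant scale factor).

```python
import math


def project_sample(alt_counts, allele_freqs, loadings):
    """Project one sample onto precomputed principal components.

    alt_counts[i]   : alternate-allele count (0, 1 or 2) at variant i, or None if missing
    allele_freqs[i] : reference-panel alternate-allele frequency at variant i
    loadings[i]     : list with one loading per PC for variant i
    """
    n_pcs = len(loadings[0])
    scores = [0.0] * n_pcs
    for g, p, w in zip(alt_counts, allele_freqs, loadings):
        # Skip missing genotypes and variants that are monomorphic in the reference.
        if g is None or not 0.0 < p < 1.0:
            continue
        # Centre by the expected allele count and scale by its standard deviation.
        g_norm = (g - 2.0 * p) / math.sqrt(2.0 * p * (1.0 - p))
        for k in range(n_pcs):
            scores[k] += w[k] * g_norm
    return scores


if __name__ == "__main__":
    # Toy input: three variants, two PCs, one missing genotype.
    print(project_sample([2, 0, None], [0.5, 0.5, 0.5],
                         [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

Because the loadings come from the reference panel rather than from the cohort itself, the resulting scores place cohort samples in the same PC space as the reference samples, which is what makes the comparison with the cohort-specific PCA informative.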
The following files are required to conduct this module:
hgdp_tgp_unrel_pass_filtGBMI_strictpruned_scores.tsv
release_3.1.2_vcf_genomes_gnomad.genomes.v3.1.2.hgdp_1kg_subset_sample_meta.tsv.bgz
references_grch37_to_grch38.over.chain.gz
hgdp_tgp_unrel_pass_filtGBMI_strictpruned_loadings.tsv
hail_env.yml
To download these files, run the following in your godmc_phase2 repository:
git fetch origin
git checkout pca_loadings
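After checking out the branch, a quick check that all five files are present can save a failed run later. The sketch below assumes the files sit together in ./resources/genetics (the location used for hail_env.yml in the next step); that directory is an assumption for the other four files, and `missing_files` is a hypothetical helper, so adjust the path if your checkout differs.

```python
from pathlib import Path

# File names taken from the list of required files for this module.
REQUIRED_FILES = [
    "hgdp_tgp_unrel_pass_filtGBMI_strictpruned_scores.tsv",
    "release_3.1.2_vcf_genomes_gnomad.genomes.v3.1.2.hgdp_1kg_subset_sample_meta.tsv.bgz",
    "references_grch37_to_grch38.over.chain.gz",
    "hgdp_tgp_unrel_pass_filtGBMI_strictpruned_loadings.tsv",
    "hail_env.yml",
]


def missing_files(directory):
    """Return the required file names that are not found in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    # Assumed location; only hail_env.yml is confirmed to live here.
    missing = missing_files("./resources/genetics")
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All required files found.")
```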
To set up your environment for running Hail, choose one of the following options:
Option 1: Create an environment for Hail using mamba/conda
We recommend using mamba/conda to create the environment (see mamba installation guide and conda installation guide). In your godmc_phase2 directory, please run:
mamba env create -f ./resources/genetics/hail_env.yml
This environment will use ~4.1 GB of storage. If using conda, replace mamba with conda in the command.
Option 2: If you use your own Python 3 environment instead of a conda/mamba environment, ensure it is version 3.9 or later and specify the Python 3 path in your config file. Then install Hail by following the Hail installation instructions.
To check whether Hail has been installed successfully in the environment:
mamba activate hail_env
pip list | grep hail
It should show a line such as: hail 0.2.135
Otherwise, please run:
pip install hail
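The same check can be done from inside Python, which is useful when the interpreter is the one named in your config file rather than an activated mamba environment. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package="hail"):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("hail is not installed in this environment")
    else:
        print(f"hail {version}")
```

Run it with the same Python 3 executable that the pipeline will use, so that the reported version reflects the environment in which the script actually runs.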
Finally, to run PCA using the loadings, execute the following script:
./01b-pca_loadings.sh
Running this script on genetic data with ~3,000 samples and ~6.8 million variants takes approximately 10 minutes with 65 GB of memory.
Once PCA is complete, please compress and encrypt the results:
source config
cd ${home_directory}
tar -zcf results/01/${study_name}_pcaloadings.tgz results/01/${study_name}_globalPCA.png results/01/logs_b/
gpg --output results/01/${study_name}_pcaloadings.tgz.gpg --symmetric --cipher-algo AES256 results/01/${study_name}_pcaloadings.tgz
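Before sending the encrypted file, it may be worth confirming that the unencrypted archive really contains the plot and the log directory. The following sketch reads the .tgz with Python's standard `tarfile` module; `has_expected_results` is a hypothetical helper, and the member paths it looks for are those given in the tar command above.

```python
import tarfile


def archive_members(archive_path):
    """Return the member names stored in a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()


def has_expected_results(archive_path, study_name):
    """Check that the archive holds the global PCA plot and the logs_b directory."""
    names = archive_members(archive_path)
    plot = f"results/01/{study_name}_globalPCA.png"
    logs_dir = "results/01/logs_b"
    has_plot = plot in names
    # Accept either the directory entry itself or any file stored beneath it.
    has_logs = any(n == logs_dir or n.startswith(logs_dir + "/") for n in names)
    return has_plot and has_logs


if __name__ == "__main__":
    import sys

    # Usage: python check_archive.py <path to .tgz> <study_name>
    archive, study = sys.argv[1], sys.argv[2]
    print("OK" if has_expected_results(archive, study) else "Archive is incomplete")
```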