Run PCA by using loadings

MODULE STATUS

Developers: Haotian Tang and Josine Min

Scripts status: In development

Prerequisite scripts: 00-setup_folders.sh, 01-check_data.sh

Data upload method: email

Background

Principal Component Analysis (PCA) is performed in script 02a, using only cohort-specific data. Some cohorts may exhibit a V-shaped or poorly clustered distribution of samples when plotting PC1 vs. PC2. To investigate this further, this module uses PCA loadings derived from unrelated, high-quality samples in gnomAD v3.1.2. The SNPs used for the loadings are based on the Global Biobank PCA loadings, with stringent LD pruning (r² = 0.1 and window = 10,000,000) and exclusion of high-LD regions. This module requires the Hail Python package.

The following files are required to conduct this module:

hgdp_tgp_unrel_pass_filtGBMI_strictpruned_scores.tsv
release_3.1.2_vcf_genomes_gnomad.genomes.v3.1.2.hgdp_1kg_subset_sample_meta.tsv.bgz
references_grch37_to_grch38.over.chain.gz
hgdp_tgp_unrel_pass_filtGBMI_strictpruned_loadings.tsv
hail_env.yml

To download these files, please run under your godmc_phase2 repository:

git fetch origin
git checkout pca_loadings
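
After switching branches, it may be worth confirming that all five files are present before continuing. The sketch below is only illustrative: `check_required_files` is a hypothetical helper name, and it assumes that all of the files sit in the same directory as `hail_env.yml` (`./resources/genetics`), which this page only states for the environment file. Adjust the directory if your checkout differs.

```shell
# Hypothetical helper: report which required files are missing from a directory.
# Assumption: all five files live in one directory (e.g. ./resources/genetics).
# The return code is the number of missing or empty files (0 means all present).
check_required_files() {
    dir="$1"
    missing=0
    for f in \
        hgdp_tgp_unrel_pass_filtGBMI_strictpruned_scores.tsv \
        release_3.1.2_vcf_genomes_gnomad.genomes.v3.1.2.hgdp_1kg_subset_sample_meta.tsv.bgz \
        references_grch37_to_grch38.over.chain.gz \
        hgdp_tgp_unrel_pass_filtGBMI_strictpruned_loadings.tsv \
        hail_env.yml
    do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Example call from the godmc_phase2 directory:
# check_required_files ./resources/genetics && echo "all files present"
```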

To set up your environment for running Hail, choose one of the following options:

Option 1: Create an environment for Hail using mamba/conda

We recommend using mamba/conda to create the environment (see mamba installation guide and conda installation guide). In your godmc_phase2 directory, please run:

mamba env create -f ./resources/genetics/hail_env.yml

This environment will use ~4.1 GB of storage. If using conda, replace mamba with conda in the command.

Option 2: Use your own Python 3 environment instead of a mamba/conda environment

Ensure that your Python 3 is version 3.9 or later and specify the Python 3 path in your config file. Then install Hail by following the Hail installation instructions.
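
For Option 2, the version requirement can be checked from the shell. The following is a minimal sketch, not part of the GoDMC scripts: `version_at_least` is a hypothetical helper that relies on `sort -V` (version sort, available in GNU coreutils), and `python3` stands in for whichever interpreter path you put in your config file.

```shell
# Hypothetical helper: succeed if version $1 is at least the minimum version $2.
# Relies on `sort -V`, which orders version strings numerically (3.9 before 3.10).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: python3 is the interpreter you intend to use for Hail.
if command -v python3 >/dev/null 2>&1; then
    found="$(python3 -c 'import platform; print(platform.python_version())')"
    if version_at_least "$found" 3.9; then
        echo "Python $found meets the 3.9 requirement"
    else
        echo "Python $found is older than 3.9"
    fi
fi
```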

Then, to check if Hail has been installed successfully in the environment:

mamba activate hail_env
pip list | grep hail

The output should look like this:

hail 0.2.135

If Hail is not listed, please run:

pip install hail

Finally, to run PCA by using loadings, execute the following script:

./01b-pca_loadings.sh

Running this script on genetic data with ~3,000 samples and ~6.8 million variants takes approximately 10 minutes and uses about 65 GB of memory.
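
Given the memory figure above, a quick check before launching the script may save a failed run. This is a rough sketch under several assumptions: it only works on Linux (it reads `MemAvailable` from `/proc/meminfo`), `has_enough_memory` is a hypothetical helper, and the 65 GB threshold is simply the benchmark quoted above, so datasets of a different size may need more or less. On a cluster, the memory you request from the scheduler is what matters rather than the node total.

```shell
# Hypothetical helper: succeed if $1 (available memory in kB, as reported by
# /proc/meminfo) is at least $2 GB, with 1 GB counted as 1024 * 1024 kB.
has_enough_memory() {
    [ "$1" -ge $(($2 * 1024 * 1024)) ]
}

# Linux only: MemAvailable is an estimate of memory usable without swapping.
if [ -r /proc/meminfo ]; then
    mem_kb="$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)"
    if [ -n "$mem_kb" ]; then
        if has_enough_memory "$mem_kb" 65; then
            echo "At least 65 GB of memory appears to be available"
        else
            echo "Less than 65 GB of memory appears to be available"
        fi
    fi
fi
```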

Once PCA is complete, please compress and encrypt the results:

source config
cd ${home_directory}
tar -zcf results/01/${study_name}_pcaloadings.tgz results/01/${study_name}_globalPCA.png results/01/logs_b/
gpg --output results/01/${study_name}_pcaloadings.tgz.gpg --symmetric --cipher-algo AES256 results/01/${study_name}_pcaloadings.tgz
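
Before sending the file, you may want to confirm that the archive actually contains the PCA plot. The sketch below is optional and illustrative: `archive_has_entry` is a hypothetical helper built on `tar -tzf`, which lists the entries of a gzipped tar archive without extracting them.

```shell
# Hypothetical helper: succeed if the gzipped tar archive $1 contains
# an entry whose path is exactly $2.
archive_has_entry() {
    entries="$(tar -tzf "$1")" || return 1
    printf '%s\n' "$entries" | grep -Fxq -- "$2"
}

# Example use after the commands above (run from ${home_directory}):
# archive_has_entry "results/01/${study_name}_pcaloadings.tgz" \
#     "results/01/${study_name}_globalPCA.png" && echo "plot is in the archive"
#
# To confirm that the encrypted file can be decrypted with your passphrase,
# list its contents without writing anything to disk:
# gpg --decrypt "results/01/${study_name}_pcaloadings.tgz.gpg" | tar -tzf -
```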

Please download the gpg file from the `results/01` folder to your local machine and share it with Haotian Tang via email ([email protected]), along with your encryption passphrase.