Scripts - aandaleon/Ad_PX_pipe GitHub Wiki

If the user formatted their data correctly, almost everything can be run with default. If the scripts do not work, the user should format their data correctly according to formats delineated here. All data folders/hierarchies, as well as file names, should be formatted exactly like those in /home/angela/Ad_PX_pipe/. Look in there and in the wiki before asking me what the input/output should look like.

00_simulate_pheno_covar.R

  • Purpose: make randomized phenotypes and covariates
  • Input: PLINK binary genotype file
  • Output: phenotype file, with IDs, phenotype file, without IDs, covariate file, with IDs, covariate file, without IDs, and phenotype names
  • Example: Rscript 00_simulate_pheno_covar.R --bfile AMR
  • Options:
    • (--bfile) PLINK binary genotype file

02_related_matrix_PCs.R

  • Purpose: make relatedness matrix and calculate principal components
  • Input: PLINK binary genotype file
  • Output: relatedness matrix, without IDs, and principal components
  • Example: Rscript 02_relate_matrix_PCs.R --bfile AMR
  • Options:
    • (--bfile) PLINK binary genotype file
    • (--king; default = /home/angela/px_his_chol/KING_LAPACK/king-offline) KING executable

05a_PrediXcan_dosages_to_GEMMA.py

  • Purpose: convert PrediXcan dosages to GEMMA BIMBAM genotypes
  • Input: PrediXcan dosages, in the folder dosages/
  • Output: GEMMA BIMBAM, in the folder BIMBAM/, and GEMMA SNP annotation, in the folder anno/
  • Example: python 05a_PrediXcan_dosages_to_GEMMA.py --dosage_suffix .txt.gz
  • Options:
    • (--dosage_suffix; default = .maf0.01.r20.8.dosage.txt.gz) Suffix of dosages

05b_make_GEMMA_covars.R

  • Purpose: make covariance file from known covariate and KING PCs
  • Input: Number of principal components to include, covariates, without IDs, and principal components
  • Output: GEMMA covariates
  • Example: Rscript 05b_make_GEMMA_covars.R --pcs_num 5
  • Options:
    • (--pcs_num; default = 5) Number of principal components to include in covariates

05c_GEMMA_wrapper.py

  • Purpose: run GEMMA in either a loop of chromosomes (SNPs, --GWAS) or tissues (genes, --pred_exp)
  • Input: GEMMA BIMBAM in BIMBAM/ or PrediXcan pseudo-genotype in pred_exp_GEMMA/, GEMMA SNP annotation in anno/, relatedness matrix without IDs, phenotypes without IDs, GEMMA covariates, phenotype names, and flag for GWAS or predicted expression analysis
  • Output: Linear mixed model output in output/
  • Examples:
    • python 05c_GEMMA_wrapper.py --GWAS
    • python 05c_GEMMA_wrapper.py --pred_exp
  • Options:
    • (--GWAS) For GWAS-GEMMA
    • (--pred_exp) For “hacked” GEMMA; run only one of these two flags at a time

06_make_pred_exp.py

  • Purpose: calculate predicted gene expressions in PrediXcan using GTEx and MESA models
  • Input: PrediXcan dosages in dosages/
  • Output: Predicted expression files in pred_exp/
  • Example: python 06_make_pred_exp.py
  • Options:
    • (--PrediXcan_path; default = /usr/local/bin/PrediXcan.py) Path to PrediXcan executable
    • (--MESA_prefix; default = /home/lauren/files_for_revisions_plosgen/en_v7/dbs/) Prefix of all MESA models
    • (--MESA_suffix; default = _imputed_10_peer_3_pcs_2.db) Suffix of all MESA models
    • (--GTEx_prefix; default = /home/wheelerlab3/Data/PrediXcan_db/GTEx-V6p-HapMap-2016-09-08/TW_) Prefix of all GTEx models
    • (--GTEx_suffix; default = _0.5.db) Suffix of all GTEx models

07a_convert_PrediXcan_to_GEMMA.py

  • Purpose: convert predicted expression to GEMMA-style pseudo-genotypes
  • Input: Predicted expression in pred_exp/
  • Output: PrediXcan pseudo-genotype in pred_exp_GEMMA/
  • Example: python 07a_convert_PrediXcan_to_GEMMA.py

08_sig_SNP_sig_gene.py

  • Purpose: find significant SNPs from PrediXcan in R
  • Input: Phenotype names and GEMMA GWAS output in output/
  • Output: Significant SNPs and significant genes
  • Example: python 08_sig_SNP_sig_gene.py --SNP_sig 5e-4 --gene_sig 0.05
  • Options:
    • (--SNP_sig; default = 5e-8) Significance threshold for SNPs
    • (--gene_sig; default = 9.654e-6) Significance threshold for genes

09a_GEMMA_to_GCTA-COJO.py

  • Purpose: make GWAS output into GCTA-COJO format
  • Input: PLINK binary genotype .fam file and phenotype names
  • Output: GCTA-COJO .ma for each phenotype
  • Example: python 09a_GEMMA_to_GCTA-COJO.py --fam AMR.fam
  • Options:
    • (--fam) .fam file path

10a_make_COLOC_input.R

  • Purpose: put GWAS and eQTL data into COLOC input format
  • Input: phenotype names and sample size of GWAS cohort
  • Output: COLOC GWAS and eQTL in COLOC_input/
  • Example: Rscript 10a_make_COLOC_input.R --sample_size 320
  • Options:
    • (--sample_size) The sample size of the GWAS data

10b_COLOC_wrapper.py

  • Purpose: run COLOC for all phenotypes and all models
  • Input: phenotype names, sample size of GWAS cohort, and COLOC GWAS and eQTL in COLOC_input/
  • Output: COLOC output in COLOC_results/
  • Example: python 10b_COLOC_wrapper.py --sample_size 320
  • Options:
    • (--sample_size) The sample size of the GWAS data

11_back_elim.R

  • Purpose: perform backward elimination modeling of all significant genes to find independent signals
  • Input: significant genes, phenotype file with IDs, and name of phenotype to analyze
  • Output: Backward elimination results
  • Example: Rscript 11_back_elim.R --sig_gene output/pheno1_sig_genes.txt --pheno_name pheno1
  • Options:
    • (--sig_gene) Path to significant genes file produced in 08_sig_SNP_sig_gene.py
    • (--pheno_name) Name of phenotype to analyze; must match the name given in phenotype with IDs

14a_make_classes.R

  • Purpose: make .classes input for RFMix
  • Input: HAPI-UR haplotype IDs and reference population name
    • Note: input must be in order for HAPI-UR, then refpop
  • Output: RFMix classes
  • Example: Rscript 14a_make_classes.R haplotypes/chr22.phind HIS

15a_RFMix_to_BIMBAM.py

  • Purpose: convert RFMix output into an intermediate for GEMMA input downstream
  • Input: PLINK genotype fam file, HAPI-UR haplotype IDs, HAPI-UR SNPs, RFMix Viterbi, and chromosome number
  • Output: Local ancestry SNP table
  • Example: python 15a_RFMix_to_BIMBAM.py --Viterbi RFMix/chr22.rfmix.2.Viterbi.txt --phind haplotypes/chr22.phind --fam AMR.fam --phsnp haplotypes/chr22.phsnp --chr 22
  • Options:
    • (--Viterbi) Viterbi table output by RFMix
    • (--phind) Haplotype ID file output by HAPI-UR
    • (--phsnp) SNP information file output by HAPI-UR
    • (--fam) PLINK genotype .fam file

15b_loc_anc_wrapper.py

  • Purpose: run phenotype associations with each local ancestry in GEMMA
  • Input: GEMMA BIMBAM in BIMBAM/, GEMMA SNP annotation in anno/, relatedness matrix without IDs, phenotypes without IDs, GEMMA covariates, phenotype names, local ancestry SNP table, reference population, and chromosome number
  • Output: Linear mixed model output for each ancestry (2 in AFA, 3 in HIS) in output/
  • Examples:
    • python 15b_loc_anc_wrapper.py --snptable loc_anc_input/chr22.csv --refpop HIS --chr 22
  • Options:
    • (--snptable) SNP table output from 15a_RFMix_to_BIMBAM.py
    • (--refpop) Reference population, choose from AFA (African American) or HIS (Hispanic)
    • (--chr) Chromosome to analyze; must have corresponding SNP table

All commands together

Rscript 00_simulate_pheno_covar.R --bfile AMR
Rscript 02_relate_matrix_PCs.R --bfile AMR
mkdir dosages/; cd dosages/; wget https://raw.githubusercontent.com/hakyimlab/PrediXcan/master/Software/convert_plink_to_dosage.py; awk '{ print $1"\t"$2 }' ../AMR.fam > samples.txt; python convert_plink_to_dosage.py -b ../AMR -p /usr/local/bin/plink; cd ..; echo "Completed converting dosages."
python 05a_PrediXcan_dosages_to_GEMMA.py --dosage_suffix .txt.gz
Rscript 05b_make_GEMMA_covars.R --pcs_num 5
python 05c_GEMMA_wrapper.py --GWAS
python 06_make_pred_exp.py
python 07a_convert_PrediXcan_to_GEMMA.py
python 05c_GEMMA_wrapper.py --pred_exp
python 08_sig_SNP_sig_gene.py --SNP_sig 5e-4 --gene_sig 0.05
python 09a_GEMMA_to_GCTA-COJO.py --fam AMR.fam
gcta64 --cojo-file pheno1.ma --cojo-slct --cojo-p 5e-4 --bfile AMR --cojo-actual-geno --out pheno1; gcta64 --cojo-file pheno2.ma --cojo-slct --cojo-p 5e-4 --bfile AMR --cojo-actual-geno --out pheno2
Rscript 10a_make_COLOC_input.R --sample_size 347
python 10b_COLOC_wrapper.py --GWAS_sample_size 347
Rscript 11_back_elim.R --sig_gene output/pheno1_sig_genes.txt --pheno pheno_wIID.txt --pheno_name pheno1
plink --bfile AMR --bmerge /home/angela/Ad_PX_pipe_data/RFMix/RefPop/HIS --make-bed --out AMR_w_ref --extract AMR.bim; plink --bfile AMR --exclude AMR_w_ref-merge.missnp --make-bed --out AMR_for_merge; plink --bfile /home/angela/Ad_PX_pipe_data/RFMix/RefPop/HIS --exclude AMR_w_ref-merge.missnp --make-bed --out HIS_for_merge --extract AMR.bim; plink --bfile AMR_for_merge --bmerge HIS_for_merge --make-bed --out AMR_w_ref --allow-no-sex
cat HIS_for_merge.fam AMR_for_merge.fam > merged.fam; plink --bfile AMR_w_ref --indiv-sort file merged.fam --cm-map /home/angela/Ad_PX_pipe_data/cm_map/genetic_map_chr@_combined_b37.txt --make-bed --out AMR_w_ref_ordered
mkdir -p merged_w_ref/; for i in {1..22}; do plink --bfile AMR_w_ref_ordered --chr ${i} --make-bed --out merged_w_ref/chr${i}; done
mkdir -p haplotypes/; for i in {1..22}; do /home/angela/Ad_PX_pipe_data/HAPI-UR/hapi-ur -p merged_w_ref/chr${i} -w 64 -o haplotypes/chr${i}; done
for i in {1..22}; do awk '{print $3}' haplotypes/chr${i}.phsnp > haplotypes/chr${i}.snp_locations; done; Rscript 14a_make_classes.R haplotypes/chr22.phind HIS
mkdir -p RFMix/; cd /home/angela/Ad_PX_pipe_data/RFMix/; python RunRFMix.py -e 2 -w 0.2 --num-threads 10 --use-reference-panels-in-EM --forward-backward PopPhased /home/angela/Ad_PX_pipe/haplotypes/chr22.phgeno /home/angela/Ad_PX_pipe/RFMix.classes /home/angela/Ad_PX_pipe/haplotypes/chr22.snp_locations -o /home/angela/Ad_PX_pipe/RFMix/chr22.rfmix
cd /home/angela/Ad_PX_pipe/; python 15a_RFMix_to_BIMBAM.py --Viterbi RFMix/chr22.rfmix.1.Viterbi.txt --phind haplotypes/chr22.phind --fam AMR.fam --phsnp haplotypes/chr22.phsnp --output_prefix chr22
python 15b_loc_anc_wrapper.py --snptable loc_anc_input/chr22.csv --refpop HIS --chr 22