Walkthrough - aandaleon/Ad_PX_pipe GitHub Wiki

Sample Data Phenotypes and Covariates

(00_simulate_pheno_covar.R) We will produce randomized phenotypes and covariates, as the 1000G data does not have any phenotypes associated with it, but this does not apply in the real-life data you will study. Raw data that you will analyze is normally messy, including low accuracy SNPs or individuals that are outliers in the phenotype (ex. TRIG > 1,500 mg/dL), so we must perform quality control to ensure that our findings are accurate and not simply an artifact of messy data. We would normally perform quality control in PLINK using Ryan’s gwas_qc pipeline, but the test data has already been filtered. Please use the gwasqc_pipeline for your own raw data.

Principal Components

A primary concern with admixed populations is population structure due to the differences in allele frequencies continentally, which becomes complicated in individuals with multi-continental ancestry such as AMR. We calculate principal components to correct for stratification, because, without these data, the phenotypic findings at the end are heavily confounded, inflated, and unreliable. Additionally, another source of confounding is relatedness within cohorts, which we may not be able to remove completely without sacrificing a large portion of our cohort. GWAS software such as GEMMA can account for this relatedness computationally with the input of the cohort’s relationship matrix, a measurement of the relatedness between all members of a cohort. (02_relate_matrix_PCs.R) We calculate principal components and a relationship matrix in KING, software optimized for structured populations.

Principal component plot of an admixed Hispanic cohort

Imputation

Another issue with raw data is the lack of coverage across the genome, as many commercial genotyping arrays capture less than 500k or 1m SNPs across the genome. We can infer the missing data in between our known data by performing imputation, which uses whole-genome sequences to “fill in” the gaps where SNPs are known to be inherited together most of the time. Current GWAS input data are mainly based on genotyping chips collecting single-nucleotide variations and frequently contain 200,000-2,000,000 SNPs commonly-occurring SNPs compared to the human genome's estimated 3 billion base pairs due to price and speed. To compensate, chips are designed to capture as many independent SNPs as economically possible and depend on LD and imputation to interpolate the missing portions, relying on comprehensive understanding and matching population data for the intended population.

Overview of Imputation

Overview of imputation

Due to the dropping cost of whole genome sequencing since the first human genome in 2003, thousands of individuals have been sequenced and assembled into various reference panels, such as 1000G, the Haplotype Reference Consortium (HRC), and the African Genome Resources panel (AGR). These reference panels are used to impute, or interpolate based on collected data, variants between SNPs in genotyped individuals. In regular data, we would then impute data with the Michigan Imputation Server to increase the number of SNPs in linkage disequilibrium with our data. (03_Michigan_Imputation_Server.pptx) The instructions for this are included in the powerpoint in the repository. This imputation produces another type of genotype format called a VCF (variant call format), so we have to convert it to a useable format for our other software. We convert our genotypes to PrediXcan-style dosages with pre-built scripts. For Michigan Server data, there is a separate python script that converts from Michigan .vcf format to PrediXcan format. This does not apply with our sample data.

Genome-wide Association Studies (GWAS)

A widely popular method of studying the associations between genetic variants and phenotypes is the genome-wide association study, abbreviated as GWAS. Input data for GWAS include DNA genotypes, a collection of single-nucleotide polymorphisms (SNPs), as well as a collection of a particular phenotype under investigation, such as prostate cancer, schizophrenia, or lipid levels, for genotyped individuals. Genotype data are most commonly collected on a microarray and describe a single allele change or short indels in a particular place in the genome. GWAS are performed by regressing genotypes, including hundreds of thousands or millions of SNPs, against a case-control or quantitative phenotype to find statistically significant associations between a particular genotype and phenotype, with genome-wide significance designated as P < 5e-8.

A Manhattan plot, a common way to depict GWAS results

This begins our genome-wide association study using GEMMA. GEMMA, standing for Genome-wide Efficient Mixed Model Association, performs a genome-wide association study while also accounting for relatedness and given covariates, making it ideal for related and structure cohorts. (05a_PrediXcan_dosages_to_GEMMA.py) We convert from the PrediXcan dosage format to the similar BIMBAM format, which is the genotype input for GEMMA. (05b_make_GEMMA_covars.R) The next script makes a covariance file from known covariates and a user-determined number of principal components from KING. It is important that you include covariates that may confound your phenotype, such as lipid-lowering medications when studying cholesterol levels. (05c_GEMMA_wrapper.py) We then run all these data across all 22 autosomal chromosomes and in all supplied phenotypes in a loop.

PrediXcan

Because transcriptome data is difficult and expensive to collect and sequence, there are only a few popular datasets available, such as FHS whole blood and GTEx. By creating reference models from these transcriptome data, similar to reference panels for genotype imputation, researchers can predict transcriptomes for results on a more actionable gene-level basis rather than a SNP-level basis. In one method, PrediXcan, this is done by predicting the heritable, genetic component of gene expression and creating additive models of gene expression levels. These models can then translate genotype input into predicted gene expression levels and regress them against a phenotype of interest, creating a gene-level association study with just genotype data as input rather than transcriptome data. Though we have known SNP-level data, we would like to know how they affect biological mechanisms, such as gene expression, which is a much more feasible target for precision medicine.

Predicted gene expression vs. phenotype

Gene expression vs. phenotype

We do not have the cohort’s gene expression profile available (because that’s expensive and difficult), but we can predict the “missing” data similarly to how we imputed our genotype data - using reference models in PrediXcan. (06_make_pred_exp.py) We start the imputed transcriptome based association study by calculating predicted gene expression in 44 GTEx tissues and 5 MESA models. (07a_convert_PrediXcan_to_GEMMA.py) We then convert these predicted expression data into pseudo-genotypes to input into GEMMA similar to BIMBAM, (05c_GEMMA_wrapper.py) and run all phenotypes, populations, and tissues in a loop in GEMMA similar to the GWAS, accounting for relatedness and covariates.

Significance of Results

(08_sig_SNP_sig_gene.py) We then extract the significant SNPs with the threshold determined by the user. After determining significant SNPs, we seek to find independently associated loci instead of cutting off independence at an arbitrary distance. (09a_GEMMA_to_GCTA-COJO.py) We calculate independent significant SNPs in a joint analysis in GCTA-COJO, starting by converting the GWAS output into GCTA-COJO format, then running GCTA-COJO separately in each phenotype.

Colocalization

Additionally, with our GWAS results, we seek to find if our SNPs also have biological significance. One type of SNPs with biological significance is an expression quantitative trait loci, which associates variations in gene expression levels to genotypes. (10a_make_COLOC_input.R and 10b_COLOC_wrapper.py) After we have our significant gene results, we will then perform colocalization testing between GWAS results and eQTL data using COLOC in a COLOC wrapper. COLOC estimates if GWAS and eQTL signals are linked and colocalized (P4 > 0.5), if they are independent from each other (P3 > 0.5), or neither.

An example of unknown sharing of signals, independent signals, and colocalized signals

Backward Elimination

Genes are not independent of each other and may have correlated expressions, but we seek to find independent, possibly causal gene signals. We perform backward elimination modeling of all significant genes using stepwise regression to find statistically significant predictors for our phenotypes.

Local Ancestry

A unique aspect of admixed populations is their multi-continental ancestry, which creates a mosaic of ancestries along the genome known as local ancestry. For African-American cohorts, this is usually a two-way admixture between European and West African populations, and for Hispanic populations such as AMR, this is usually a three-way admixture between Native American, European, and West African populations. We infer this “mosaic” using local ancestry analyses in RFMix, starting with making reference populations in PLINK, which will be determined mainly by the population in question. For African-American populations, this will be CEU (Central Europeans in Utah) and YRI (Yoruba in Ibadan, Nigeria). For Hispanic populations, this will be IBS (Iberian in Spain), NAT (Native American), and YRI. We infer haplotypes using HAPI-UR from our genotypes. Using these haplotypes, we calculate local ancestry estimations in RFMix.

Chromosome painting of multi-continental ancestry

Admixture Mapping

These different blocks of local ancestry may contribute differently to the phenotype, with previous studies finding Hispanic-specific or African-specific SNPs contributing the most to the phenotype. Using our local ancestry estimates, we can also test if the presence of local ancestry is also significantly associated with our phenotype in admixture mapping. We convert these local ancestry estimations for each haplotype generated in HAPI-UR and [RFMix] for use in GEMMA to perform admixture mapping.