Genome‐Wide Association Studies (GWAS) - izeunice05/ArchivedTools GitHub Wiki

Description: GWAS

Author: Eunice Lee

Part I. Preprocessing of genotype and phenotype data

For this pipeline, it is expected that you are able to open up the terminal (linux environment) and execute the codes below.

  • Genotype: Taiwan Biobank {data directory}
  • Phenotype: Type 2 Diabetes (binary outcome) {data directory}
  • Covariates: age, sex, first 10 principle components {data directory}

a. Quality control of genotype data (array, sequenced, imputed data)

We are going to preprocess or quality control the genotype data using plink commands. Table below shows the usage of plink commands to filter variants and individuals.

Command Description
--geno Missingness per marker
--mind Missingness per individual
--maf Allele frequency
--hwe Hardy-Weinberg equilibrium
--genome Identity-by-descent
--make-king-table Identity-by-descent (plink2.0)
--snps-only Exclude indels
--recode A Recode to allelic dosage

Set the directory of where your genotype data is stored. For this pipeline, we are going to use genotype data in plink format.

cd /project/elee4754_1222/gwas_mec/NH/merged/plink_format

If your data is in vcf format, you can convert VCF to plink format. You can find more information regarding submitting a slurm job converting all autosome VCFs to plink format (bed/bim/fam files) under Slurm wiki.

module load bcftools
PLINK=/project/chia657_28/programs/plink/plink1.90/plink

$PLINK --vcf chr1.vcf.gz --make-bed --out output_directory/chr1

Let's perform quality control of genotype data. For the purpose of learning, let's conduct QC on chromosome 22 for now. Genotype QC parameters shown below table can be changed depending on your project.

  • Assembly: GRCh38.p13
QC Order QC parameters Description/Note
1 geno= 0.05 Filtered variants with genotyping frequency <95%
2 mind= 0.05 Excluded individuals with genotype rates < 95 %
3 maf= 0.01 Filtered out variants with minor allele frequency <1%
4 hwe= 0.0001 Filtered out variants with HWE exact test p-value < threshold
5 min = 0.025 Removed individuals with relatedness coefficient < cutoff
6 snps-only Excluded indels
plink --bfile chr22 --geno 0.05 --mac 1 --make-bed --out chr22_geno
plink --bfile chr22_geno --mind 0.05 --make-bed --out chr22_mind
plink --bfile chr22_mind --maf 0.01 --make-bed --out chr22_maf
plink --bfile chr22_maf --hwe 0.0001 --make-bed --out chr22_hwe
plink --bfile chr22_hwe --genome --min 0.025 --make-bed --out chr22_genome
plink --bfile chr22_genome --snps-only --make-bed --out chr22_no_indel 

Let's perform quality control for all autosomes (chr1 to chr22).

for i in {1..22};do plink --bfile megagda.chr$i --geno 0.05 --mac 1 --mind 0.05 --maf 0.01 --hwe 0.0001 --genome --min 0.025 --snps-only --make-bed --out qced.megagda.chr$i; done

This is the summary of your QC results (you can go back to your QC logs and make a comparison.

  • Summary of SNPs
Command Removed Remaining
Initial
Geno
MAF
HWE
SNPs only
Total
  • Summary of Individuals
Command Removed Remaining
Initial
Mind
Relatedness
Phenotype
Total (cases: ,controls: )

Part II. Association tests

b. Conduct GWAS for common variants

We are going to conduct genome-wide association tests for a binary phenotype adjusted for covariates using plink. Save your results in a separate directory or file.

plink --bfile /project/elee4754_1222/gwas_mec/qc_imputed_data/megagda_allchr.snps --logistic --pheno pheno_anyT2D_nh_all.txt --covar covariates_anyT2D_nh_all.txt --allow-no-sex --ci 0.95 --out /project/elee4754_1222/gwas_mec/gwas_T2D/gwas.t2d.snps

c. Create Manhattan plots (in R)

You can export your logistic regression results to your local drive. We are going to plot the results using r-package called 'qqman' (https://r-graph-gallery.com/101_Manhattan_plot.html).

# install and load the package
install.packages("qqman")
library(qqman)


# define the local directory where you transferred GWAS results
setwd("C:/Users/izeun/Documents/usc/research_project/mec_cohort/data/mec_genetics/gwas_dietscore_unhealthy")

files <- list.files(pattern=".txt")
gwasResults <- NULL
for (f in files) {
  dat <- read.table(f, header=T, sep="\t", na.strings="")
  gwasResults <- rbind(gwasResults, dat)
}

# Manhattan plot (make sure that your data headers (variable names) have the correct names)
manhattan(gwasResults, chr="CHR", bp="BP", snp="SNP", p="P" )

# Generate QQ plot
qq(gwasResults$P)