Genome‐Wide Association Studies (GWAS) - izeunice05/ArchivedTools GitHub Wiki
Description: GWAS
Author: Eunice Lee
Part I. Preprocessing of genotype and phenotype data
For this pipeline, it is expected that you are able to open up the terminal (linux environment) and execute the codes below.
- Genotype: Taiwan Biobank {data directory}
- Phenotype: Type 2 Diabetes (binary outcome) {data directory}
- Covariates: age, sex, first 10 principle components {data directory}
a. Quality control of genotype data (array, sequenced, imputed data)
We are going to preprocess or quality control the genotype data using plink commands. Table below shows the usage of plink commands to filter variants and individuals.
Command | Description |
---|---|
--geno | Missingness per marker |
--mind | Missingness per individual |
--maf | Allele frequency |
--hwe | Hardy-Weinberg equilibrium |
--genome | Identity-by-descent |
--make-king-table | Identity-by-descent (plink2.0) |
--snps-only | Exclude indels |
--recode A | Recode to allelic dosage |
Set the directory of where your genotype data is stored. For this pipeline, we are going to use genotype data in plink format.
cd /project/elee4754_1222/gwas_mec/NH/merged/plink_format
If your data is in vcf format, you can convert VCF to plink format. You can find more information regarding submitting a slurm job converting all autosome VCFs to plink format (bed/bim/fam files) under Slurm wiki.
module load bcftools
PLINK=/project/chia657_28/programs/plink/plink1.90/plink
$PLINK --vcf chr1.vcf.gz --make-bed --out output_directory/chr1
Let's perform quality control of genotype data. For the purpose of learning, let's conduct QC on chromosome 22 for now. Genotype QC parameters shown below table can be changed depending on your project.
- Assembly: GRCh38.p13
QC Order | QC parameters | Description/Note |
---|---|---|
1 | geno= 0.05 | Filtered variants with genotyping frequency <95% |
2 | mind= 0.05 | Excluded individuals with genotype rates < 95 % |
3 | maf= 0.01 | Filtered out variants with minor allele frequency <1% |
4 | hwe= 0.0001 | Filtered out variants with HWE exact test p-value < threshold |
5 | min = 0.025 | Removed individuals with relatedness coefficient < cutoff |
6 | snps-only | Excluded indels |
plink --bfile chr22 --geno 0.05 --mac 1 --make-bed --out chr22_geno
plink --bfile chr22_geno --mind 0.05 --make-bed --out chr22_mind
plink --bfile chr22_mind --maf 0.01 --make-bed --out chr22_maf
plink --bfile chr22_maf --hwe 0.0001 --make-bed --out chr22_hwe
plink --bfile chr22_hwe --genome --min 0.025 --make-bed --out chr22_genome
plink --bfile chr22_genome --snps-only --make-bed --out chr22_no_indel
Let's perform quality control for all autosomes (chr1 to chr22).
for i in {1..22};do plink --bfile megagda.chr$i --geno 0.05 --mac 1 --mind 0.05 --maf 0.01 --hwe 0.0001 --genome --min 0.025 --snps-only --make-bed --out qced.megagda.chr$i; done
This is the summary of your QC results (you can go back to your QC logs and make a comparison.
- Summary of SNPs
Command | Removed | Remaining |
---|---|---|
Initial | ||
Geno | ||
MAF | ||
HWE | ||
SNPs only | ||
Total |
- Summary of Individuals
Command | Removed | Remaining |
---|---|---|
Initial | ||
Mind | ||
Relatedness | ||
Phenotype | ||
Total | (cases: ,controls: ) |
Part II. Association tests
b. Conduct GWAS for common variants
We are going to conduct genome-wide association tests for a binary phenotype adjusted for covariates using plink. Save your results in a separate directory or file.
plink --bfile /project/elee4754_1222/gwas_mec/qc_imputed_data/megagda_allchr.snps --logistic --pheno pheno_anyT2D_nh_all.txt --covar covariates_anyT2D_nh_all.txt --allow-no-sex --ci 0.95 --out /project/elee4754_1222/gwas_mec/gwas_T2D/gwas.t2d.snps
c. Create Manhattan plots (in R)
You can export your logistic regression results to your local drive. We are going to plot the results using r-package called 'qqman' (https://r-graph-gallery.com/101_Manhattan_plot.html).
# install and load the package
install.packages("qqman")
library(qqman)
# define the local directory where you transferred GWAS results
setwd("C:/Users/izeun/Documents/usc/research_project/mec_cohort/data/mec_genetics/gwas_dietscore_unhealthy")
files <- list.files(pattern=".txt")
gwasResults <- NULL
for (f in files) {
dat <- read.table(f, header=T, sep="\t", na.strings="")
gwasResults <- rbind(gwasResults, dat)
}
# Manhattan plot (make sure that your data headers (variable names) have the correct names)
manhattan(gwasResults, chr="CHR", bp="BP", snp="SNP", p="P" )
# Generate QQ plot
qq(gwasResults$P)