017: GWAS with PLINK - McGranahanLab/Guidebook GitHub Wiki

A small note for the reader: there are couple of references to Drosophila here, because the author of this tutorial used to do GWAS on fruit flies.

Intro

Plink is a free and the most used software for conducting GWAS. It fully operates through the command line and can compute a 1 phenotype association to a cohort of Drosophila samples under several minutes.

In a nutshell, GWAS associates genotypes with phenotypes, thus, we will need 2 mandatory data inputs, such as variant file (aka VCF) and phenotype file. Covariates can be optionally supplemented via one more, 3rd file. Unfortunately, PLINK does not take VCF as an input, but has its own input format:

.ped file which contains information on the individuals and their genotypes
.map file which contains information on the genetic markers Classically for PLINK, the phenotype is actually written in the ped file, but I do not do that because usually I test a lot of phenotypes. In such case I put phenotypes in the separate file. Both ped and map files are in fact usual text files, they are rather big. As usual, a compression to binary file can be applied, then PLINK will require 3 binary files as an input: a
.bed binary file that contains individual identifiers (IDs) and genotypes
.fam text file that contain information on the individuals
.bim text file that contain information on the genetic markers

Because reading large text files can be time‐consuming, it is recommended to use binary files. However, I prefer not binary ped and map files in my practice. Let’s consider ped and map formats now.

Input files

Ped file

The PED file is a white-space (space or tab) delimited file without header. The first six columns are mandatory:

Family	ID	Individual ID	Paternal ID	Maternal ID	Sex (1 = F, 2= M)	Phenotype

Specifications:

the combination of family and individual ID should uniquely identify a person
PED file must have 1 and only 1 phenotype in the sixth column
phenotype can be either a quantitative trait or an affection status, PLINK will automatically detect the type. Missing value for phenotype by default is -9.
Quantitative traits with decimal points must be coded with a period/full-stop character and not a comma, i.e. 2.394 not 2,394
If an individual's sex is unknown, then any character other than 1 or 2 can be used. However, these individuals will be dropped from any analyses. To disable the automatic setting of the phenotype to missing if the individual has an ambiguous sex code, add the --allow-no-sex option
Other phenotypes can be swapped in by using the --pheno (and possibly --mpheno) option, which specify an alternate phenotype is to be used

Then, the columns 7 and onwards contain genotypes of the individuals. The genotypes can be coded by any character (e.g. 1,2,3,4 or A,C,G,T or anything else) except 0 which is, by default, the missing genotype character. All markers should be biallelic. ALL SNPs (whether haploid or not) must have two alleles specified.

Example of the ped file:

Family	ID	Individual ID	Paternal ID	Maternal ID	Sex	Phenotype	SNP1_A1	SNP1_A2	SNP2_A1	SNP2_A2	SNP3_A1	SNP3_A2
FAM001	1	0	0	0	1	2	A	G	G	G	A	C
FAM001	2	0	0	1	2	2	A	A	A	G	0	0

Note: The header isn’t usually present in the file. If you want it in the file, put # in the beginning of the line. Most of the times I put dummy phenotype inside ped, because I need to test many phenotypes and it’s easier to supplement them though a separate file.

Map file

Map file contains information on the name and position of the variants in the ped file. It is again a white-space (space or tab) delimited file without header. There are 4 columns in it:

Chromosome	VariantID	Position in morgans	Base pair coordinate

Specifications:

position in morgans = physical position of the variant. It is safe to use dummy value of 0 instead of position in morgans
The autosomes should be coded 1 through 22. In Drosophila, I usually replace chr2L with chr1, chr2R with chr2, etc.
The order of the variants in this file should follow the order of the variants in ped file
The following other codes can be used to specify other chromosome types:

    X    X chromosome                    -> 23
    Y    Y chromosome                    -> 24
    XY   Pseudo-autosomal region of X    -> 25
    MT   Mitochondrial                   -> 26

Example of map file:

Chromosome	VariantID	Position in morgans	Base pair coordinate
chr1	SNP1	1000	3556
chr2	SNP2	8689	10595
chr3	SNP3	7897	12598

Note: The header isn’t usually present in the file. If you want it in the file, put # in the beginning of the file.

Phenotype file

Phenotype file contains information about phenotype values in the individuals from ped file, it’s tab or space delimited. There are 2 columns which define sample and one column per phenotype. Phenotypes can be binary or continuous. One can mix phenotypes of different types in the same file, i.e. phenotype #1 can be continuous and phenotype #2 can be binary. The header is present in the file.

FamilyID	IndividualID	PhenotypeID_1	PhenotypeID_2

Example of phenotype file:

FamilyID	IndividualID	PhenotypeID_1	PhenotypeID_2
FAM001	1	female	0.46464
FAM001	2	male	0.6353

Covariate file

There is a possibility to supply PLINK with covariates measured for the study. The file is a tab/space delimited table with the header. Example:

FamilyID	IndividualID	Age	BMI
FAM001	1	40	20
FAM001	2	35	17

GWAS

Now, then we know all file formats, we can perform GWAS. I assume that you have vcf file with variants, phenotype file with the phenotypes you would like to test and a table with covariates (optional). Also, phenotype and covariate file should be formatted as described above.

Step 1: create ped and map files from VCF

To do so, vcftools are used:

vcftools --vcf input.vcf  --plink  --out plink_input_from_vcf

This will create the map and ped files which will be used in the following steps. Or directly in PLINK:

plink --vcf input.vcf  --allow-extra-chr  --out plink_input_from_vcf

This will create a binary bed, fam, bim files. I prefer vcftools. In case you prefer to work with binary files, always write --bfile instead of --file in all commands below.

Step 2: pruning of the variants

Variants which are in LD will inflate p-values of the association and thus are needed to be pruned:

plink --file plink_input --indep-pairwise 500 5 0.2 --allow-extra-chr

--indep-pairwise requires three parameters:

a window size in variant count or kilobase (if the 'kb' modifier is present) units
a variant count to shift the window at the end of each step
a pairwise r2 threshold: at each step, pairs of variants in the current window with squared correlation greater than the threshold are noted, and variants are pruned from the window until no such pairs remain.

--allow-extra-chr allow non-conventional chromosome names.

This command will produce 2 files: plink.prune.in and plink.prune.out. The first file will contain variants which survived pruning, and the second one will contain variants which didn’t. However, both types of variants, pruned or not, are still in the plink_input.ped. In fact, it wasn’t touched. Let’s remove them now:

plink --file  plink_input 
         --exclude plink.prune.out 
         --allow-extra-chr --recode --out plink_input_pruned

--exclude allows to exclude variants listed in plink.prune.out. This command will create new map and ped files plink_input_pruned.map and plink_input_pruned.ped.

Step 3: cutoff on MAF and genotyping rate

Usually we are not interested in testing rare variants, thus they needed to be removed. The cutoff on MAF depends on the size of the dataset, i.e. for the 100 individuals minimal MAF would be 0.05, but for 1000 it can be 0.01. We will also filter out all variants which have low genotyping rate.

plink --file plink_input_pruned 
         --allow-extra-chr  
         --maf 0.05 
         --geno 0.2 --recode --out plink_input_pruned_maf

--maf minimal minor allele frequency for the variant to be included

--geno filters out all variants with missing call rates exceeding the provided value

Step 4: running PLINK

Finally, we can run detection of associations

plink --file plink_input_pruned_maf --assoc 
         --pheno phenotype.txt --all-pheno 
         --covar covariates.txt 
         --allow-no-sex 
         --missing-phenotype MG 
         --adjust
         --allow-extra-chr --out plink_result

--assoc basic association test (Chi-square). It’s also possible to have --model (Cochran-Armitage trend test or Genotypic (2 df) test or Dominant gene action (1df) test or Recessive gene action (1df) test), --fisher (Fisher test), --linear (linear regression) and --logistic (logistic regression) instead of --assoc. --pheno path to the phenotype file --all-pheno tells PLINK that all phenotypes in the pheno file should be tested --covar path to covariate file (optional) --allow-no-sex by default, PLINK will exclude individuals, for which sex is unknown, this option will prevent it from doing it --missing-phenotype supply to PLINK a code for the missing phenotype. By default it’s -9. --adjust adjust p-values for multiple testing. In case of multiple phenotype usage, it will not correct for multiple phenotypes.

As a result, there will be 1 file per phenotype with the extension .assoc, with the following columns in case --assoc was used:

CHR	SNP	BP	A1	F_A	F_U	A2	CHISQ	P	OR
Chromosome	SNP ID	Physical position (base-pair)	Minor allele name (based on whole sample)	Frequency of this allele in cases	Frequency of this allele in controls	Major allele name	Basic allelic test chi-square	P-value (unadjusted)	odds ratio

In case other test is used (--model / --fisher / --linear / --logistic) the output will also be a table, but the column names will be different. One can check out all the possible output tables here.

The adjusted p-value will be output to the separate file with the following columns

CHR	SNP	UNADJ	GC	BONF	HOLM	SIDAK_SS	SIDAK_SD	FDR_BH	FDR_BY
chromosome	SNP ID	Unadjusted p-value	Genomic-control corrected p-values	Bonferroni adjusted p-values	Holm adjusted p-values	Sidak single-step adjusted p-values	Sidak step-down adjusted p-values	Benjamini & Hochberg step-up FDR control	Benjamini & Yekutieli step-up FDR control

Step 4: multiple test corrections via permutations

Permutation procedures provide a computationally intensive approach to generating significance levels empirically. Such computations might be needed then you have doubts about normality of phenotype distributions, about fulfillment of any other test assumptions or have small sample size.

There are 2 ways to do permutations in PLINK: adaptive vs max(T).

In adaptive permutation approach, PLINK gives up permuting SNPs that are clearly going to be non-significant more quickly than SNPs that look interesting. In other words, if after only 10 permutations it sees that for 9 of these the permuted test statistic for a given SNP is larger than the observed test statistic, there is little point in carrying on, as this SNP is unlikely to ever achieve a highly significant result. This greatly speeds up the permutation procedure.
In contrast, max(T) permutation does not drop SNPs along the way. If 1000 permutations are specified, then all 1000 will be performed, for all SNPs. The benefit of doing max(T) is that two sets of empirical significance values can then be calculated - pointwise estimates of an individual SNPs significance, but also a value that controls for that fact that thousands of other SNPs were tested.

In order to run PLINK with adaptive permutations, replace --adjust with --perm, and replace with --mperm 1000 to do 1000 max(T) permutations. In case of adaptive permutations, new file will be created, with the .perm extension:

CHR	SNP	STAT	EMP1	NP
Chromosome	SNP ID	Test statistic	Empirical p-value	Number of permutations

In case of max(T) permutations, new file will be created, with the .mperm extension:

CHR	SNP	STAT	EMP1	EMP2
Chromosome	SNP ID	Test statistic	Empirical p-value (pointwise)	Corrected empirical p-value (max(T) / familywise)

Step 5: annotate significant variants

Once you found out significant variants you can extract them from the vcf you used for PLINK with vcftools and them annotate them with snpEff. It’s a command line tool:

java -jar snpEff.jar BDGP6.82 sinificant_GWAS.vcf > sinificant_GWAS.annot.vcf

The database (BDGP6.82) will be dependent on the organism

P.S. A good paper - tutorial on PLINK: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6001694/

And example usage of PLINK on HapMap project: http://zzz.bwh.harvard.edu/plink/tutorial.shtml

P.P.S. It’s really a basic tutorial, I didn't touch on population stratification, etc, mainly because I don’t do it myself.