LD tutorial performing LD around a lead SNP - pcgoddard/Burchardlab_Tutorials GitHub Wiki
Author: Jennifer Liberto
Date: 20 Nov 2017
Author contact: [email protected]
LD stands for linkage disequilibrium whose formal definition is: the non-random association of alleles at different loci in a population
When performing an LD analysis, the purpose is to find the region of indetermination, or a genomic window in which the SNP driving the association of interest is likely to reside. For the scope of this tutorial, the purpose of performing an LD analysis is slightly more focused; we are trying to find the SNPs surrounding a lead SNP (or a SNP known to be a driver of the association of interest) that are in LD with it.
There are two calculations that will produce an estimate of linkage disequilibrium between two alleles, D' and R^2. D' ranges between -1 and 1 where the extremes indicate that at least one of the allelic haplotypes was not observed and that one marker is a good surrogate for the other. However, D' has its drawbacks in that D' estimates are inflated in small sample sizes and then one allele is rare. As such, the standard is to use R^2. R^2 ranges between 0 and 1 with 0 meaning the two markers are in perfect equilibrium and 1 meaning the markers provide identical information. Below is a mathematical breakdown of the calculation of D' and R^2. In this tutorial, PLINK performs this simple calculation between the lead SNP and all the other SNPs in the specified window.
-
start with two SNPs of unknown equilibrium
-
Assign allele frequencies
-
Assign haplotype frequencies
-
In perfect equilibrium, the haplotype frequencies will equal the product of the allele frequencies as shown in the formulas below.
-
In disequilibrium, the haplotype frequencies will only equal the product of the allele frequencies with the addition of a disequilibrium constant, D, as shown in the formulas below.
-
To find D the formula is as follows:
-
to find D' use the following logic and formula
-
to find R2 the formula is as follows:
- Remember, when R^2 is closer to 0 the two alleles are closer to equilibrium than disequilibrium.
- The standard threshold to determine if two markers are in disequilibrium is R^2 >= 0.8. If no results are popping up, set the threshold to no lower than 0.5.
- You should limit the window of your search for SNPs in LD to +/- 1Mb away from your lead SNP.
PLINK is a very user-friendly and easy to learn program. Resources about PLINK can be found at:
To perform a lead SNP LD analysis use the following command:
plink --bfile <bedfile prefix> --chr <chr #> --from-bp <-1Mb from lead SNP> --to-bp <+1Mb from lead SNP> --r2 --ld-window-r2 0.8 --ld-snp <lead SNP id> --out <outfile prefix>
You will get an output of the following:
| Column Name | Description |
|---|---|
| CHR_A | Chromosome code for the first variant (the lead SNP) |
| BP_A | Base pair coordinate of the first variant |
| SNP_A | ID of the first variant |
| CHR_B | Chromosome code for the second variant |
| BP_B | Base pair coordinate of the second variant |
| SNP_B | ID of the second variant |
| R2 | Correlation coefficient |
Unfortunately, the output isn't formatted well because the output separates columns by a series of spaces. This is not preferable if you intend to perform further analysis with the results i.e. an association analysis. To correct for this issue, use the following command to output a CSV file.
cat <outfile prefix>.ld | perl -pe 's/^\s+//g' | perl -pe 's/\v/\n/g' | perl -pe 's/\h+/\,/g' | perl -pe 's/,$//g' > <outfile prefix>.cleaned.csv
IT IS IMPORTANT TO NOTE THAT YOUR LD ANALYSIS SHOULD BE ETHNICALLY STRATIFIED that is, perform this analysis for each of the three populations (Puerto Rican, Mexican, and African American).
A "meta LD analysis" can be performed by using the max and min positions of the union of the results from the three separate LD runs. For example, say you have the following results from your ethnically-stratified LD analysis:
# AF
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
19 18141996 19:18141996:C:T 19 18141950 19:18141950:T:C 0.902143
19 18141996 19:18141996:C:T 19 18141996 19:18141996:C:T 1
19 18141996 19:18141996:C:T 19 18142423 19:18142423:T:C 0.894456
19 18141996 19:18141996:C:T 19 18142682 19:18142682:T:C 0.858933
# PR
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
19 18141996 19:18141996:C:T 19 18141950 19:18141950:T:C 0.901888
19 18141996 19:18141996:C:T 19 18141996 19:18141996:C:T 1
19 18141996 19:18141996:C:T 19 18142423 19:18142423:T:C 0.901811
19 18141996 19:18141996:C:T 19 18142682 19:18142682:T:C 0.87746
# MX
CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2
19 18141996 19:18141996:C:T 19 18141950 19:18141950:T:C 0.969005
19 18141996 19:18141996:C:T 19 18141996 19:18141996:C:T 1
19 18141996 19:18141996:C:T 19 18142075 19:18142075:G:A 0.66809
19 18141996 19:18141996:C:T 19 18142423 19:18142423:T:C 0.968828
Your "meta LD analysis" should limit the window to search for SNPs in LD with the lead SNP from 18141950 to 18142682
If your SNP of interest is not found in your bedfiles, this could be a result of it not passing QC filters. Use PLINK to extract your SNP of interest from the raw bedfiles and merge it into the filtered bedfiles (make sure to name the merged file something different so as not to overwrite the previous file). The following commands are a nice guide to help with this step. Remember everything has to be ethnically stratified in order for this to work properly
# extract
plink --bfile <raw plink file prefix> --keep <ethnically stratified filtered fam file> --extract <SNP id to extract> --make-bed --out <outfile extract prefix>
# merge
plink --bfile <ethnically stratified filtered bedfiles prefix> --bmerge <outfile extract prefix>.bed <outfile extract prefix>.bim <outfile extract prefix>.fam --make-bed --out <outfile merge prefix>
Okay that's it!! Good luck and don't hesitate to contact me with any questions/issues that might arise during this process for you.







