LD tutorial performing LD around a lead SNP - pcgoddard/Burchardlab_Tutorials GitHub Wiki

Author: Jennifer Liberto
Date: 20 Nov 2017
Author contact: [email protected]

LD -- A little background

LD stands for linkage disequilibrium whose formal definition is: the non-random association of alleles at different loci in a population

When performing an LD analysis, the purpose is to find the region of indetermination, or a genomic window in which the SNP driving the association of interest is likely to reside. For the scope of this tutorial, the purpose of performing an LD analysis is slightly more focused; we are trying to find the SNPs surrounding a lead SNP (or a SNP known to be a driver of the association of interest) that are in LD with it.

The Math Behind LD

There are two calculations that will produce an estimate of linkage disequilibrium between two alleles, D' and R^2. D' ranges between -1 and 1 where the extremes indicate that at least one of the allelic haplotypes was not observed and that one marker is a good surrogate for the other. However, D' has its drawbacks in that D' estimates are inflated in small sample sizes and then one allele is rare. As such, the standard is to use R^2. R^2 ranges between 0 and 1 with 0 meaning the two markers are in perfect equilibrium and 1 meaning the markers provide identical information. Below is a mathematical breakdown of the calculation of D' and R^2. In this tutorial, PLINK performs this simple calculation between the lead SNP and all the other SNPs in the specified window.

  1. start with two SNPs of unknown equilibrium
    one

  2. Assign allele frequencies
    two

  3. Assign haplotype frequencies
    three

  4. In perfect equilibrium, the haplotype frequencies will equal the product of the allele frequencies as shown in the formulas below.
    four

  5. In disequilibrium, the haplotype frequencies will only equal the product of the allele frequencies with the addition of a disequilibrium constant, D, as shown in the formulas below.
    five

  6. To find D the formula is as follows:
    six

  7. to find D' use the following logic and formula
    seven

  8. to find R2 the formula is as follows:
    eight

  • Remember, when R^2 is closer to 0 the two alleles are closer to equilibrium than disequilibrium.

General Practices:

  • The standard threshold to determine if two markers are in disequilibrium is R^2 >= 0.8. If no results are popping up, set the threshold to no lower than 0.5.
  • You should limit the window of your search for SNPs in LD to +/- 1Mb away from your lead SNP.

Performing LD with PLINK

PLINK is a very user-friendly and easy to learn program. Resources about PLINK can be found at:

To perform a lead SNP LD analysis use the following command:

plink --bfile <bedfile prefix> --chr <chr #> --from-bp <-1Mb from lead SNP> --to-bp <+1Mb from lead SNP> --r2 --ld-window-r2 0.8 --ld-snp <lead SNP id> --out <outfile prefix>

You will get an output of the following:

Column Name Description
CHR_A Chromosome code for the first variant (the lead SNP)
BP_A Base pair coordinate of the first variant
SNP_A ID of the first variant
CHR_B Chromosome code for the second variant
BP_B Base pair coordinate of the second variant
SNP_B ID of the second variant
R2 Correlation coefficient

Unfortunately, the output isn't formatted well because the output separates columns by a series of spaces. This is not preferable if you intend to perform further analysis with the results i.e. an association analysis. To correct for this issue, use the following command to output a CSV file.

cat <outfile prefix>.ld | perl -pe 's/^\s+//g' | perl -pe 's/\v/\n/g' | perl -pe 's/\h+/\,/g' | perl -pe 's/,$//g' > <outfile prefix>.cleaned.csv

IT IS IMPORTANT TO NOTE THAT YOUR LD ANALYSIS SHOULD BE ETHNICALLY STRATIFIED that is, perform this analysis for each of the three populations (Puerto Rican, Mexican, and African American).

A "meta LD analysis" can be performed by using the max and min positions of the union of the results from the three separate LD runs. For example, say you have the following results from your ethnically-stratified LD analysis:

# AF
CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
19     18141996   19:18141996:C:T     19     18141950   19:18141950:T:C     0.902143
19     18141996   19:18141996:C:T     19     18141996   19:18141996:C:T            1
19     18141996   19:18141996:C:T     19     18142423   19:18142423:T:C     0.894456
19     18141996   19:18141996:C:T     19     18142682   19:18142682:T:C     0.858933 

# PR
CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
19     18141996   19:18141996:C:T     19     18141950   19:18141950:T:C     0.901888
19     18141996   19:18141996:C:T     19     18141996   19:18141996:C:T            1
19     18141996   19:18141996:C:T     19     18142423   19:18142423:T:C     0.901811
19     18141996   19:18141996:C:T     19     18142682   19:18142682:T:C      0.87746 

# MX
 CHR_A         BP_A             SNP_A  CHR_B         BP_B             SNP_B           R2
19     18141996   19:18141996:C:T     19     18141950   19:18141950:T:C     0.969005
19     18141996   19:18141996:C:T     19     18141996   19:18141996:C:T            1
19     18141996   19:18141996:C:T     19     18142075   19:18142075:G:A      0.66809
19     18141996   19:18141996:C:T     19     18142423   19:18142423:T:C     0.968828

Your "meta LD analysis" should limit the window to search for SNPs in LD with the lead SNP from 18141950 to 18142682

Troubleshooting

If your SNP of interest is not found in your bedfiles, this could be a result of it not passing QC filters. Use PLINK to extract your SNP of interest from the raw bedfiles and merge it into the filtered bedfiles (make sure to name the merged file something different so as not to overwrite the previous file). The following commands are a nice guide to help with this step. Remember everything has to be ethnically stratified in order for this to work properly

# extract
plink --bfile <raw plink file prefix> --keep <ethnically stratified filtered fam file> --extract <SNP id to extract> --make-bed --out <outfile extract prefix>

# merge
plink --bfile <ethnically stratified filtered bedfiles prefix> --bmerge <outfile extract prefix>.bed <outfile extract prefix>.bim <outfile extract prefix>.fam --make-bed --out <outfile merge prefix>

Okay that's it!! Good luck and don't hesitate to contact me with any questions/issues that might arise during this process for you.

⚠️ **GitHub.com Fallback** ⚠️