Linkage disequilibrium (LD) - UMEcolGenetics/PawPawPulation-Genetics GitHub Wiki

What is LD?

  • Linkage disequilibrium (LD), sometimes referred to as gametic phase disequilibrium, occurs when alleles of two or more loci (or genes) have a nonrandom association in a population1. In other words, it is when two loci are considered linked, or their inheritance is correlated, potentially influencing the evolutionary trajectory of the other locus/loci, either from their physical proximity on the chromosome or functional association.

  • LD is influenced by many factors, including selection, the rate of genetic recombination, mutation rate, genetic drift, the system of mating, population structure, and genetic linkage. As a result, the pattern of LD in a genome is a powerful signal of the population genetic processes that are structuring it (Wikipedia).

  • Detecting the presence of LD in population genetics is important since it can help address questions related to natural selection and/or predict evolutionary long-term trends. For example, if alleles at two loci are in LD and affect reproductive fitness, then the response to selection on one locus might be influenced by the selection of the other.

Calculating LD

There are two methods typically used to calculate LD in population genetic studies: single-locus (pairwise) and multi-locus. Calculating single-locus LD determines the presence of LD between pairs of loci, while multi-locus LD assesses the background levels of LD within the population. Single-locus LD is central to most LD calculations, however, extending the statistics from a two-locus case to multi-locus allows for a better understanding of LD across the genome and any associations across multiple loci1,2.

There are valid concerns when calculating LD among populations, especially when the study system isn’t a diploid, sexual organism1,2, so determining when to use single-locus, multi-locus, or both LD measures can be difficult. For example, there is not a good or straightforward method to calculate LD among asexual/clonal organisms, so it has been suggested to calculate and report both single-locus and multi-locus LD3.

In this tutorial, we will go over two methods you can use to find both multi-locus and single-locus LD:

  1. The first method will help you calculate multi-locus LD by utilizing multiple functions using the R package poppr4,5. R is a free software environment commonly used for statistical computing.

  2. The second method introduces Genepop on the Web, this web-based GUI offers many population genetic estimates such as Hardy Weinberg Exact Tests, Population Differentiation, FST and other correlations. It includes methods to generate single-locus LD calculations6,7. There are other ways to calculate both forms of LD (e.g., Arlequin, GENETIX, DnaSP, etc. [refer to Mueller 20042 for full list]), but the two methods I am proposing are straightforward and will give you instant and reliable results.

This tutorial will be using raw microsatellite data8 downloaded from Dryad9 to analyze LD among populations of pawpaws (Asimina triloba). This dataset includes two population types: wild and anthropogenic. This tutorial will be combining those population types into one massive dataset containing 82 populations composed of 20 anthropogenic and 62 wild populations called 'pawpaw_FullData.xlsx'.

All files can be found in the 'LD' folder on the main page

Calculating multi-locus LD using R package poppr

The poppr package is an amazing tool that can calculate:

  • Genotypic diversity measures
  • Genetic distances with bootstrap support
  • Clone correction
  • And much more!

In this tutorial, we will focus on how to calculate multi-locus LD in both sexual and asexual individuals by measuring the Index of Association (Ia) and Standardized Index of Association (rbarD). Ia was originally developed by A.H.D. Brown and others to analyze the population structure of wild Barley9, and it has been used widely as a tool to detect clonal reproduction within populations. It is calculated based on the ratio of the variance of the raw number of differences between individuals and the sum of those variances over each locus. rbarD was later developed by Paul-Michael Agapow and Austin Burt who noticed that Ia increases steadily with the number of loci. Therefore, they developed rbarD as a way to standardize the results of Ia10. Both measures are expected to be zero if the alleles are randomly assorted, and greater than zero if there is association between alleles11,12.

Step 1: Ensure correct file format before importing data

R should be installed prior to starting this tutorial. We recommend using R Studio for a more integrated experience for image viewing. If you’re unfamiliar with R, you may want to start with some beginner R tutorials.

First, we need to import our data correctly. I uploaded my input file as a data frame with the locus ID in the first row and individual ID in the first column using Microsoft Excel. Allele counts for each individual, dependent on locus, can be found in designated row with Alleles 1 and 2 separated by ‘/’ (shown below).

(Figure 1)

I believe there are other ways you can format your Excel file, so feel free to import data as you would prefer, this method just worked best for me. This file ('pawpaw_LD.csv') can be found here.

Once your file format is correct, you should then read in the data. I prefer to read in data frames as a ‘.csv’ file; however, you can import the data in whichever way you prefer. Below is an example of uploading the data as a ‘.csv’ script:

pawpawFULL <- read.csv("pawpaw_LD.csv")

Step 2: Transform pawpawFULL format for poppr

Next, poppr wants the data to be a genind or genclone object, or any fstat, structure, genetix, genpop, or genalex formatted file. In this case, I will transform the dataset (pawpawFull) into a genind object named 'obj2.'

Two packages are required to use these commands: dplyr and adegenet.

install.packages("adegenet")
install.packages("dplyr")
library(adegenet)
library(dplyr)

# the 'ploidy' argument is set to 2 since we know it is a diploid species
# the 'sep' argument is '"/"' since alleles are separated by a slash in each column of the dataset (Figure 1)
obj2 <- df2genind(pawpawFULL, ploidy = 2, sep = "/")	

Now, your dataset pawpawFULL should be a genind object instead of a dataframe. To check:

class(obj2)

Step 3: Calculating Ia and rbarD using poppr

There are a couple of ways in poppr that you can get the Ia and rbarD values and their associated p-values:

3.1 Using the function poppr()

First, you can run the function poppr() which calculates indices of heterozygosity, evenness, and linkage. In this tutorial, we will only focus on the results of linkage, although this function automatically gives you the results for heterozygosity (e.g., H) and evenness (e.g., E.5).

install.packages("poppr")
library(poppr)

# ‘sample’ is indicative of permutations; when ‘total’ is true, indices will be calculated for the pooled populations
poppr_pawpaw <- poppr(dat=obj2, total = TRUE, sample = 999)	

The results of this function will give you a list with multiple outputs.

(Figure 2)

We are most interested in the results for the Ia value and it's p-value ('Ia' and 'p.Ia', respectively), and the rbarD value and it's p-value ('rbarD' and 'p.rd', respectively).

NOTE: poppr also gives you the option to plot the data. Just as a note, the values are the same, but the histograms are different:

hist_ia <- poppr(dat=obj2, total = TRUE, sample = 1000, index = "Ia", plot = TRUE)

# ‘sample’ is indicative of permutations; when ‘total’ is true, indices will be calculated for the pooled populations; ‘index’ indicates which result will be plotted
hist_rbarD <- poppr(dat=obj2, total = TRUE, sample = 1000, index = "rbarD", plot = TRUE)	

(Figure 3)

3.2 Using function ia()

Second, you can run the function ia() which calculates the index of association over all loci in the dataset. This command requires the data to be a genind or genclone object.

ia <- ia(gid = obj2, sample = 999, quiet = TRUE, valuereturn = TRUE) 

The results will give you a list with two outputs (shown below). The 'Index' output tells you the Ia value and its p-value ('Ia' and 'p.Ia') and the rbarD value and its p-value ('rbarD' and 'p.rd').

(Figure 4)

Calculating single-locus LD using poppr

The poppr package can also be used to find single-locus LD. To calculate single-locus LD, you will need to use the function pair.ia() which calculates the index of association in a pairwise manner among all loci.

To prepare the data for pair.ia(), you will need to follow Steps 1 and 2 from “Calculating multi-locus LD using R package poppr”, reposted and condensed below:

pawpawFULL <- read.csv("pawpaw_LD.csv")
install.packages("poppr")	# remove '#' if you do not have these packages already installed
install.packages("adegenet")	# remove '#' if you do not have these packages already installed
install.packages("dplyr")	# remove '#' if you do not have these packages already installed
library(adegenet)
library(dplyr)
library(poppr)
obj2 <- df2genind(pawpawFULL, ploidy = 2, sep = "/"

The end result should be obj2, which should now be a genind file that can be used as input for poppr. Now that your data is ready, you can run pair.ia():

res <- pair.ia(obj2, sample = 999)	# ‘sample’ indicates permutation number

The results of res show the individual Ia (p.Ia) and rbard (p.rD) values given each locus (table on left). It also produces a pairwise-heatmap of the rbarD results between all loci (heatmap on right).

(Figure 5)

If you want a pairwise-heatmap of the Ia results instead of rbarD, simply use the plot() function and indicate index as “Ia” (results not shown).

plot(res, index = "Ia")

Calculating single-locus LD using GenePop on the Web

(Figure 6)

GenePop on the Web (https://genepop.curtin.edu.au/) is the web version of GenePop, a population genetics software package that provides many population genetic test options. The web version is very easy to use, though there are some limitations to the web version, such as lacking some additional analyses provided by GenePop and only being able to process up to 50 loci and/or 100 populations. However, it is useful for teaching purposes or if you only need to run a couple of analyses and don’t want to download a new software package on your computer.

In this tutorial, I will demonstrate how to use GenePop on the Web to calculate single-locus LD. Even though I will only be showing single-locus LD calculations, GenePop on the Web has many more population genetic test options, so feel free to explore the different test options that this website provides.

Step 1: Ensure correct file format before importing data

GenePop on the Web provides their own step-by-step tutorial on how to format your datafile as input. This tutorial can be found under the “Additional Help Files” section on the homepage.

Another option to format your input file correctly is to use GenAlEx 6.514,15 in Microsoft Excel which can export your data as a GenePop file. I will not discuss how to use GenAlEx in this tutorial, but if you are familiar with GenAlEx then know this is also an option.

An example of the input data format can be found below. The file used in this tutorial, named ‘pawpaw_GenePop.txt’, can be found here.

(Figure 7)

Step 2: Calculate single-locus LD

To calculate single-locus (pairwise) LD, you should click on 'Option 2: Linkage Disequilibrium'. This will take you to a new window where you can choose suboptions and parameters.

We want to test for each pair of loci in each population (suboption 1) using either the ‘log likelihood ratio statistic’ or the ‘probability tests’. Choosing either option is dependent on your individual study system, so please refer to their help pages to determine which suboption works best for you. For this tutorial, I will use the ‘log likelihood ratio statistic’ option.

Next, you will need to choose the Markov chain parameters since we are using suboption 1. They suggest you try the default options initially and change according to your own study system. The default parameters are as follows: 1,000 dememorizations, 100 batches, and 1,000 iterations per batch.

Next, you determine how you want the output files delivered, either by email address or as a separate HTML file. For this tutorial, I will choose the HTML option.

Finally, you input your data. You can copy and paste the data into the text box provided on the bottom right of the webpage, or you can upload your file. For this tutorial, I uploaded the file.

Below is an example of all options that were chosen for this tutorial to test for single-locus LD:

(Figure 8)

Once determined that everything was correct, I clicked ‘Submit data’.

Step 3: Results from GenePop on the Web

After clicking submit, results arrived after about ten seconds.

The results of GenePop on the Web produces two tables (file with all results: GenePop_single-locusLD_Results.txt). The first table contains the population, the pair of loci that were compared (Locus#1 and Locus#2), the p-value, standard error (S.E.), and the number of switches ('A'). The second table (bottom of the resulted web page) provides the locus pair, a chi-square value, the degrees of freedom, and a p-value for each locus pair across all populations ('B').

(Figure 9)

Using either table can provide information on locus pair associations within and across all populations.

What do these results mean?

Analyzing the results of multi-locus LD

Looking at the results of these options, you should initially notice that results from both the Ia and rbarD values are exactly the same, so both options could be used to calculate multi-locus LD. The main differences between these functions come from the various arguments within each function. Therefore, determining which function to use will depend on your study system.

Another thing to notice is that both the Ia and rbarD values are above zero and are considered significant (p<0.001) when alpha is set at 0.05 (Images 2 & 4), indicating that there is an association between alleles or linkage (LD) present among nuclear markers across all populations.

Analyzing the results of single-locus LD

When calculating the results of single-locus LD, we can see that some markers appear to be in linkage with each other (Images 5 & 9). The poppr results show that the Ia and rbarD values are above zero for many paired loci (~23 out of 36 sets) and are considered significant when alpha is set to 0.05 (Image 5). GenePop on the Web found 15 paired loci to be significant (Image 9), differing from the poppr results. This difference can potentially be explained by the different methods that poppr uses compared to GenePop on the Web when calculating single-locus LD: poppr uses its default method, the Permute Alleles method (examples of other methods available in poppr: https://grunwaldlab.github.io/poppr/reference/shufflepop.html) and GenePop on the Web uses Fisher’s method. Therefore, it would be worthwhile to try both methods when measuring single-locus LD.

General

Calculating both multi-locus and single-locus LD showed that there does appear to be linkage among and across populations of papaw. Understanding the presence of linkage could then help you better understand the history and structure of your population (i.e., whether your populations are in Hardy-Weinberg Equilibrium), and how to handle the next steps of your population genetic project.

The authors did not report any measurements of LD in their paper, so we cannot compare our methods to theirs. However, we can feel comfort in the fact that the methods in this tutorial have been used in some recently published population genetic papers 1213.

Notes/additional resources:

  • To find out more information about the specific functions and their arguments, put a question mark before a function or package and run on the console. For example: ?poppr
  • Mueller 2004 2 and Slatkin 2008 1 are great papers to better understand the concept of LD, the math behind calculating LD, and the different between multi-locus and single-locus LD.
  • poppr manual: https://grunwaldlab.github.io/poppr/
  • If you have clonal organisms, this site provides a tutorial on how to do clonal corrections while using poppr: https://grunwaldlab.github.io/Population_Genetics_in_R/Linkage_disequilibrium.html

To cite poppr in publications or presentations

Please specify poppr version (in this case v.2.9.2) and the following citations:

Kamvar, Z. N., Brooks, J. C., and Grünwald, N. J. 2015. Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Frontiers in Genetics 6:208. https://doi.org/10.3389/fgene.2015.00208

Kamvar, Z. N., Tabima, J. F., and Grünwald, N. J. 2014. Poppr: An R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ 2:e281. https://doi.org/10.7717/peerj.281

To cite GenePop on the Web in publications or presentations

Please use the following citation:

Raymond, M. & Rousset, F. 1995. GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. Journal of Heredity, 86:248-249.

Rousset, F. 2008. Genepop'007: a complete reimplementation of the Genepop software for Windows and Linux. Molecular Ecolcology Resources 8: 103-106. https://doi.org/10.1111/j.1471-8286.2007.01931.x

References

[1]: Slatkin M. (2008). Linkage disequilibrium--understanding the evolutionary past and mapping the medical future. Nat Rev Genet, 9(6):477-85. https://doi.org/10.1038/nrg2361.

[2]: Mueller, J.C. (2004). Linkage disequilibrium for different scales and applications. Briefings in Bioinformatics, 5(4): 355–364. https://doi.org/10.1093/bib/5.4.355.

[3]: de Meeûs, T., and Balloux, F. (2004). Clonal reproduction and linkage disequilibrium in diploids: A simulation study. Infection, Genetics and Evolution, 4(4): 345–351. https://doi.org/10.1016/j.meegid.2004.05.002.

[4]: Kamvar, Z.N., Brooks, J.C., and Grünwald, N.J. (2015). Novel R tools for analysis of genome-wide population genetic data with emphasis on clonality. Frontiers in Genetics, 6: 208. https://doi.org/10.3389/fgene.2015.00208.

[5]: Kamvar, Z.N., Tabima, J.F., and Grünwald, N.J. (2014). Poppr: An R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ, 2:e281. https://doi.org/10.7717/peerj.281.

[6]: Raymond, M., & Rousset, F. (1995). An Exact Test for Population Differentiation. Evolution, 49(6): 1280–1283.

[7]: Rousset, F. (2008). GENEPOP’007: A complete re-implementation of the GENEPOP software for Windows and Linux. Molecular Ecology Resources, 8: 103–106. https://doi.org/10.1111/j.1471-8286.2007.01931.x.

[8]: Wyatt, G.E., Hamrick, J.L., and Trapnell, D.W. (2021). The role of anthropogenic dispersal in shaping the distribution and genetic composition of a widespread North American tree species. Ecology and Evolution, 11(16): 11515–11532. https://doi.org/10.1002/ece3.7944.

[9]: Brown, A.H.D., Feldman, M.W., and Nevo, E. (1980). Multilocus Structure of Natural Populations of Hordeum spontaneum. Genetics, 96(2): 523–536. https://doi.org/10.1093/genetics/96.2.523.

[10]: Agapow, P.-M., & Burt, A. (2001). Indices of multilocus linkage disequilibrium. Molecular Ecology Notes, 1(1–2): 101–102. https://doi.org/10.1046/j.1471-8278.2000.00014.x.

[11]: Lott, T.J., Frade, J.P., and Lockhart, S.R. (2010). Multilocus sequence type analysis reveals both clonality and recombination in populations of Candida glabrata bloodstream isolates from U.S. surveillance studies. Eukaryotic Cell, 9(4): 619–625. https://doi.org/10.1128/EC.00002-10.

[12]: Moore, E.R., Siniscalchi, C.M., and Mandel, J.R. (2021). Reevaluating Genetic Diversity and Structure of Helianthus verticillatus (Asteraceae) after the Discovery of New Populations. Castanea, 86(2): 196–213.

[13]: Bentley, K.E., & Mauricio, R. (2016). High degree of clonal reproduction and lack of large-scale geographic patterning mark the introduced range of the invasive vine, kudzu (Pueraria montana var. lobata), in North America. American Journal of Botany, 103(8): 1499–1507. https://doi.org/10.3732/ajb.1500434.

[14]: Peakall, R. & Smouse, P.E. (2006). GENALEX 6: genetic analysis in Excel. Population genetic software for teaching and research. Molecular Ecology Notes, 6: 288-295. https://doi.org/10.1111/j.1471-8286.2005.01155.x.

[15]: Peakall, R. & Smouse, P.E. (2012). GenAlEx 6.5: genetic analysis in Excel. Population genetic software for teaching and research--an update. Bioinformatics (Oxford, England), 28(19), 2537–2539. https://doi.org/10.1093/bioinformatics/bts460.

⚠️ **GitHub.com Fallback** ⚠️