Estimating Population Structure using STRUCTURE - UMEcolGenetics/PawPawPulation-Genetics GitHub Wiki

Introduction

Population structure (or genetic structure) refers to differences in allele frequencies between populations due to non-random mating between individuals. Non-random mating may be caused by inbreeding, assortative mating, and separation due to distance and geographic barriers. In these circumstances, genetic and phenotypic differences can accumulate between populations. Another consequence of population structure is reduced heterozygosity, as alleles have a higher chance to reach fixation within smaller subpopulations.

Studying population structure is important for understanding patterns of gene flow and inferring the demographic histories of populations. Once population structure for a species is known, it is possible to deduce the origins of unknown individuals and infer the presence of migration and admixture between subpopulations.

STRUCTURE is a freely available program used to investigate population structure. This includes inferring the presence of distinct populations, assigning individuals to these distinct populations, and identifying admixture between populations. It can be applied to most genetic markers, such as RFLPS, microsatellites, and SNPs. The original paper describing STRUCTURE can be found here1. In this tutorial, we will show you how to perform the entire STRUCTURE pipeline, from preprocessing input data, to visualizing evidence of population structure in your study species.

Input Data

We will perform our STRUCTURE analysis using a dataset of microsatellite genotypes obtained from multiple pawpaw (Asimina triloba) populations found across the United States. The original study that produced and analyzed this dataset can be found here2. The microsatellite data is available on Dryad (https://doi.org/10.5061/dryad.5x69p8d3g).

We have converted the original data into a format that is readable by the software package GenAlEx, which provides options for population genetic analyses within Microsoft Excel. The original paper describing the latest version of GenAlEx (v6.5) can be found here3. We have made this reformatted data file available for this tutorial (see Asimina_triloba_GenAlEx.xlsx in example files).

One of the benefits of using GenAlEx is it ability to easily export data that is correctly formatted for input into other programs, such as STRUCTURE. This can be done by pressing the Import-Export option within GenAlEx / Microsoft Excel, followed by Export, then Structure. In the Option section of the resulting pop-up window, select Use Pop, then press OK.

(Figure 1) (Figure 2)

We have provided the resulting STRUCTURE input file for this tutorial (see Asimina_triloba_STRUCTURE_input.txt in example files).

Running STRUCTURE

There are two ways to run STRUCTURE – via a graphical user interface (GUI) or via command line. The GUI is more user friendly and is easier to use when including different numbers of population clusters (K) in your analysis. Thus, the GUI will be the focus of this tutorial.

Once you have downloaded and opened STRUCTURE, the first thing you will need to do is start a new project by selecting File, then New Project. The program will then ask you to name your project and specify where to put output files, and where to find the input file. In the next pop-up window, STRUCTURE will ask for more information about the input dataset. For our example pawpaw data, there are 732 individuals, ploidy is 2, there are 9 loci, and missing data is coded as -9.

(Figure 3)

In the third and fourth pop-up windows, there are further questions about the format of the input file. For the example data, there is a row of marker names, data is stored in a single line for each individual, there is an ID for each individual, and the putative population of original for each individual is included. You should also select the option for USEPOPINFO selection flag (this is because we selected the Use Pop option when exporting from GenAlEx).

(Figure 4) (Figure 5)

The next step is to create a new parameter set by selecting Parameter Set then New. Because STRUCTURE uses a Markov Chain Monte Carlo (MCMC) approach, we need to tell STRUCTURE how many iterations of the run we want to discard as burn-in, and how may iterations we want to use after this burn-in period. For this tutorial, we will set the length of burn-in to 20,000 iterations, and the number of iterations after this burn-in to 1,000,000. Generally, using a long burn-in and then many subsequent iterations is good, but bear in mind that the greater the number of iterations, the longer STRUCTURE will take to run.

(Figure 6)

Now that the input data is loaded and the parameter set has been constructed, we are ready to run STRUCTURE. This can be done by selecting Project, then Start a Job. In the pop-up window, select the parameter set you just made. Then choose the range of clusters (K) that you want STRUCTURE to fit to your data. For the tutorial, we will run STRUCTURE through Ks 1 to 10. The next step is to choose the number of times that STRUCTURE will run for each K. To reduce runtime, we will use 5 iterations for each K, but you may prefer to choose 10 or more for your own analyses.

(Figure 7)

Click Start, and STRUCTURE will begin running the analyses. When running STRUCTURE with the options we selected (5 iterations for Ks from 1 to 10, with a burn-in of 20,000 iterations followed by another 1,000,000 iterations), it will take almost 24 hours for STRUCTURE to finish running.

Determining the true K

Once STRUCTURE has finished, we need to identify which degree of clustering (the value of K) that best fits our data. You will find a Results folder in the directory you specific for STRUCTURE output files. Compress this Results folder into a .zip file. We have provided a compressed Results file for you (see Asimina_triloba_STRUCTURE_Results.zip in example files). This .zip file will be used as input for a web-based program known as STRUCTURE HARVESTER.

STRUCTURE HARVESTER provides an easy way to visualize likelihood values across multiple values of K to find the number of genetic groups that best fit the data. The original paper describing STRUCTURE HARVESTER can be found here4.

Interpreting the results of STRUCTURE to identify the correct value for K is not straightforward. Simply looking for the model with the best log probability does not lead to a correct estimation of K. Instead, a statistic known as ΔK, which is based on the rate of change in log probability between successive values of K, can be used to more accurately infer the correct number of genetic groups. Using ΔK is known as the Evanno method, after the first author of the paper that originally described it (found here5).

STRUCTURE HARVESTER plots ΔK, allowing us to choose the K that maximizes this values. In our example, K=2 fits the data best.

(Figure 8)

Now that we have identified K=2 as the value that best fits our data, we can download the results pertaining to K=2 from the STRUCTURE HARVESTER webpage.

(Figure 9)

We have provided example outputs from STRUCTURE HARVESTER for this tutorial (see K2.indfile and K2.popfile in example files).

Cluster matching using CLUMPP

Before we can visualize the genetic clustering using the best-fitting K value, there is an additional step of processing. Between different iterations of a given K value in STRUCTURE, the clusters may be labeled differently. For example, with K=2, one cluster may be labeled as Cluster 1 in the first iteration, but then be considered as Cluster 2 in the next run.

CLUMPP is a program used to align multiple replicate analyses of the same data. It then assigns a membership coefficient to each individual based on the results of each replicate analysis. This membership coefficient can be interpreted as the probability of membership to a given cluster, or as the fraction of the genome with membership to a cluster. The original paper describing CLUMPP can be found here6.

From STRUCTURE HARVESTER, you downloaded an indfile and popfile for the best-fitting K value. The indfile contains membership coefficients for each individual in each run (remember that we ran STRUCTURE five times for each value of K, so there should be five different membership coefficients for each individual throughout this file). The popfile contains membership coefficients for each population in each run. These are input files for CLUMPP.

Once you have downloaded CLUMPP, you will find a paramfile and a CLUMPP.exe within the CLUMPP directory. The paramfile sets the parameters for running CLUMPP. Copy these files to the directory containing your indfile and popfile. Then, you will need to edit the copied paramfile so that the parameters align with your data. The main parameters in the paramfile are explained below:

  • DATATYPE (int): Type of datafile to be used. If DATATYPE = 0, CLUMPP expects to read an indfile, and if DATATYPE = 1, CLUMPP expects to read a popfile.
  • INDFILE (string): The name of the indfile (maximum length of 50 characters. The indfile is required if DATATYPE = 0.
  • POPFILE (string): The name of the popfile (maximum length of 50 characters). The popfile is required if DATATYPE = 1.
  • OUTFILE (string): The name of the output file (maximum length of 50 characters). This file contains the mean membership coefficients over all runs after the optimal permutation has been identified
  • K (int): Number of clusters. In our pawpaw example, this is 2.
  • C (int): Number of individuals/populations. This value has to refer to individuals is DATATYPE = 0, and to popualtions is DATATYPE = 1. In our pawpaw example, the number of individuals was 732 and the number of populations was 82.
  • R (int): Number of runs. In our tutorial, we used 5 runs for each K.

An example paramfile has been provided for this tutorial (see paramfile in example files).

To run CLUMPP in Unix or MacOS, the program is executed by typing:

./CLUMPP paramfile

In Windows, you can double-click the CLUMPP.exe file to run the program. The CLUMPP.exe must be in the same folder as the paramfile, indfile, and popfile. You will need to run CLUMPP twice, first for the indfile (DATATYPE = 0), and then for the popfile (DATATYPE = 1). We have provided example output files from CLUMPP for this tutorial (see K2_ind.outfile and K2_pop.outfile in example files).

Graphical display of STRUCTURE results

After running CLUMPP, we are ready to visualize our results using DISTRUCT. The original paper describing DISTRUCT can be found here7.

Once you have downloaded DISTRUCT, you will find a drawparams files and several files to run DISTRUCT on different operating systems (e.g., distructLinux1.1, distructWindows1.1.exe, etc.). The drawparams file sets the parameters for running DISTRUCT. Copt drawparams and the correct DISTRUCT file to the directory containing your CLUMPP results. Then, you will need to edit the copied drawparams file so that the parameters align with your data. The main parameters in the drawparams file are explained below:

  • INFILE POPQ (string): Name of population input file.
  • INFILE INDIVQ (string): Name of individual input file. This file is required if PRINT INDIVS=1.
  • INFILE LABEL ATOP (string): Name of input file for list of labels to be printed atop the figure. This file is optional, but is expected if PRINT LABEL ATOP=1.
  • INFILE LABEL BELOW (string): Name of input file for list of labels to be printed below the figure. This file is optional, but is expected if PRINT LABEL BELOW=1.
  • OUTFILE (string): Name of output file.
  • K (int): Number of clusters.
  • NUMPOPS (int): Number of populations.
  • NUMINDS (int): Number of individuals.
  • PRINT INDIVS (Boolean): 1 to plot the individuals, 0 to plot the populations.
  • PRINT LABEL ATOP (Boolean): 1 to print labels atop the figure. If INFILE LABEL ATOP cannot be found, the default is to print the population codes.
  • PRINT LABEL BELOW (Boolean): 1 to print labels below the figure. If INFILE LABEL BELOW cannot be found, the default is to print the population codes.
  • PRINT SEP (Boolean): 1 to print black vertical lines to separate the results for different populations.

An example drawparams file is available for this tutorial (see drawparams in example files).

In this tutorial, we are interested in printing the population names above the figure. A file containing an ordered list of population names is available for this tutorial (see K2_pop_names.txt in example files).

There are additional parameters that can be altered to change the appearance of the resulting figure. The best values for these parameters will differ between analyses as the number of individuals and populations will differ.

To run DISTRUCT in Unix or MacOS, the program is executed by typing

./distruct

In Windows, you can double-click on distructWindows1.1.exe to run DISTRUCT. The output of DISTRUCT will be a PostScript file containing the final figure showing the results of the STRUCTURE analysis. The PostScript output file may not be recognized by your computer, but it is readable by some programs such as Adobe Illustrator or Photoshop. Alternatively, there are web-based tools that can convert it to other formats. You can use these tools to convert it to an .svg file, which is then readable by vector image editing software like Adobe Illustrator or Inkscape (which is free).

(Figure 10)

Combining STRUCTURE results with geographic data

For many researchers, it is not enough to produce the final STRUCTURE bar chart. It is more useful to plot this data on a map to show the geographical patterns of popualtion structure. Below, we provide a method of plotting the STRUCTURE results for the example pawpaw data onto a partial map of the United States. This method requires the program R.

To plot your results on a map, you will need geographic coordinates for each population. You will also need to extract the membership coefficients for each population from your CLUMPP output. Example files are included for this tutorial (see Asimina_triloba_coords.csv and Asimina_triloba_admix_proportions.csv in example files).

library(maps); library(usmap); library(plotrix); library(ggplot2); library(scatterpie); library(rgdal)

gps<-read.csv("Asimina_triloba_coords.csv",header=TRUE) # read coordinate file

gps2 <- gps[,c(3,2,1)] # rearrange latitude and longitude for usmap transformation

gps_transformed<-usmap_transform(gps2) # transform coordinates file using usmap package

admix<-read.csv("Asimina_triloba_admix_proportions.csv",header=TRUE) # read admixture q-matrix file

gps_admix_combined <- merge(gps_transformed,admix, by="pop") # combine coordinates and admixture file

# plot points on map, using only US states with sampled populations
plot_usmap(include = c("FL", "LA", "TN", "NC", "KY", "VA", "OH", "MD", "PA", "IN", "IL", "MO", "MS", "AR", "AL", "GA", "SC", "KS", "TX", "OK")) +
  geom_scatterpie(aes(x=lon.1, y=lat.1, group=pop), data=gps_admix_combined,
                cols=c("K1","K2")) + coord_equal() + scale_fill_manual(values=c("#FF9C52", "#0099E6"))

(Figure 11)

We can infer from this result that pawpaw populations in the United States form two distinct genetic clusters. One cluster (blue) has a southeastern distribution, and the other (orange) occupies the remainder of the range.

References

[1]: Pritchard, J.K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2): 945-959. https://doi.org/10.1093/genetics/155.2.945.

[2]: Wyatt, G.E., Hamrick, J.L., and Trapnell, D.W. (2021). The role of anthropogenic dispersal in shaping the distribution and genetic composition of a widespread North American tree species. Ecology and Evolution, 11(16): 11515-11532. https://doi.org/10.1002/ece3.7944.

[3]: Peakall, R. & Smouse, P.E. (2012). GenAlEx 6.5: genetic analysis in Excel. Population genetic software for teaching and research--an update. Bioinformatics (Oxford, England), 28(19), 2537–2539. https://doi.org/10.1093/bioinformatics/bts460.

[4]: Earl, D.A., & vonHoldt, B.M. (2012). STRUCTURE HARVESTER: a website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources, 4: 359-361. https://doi.org/10.1007/s12686-011-9548-7.

[5]: Evanno, G., Regnaut, S., and Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Molecular Ecology, 14(8): 2611-2620. https://doi.org/10.1111/j.1365-294X.2005.02553.x.

[6]: Jakobsson, M., & Rosenberg, N.A. (2007). CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics, 14(8): 2611-2620. https://doi.org/10.1093/bioinformatics/btm233.

[7]: Rosenberg, N.A. (2004). DISTRUCT: a program for the graphical display of population structure. Molecular Ecology Notes, 4(1): 137-138. https://doi.org/10.1046/j.1471-8286.2003.00566.x.

⚠️ **GitHub.com Fallback** ⚠️