population_structure - BGIGPD/BestPractices4Pathogenomics GitHub Wiki

Welcome to the population_structure wiki

Workshop: Population structure

Backgroud information

The aim is to explore the population structure of parasites/vectors across various geographic regions through genomic analysis. We will explore different ways of assessing population structure (differentiation). You can think of population structure as identifying clusters or groups of more closely related individuals among these groups. Examining population structure can give us a great deal of insight into the history and origin of populations. Populations can be studied to determine if they are structured by using, for example, principal components analysis (PCA)， clustering tools (e.g. admixture) or phylogenetic trees，population differentiation summary statistics (e.g. Fst). These techniques answer questions about how genetic diversity is organized within and across populations, providing insights into evolutionary relationships and potential adaptation to local environmental pressures.

PCA

First of all we will investigate population structure using principal components analysis.

• A dimensionality reduction technique widely used in data analysis, machine learning, and statistics.

• It’s a powerful tool for visualizing and simplifying high-dimensional data while preserving essential information.

• PCA achieves this by transforming the original data into a new coordinate system where the axes are called principal components，with each axis independent of the next (i.e. there should be no correlation between them).

In the context of genetic data, PCA summarizes the major axes of variation in allele frequencies and then produces the coordinates of individuals along these axes, it's model-free method and is typically simple to apply and relatively easy to interpret.

Admixture

An admixture model reveal how genetic material moves between populations, often correlating with geographic proximity. ADMIXTURE is a clustering software with the aim to infer populations and individual ancestries. Admixture is a very useful and popular tool to analyse SNP data. It performs an unsupervised clustering of large numbers of samples and allows each individual to be a mixture of clusters.

Genetic admixture occurs when previously isolated populations interbreed resulting in a population that is descended from multiple sources

Fst

Some Genes exhibited high differentiation between populations, indicative of adaptive changes. Detected through fixation index (FST) values and selection metrics across populations, highlighting the role of environmental pressures in shaping genetic diversity. Fst can be interpreted as measuring how much closer two individuals from the same subpopulation are, compared to the total population. Fst ranges from 0 to 1. Populations that share many SNPs and have similar allele frequencies have low Fst (close to 0). Populations with many SNP differences between them will have high Fst (close to 1).

Objectives

Now that we have a fully filtered VCF, we can start do some cool analyses with it.

PCA

Learn how to implement PCA of population structure from a VCF file by PLINK

Learn how to write R code to visualize PCA output

Admixture

Learn how to implement admixture of population structure from a VCF file by admixture

Learn how to write R code to visualize Admixture output

Fst

How to estimate Fst from a VCF file by vcftools

Softwares and Databases

plink2

admixture

vcftools

bcftools

Rstudio

Steps

PCA

Install plink2

conda install bioconda::plink2

Running plink2

（1）Preparing files in plink The first thing we need to do is made a bed file and the associated plink format files using the following command in plink

plink2 --vcf random_50.vcf.gz --make-bed --allow-extra-chr --out A_alb

## the demo file **random_50.vcf.gz** is in /home/zhaohailong/population_structure2

We will be using the demo vcf file I have prepared, which contains the SNP calls from 50 individuals. The make bed command means that the vcf file information will be used to create a bed file and plink specific files it needs to perform other functions.

You should now have files with a suffix bed,fam,ped and bim.

A_alb.bed - - this is a binary file necessary for admixture analysis. It is essentially the genotypes of the pruned dataset recoded as 1s and 0s.
A_alb.bim - a map file (i.e. information file) of the variants contained in the bed file.
A_alb.fam - a map file for the individuals contained in the bed file.

（2）Generating eigenvalues

Using the files plink generated, we can now generate the eigenvalues

plink2 --bfile A_alb --pca --allow-extra-chr --out A_alb
## –bfile means the prefix of needed files generated by (1) step
## –pca command means generate eigenvalues

civ_eth.eigenval - the eigenvalues from our analysis
civ_eth.eigenvec- the eigenvectors from our analysis

The eigenval tells you in order of each PC ( so PC1,PC2….) the percentage each eigenvalue contributes to the variance. The eigenvec contains the coordinates for each sample. PC1 being the most explanatory PC of the data.

For instance if PC1 explains 61%, this means that all of the other PCs most account for 39% of the variance observed in the data. PC2 will have the next largest contribution to the genetic variance, for example PC2 may contribute 20%. The overall variance should add up to 100%.

Tranfer A_alb.eigenvec on server to local computer

See previous course (https://github.com/BGIGPD/BestPractices4Pathogenomics/wiki/VirusGenesAnnotation) for guide if you forget it.

Plotting the PCA output

Next we turn to R to plot the analysis we have produced!

library(ggplot2)
eigenvec <- read.table("/Users/zhaohailong/Desktop/A_alb.eigenvec", header = F)
colnames(eigenvec) <- c("FID", "IID", paste0("PC", 1:(ncol(eigenvec) - 2)))
# Assign FID based on conditions, randomly assign each sample to some groups (in your future data, this group is likely to be geographic locations)
eigenvec$FID <- ifelse(eigenvec$IID == "M19SYYW1343", "A",
                       ifelse(eigenvec$IID == "21HNDZ8YW1003", "B", "C"))
eigenval <- scan("/Users/zhaohailong/Desktop/A_alb.eigenval")
variance_explained <- eigenval / sum(eigenval) * 100
variance_labels <- paste0("PC", 1:2, " (", round(variance_explained[1:2], 2), "%)")
# Plot PC1 vs PC2
ggplot(eigenvec, aes(x = PC1, y = PC2)) +
  geom_point(aes(color = as.factor(FID))) +  # Optional: color points by FID
  labs(x = variance_labels[1], y = variance_labels[2]) +
  ggtitle("PCA Plot of mosquito/parasite Samples") +
  theme_minimal() +
  theme(legend.title = element_blank())

Admixture

Install

conda install bioconda::admixture

Admixture analysis

（1）Generating the input file

We have already generated the input file in plink format in previous PCA training. But admixture does not accept chromosome names that are not human chromosomes. We will thus just exchange the first column by 0.

awk '{$1="0";print $0}' A_alb.bim > A_alb.bim.tmp
head A_alb.bim.tmp
mv A_alb.bim.tmp A_alb.bim

（2）Running Admixture

The basic syntax is dead-easy:

admixture --cv $INPUT.bed $K >log2.out

Here $K is a number indicating the number of clusters you want to infer.

admixture --cv A_alb.bed 3 >admixture.log3.out

It produces two files with endings .$K.Q and .$K.P

.Q which contains cluster assignments for each individual

.P which contains for each SNP the population allele frequencies.

（3）How do we know what number of clusters K to use? We should run several numbers of clusters, e.g. all numbers between K=3 and K=12 for a start. Then, admixture has a built-in method to evaluate a “cross-validation error” for each K. Computing this “cross-validation error” requires simply to give the flag --cv right after the call to admixture. Let’s now run it in a for loop with K=3 to K=5 and direct the output into log files

for i in {3..5}
 do
 admixture --cv $FILE.bed $i > log${i}.out
done

If things run successfully, you should now have a .Q and a .P file in your output directory for every K that you ran.

To identify the best value of k clusters which is the value with lowest cross-validation error, we need to collect the cv errors. Below are three different ways to extract the number of K and the CV error for each corresponding K. There are many ways to achieve the same thing in bioinformatics!

awk '/CV/ {print $3,$4}' *out | cut -c 4,7-20 > $FILE.cv.error
grep "CV" *out | awk '{print $3,$4}' | sed -e 's/(//;s/)//;s/://;s/K=//'  > $FILE.cv.error
grep "CV" *out | awk '{print $3,$4}' | cut -c 4,7-20 > $FILE.cv.error

Visualize output in R

The demo vcf file we used only have 50 samples and the fst are too small, in order to visualize population structure more clearly , here we'll use another admixture output file which with more samples.

Here is the code for making the typical ADMIXTURE-barplot for K=3:

# Load data
Q_data <- read.table("/Users/zhaohailong/Desktop/admixture.Aedes_albopictus_output.3.Q", header = FALSE)
# Rename columns to represent clusters
colnames(Q_data) <- c("Cluster1", "Cluster2", "Cluster3")
# Add an individual identifier
Q_data$Individual <- 1:nrow(Q_data)
# Reshape data for ggplot2
Q_data_long <- melt(Q_data, id.vars = "Individual", variable.name = "Cluster", value.name = "Proportion")
# Create bar plot，by default, the bar plot in the code above will display individuals in the order they appear in the .Q file. 
ggplot(Q_data_long, aes(x = factor(Individual), y = Proportion, fill = Cluster)) +
  geom_bar(stat = "identity",width = 1) +
  labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot (K=3)") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),  # Remove x-axis labels for clarity
        axis.ticks.x = element_blank())

Perform hierarchical clustering based on ancestry proportions

distance_matrix <- dist(Q_data[, 1:3])  # Only use Cluster columns for clustering
cluster_order <- hclust(distance_matrix)$order
Q_data <- Q_data[cluster_order, ]
Q_data$Individual <- factor(Q_data$Individual, levels = Q_data$Individual)
Q_data_long <- melt(Q_data, id.vars = "Individual", variable.name = "Cluster", value.name = "Proportion")
# Plot with clustering-based order
ggplot(Q_data_long, aes(x = Individual, y = Proportion, fill = Cluster)) +
  geom_bar(stat = "identity", width = 1) +
  labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot (Clustered Order)") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

Randomly assign each individual to one of five regions

regions <- c("Region1", "Region2", "Region3", "Region4", "Region5")
Q_data$Region <- sample(regions, nrow(Q_data), replace = TRUE)
# Reorder data by Region for grouping in the plot
Q_data <- Q_data[order(Q_data$Region), ]
Q_data$Individual <- factor(Q_data$Individual, levels = Q_data$Individual)
# Reshape data for plotting
Q_data_long <- melt(Q_data, id.vars = c("Individual", "Region"), variable.name = "Cluster", value.name = "Proportion")
# Plot the barplot grouped by Region
ggplot(Q_data_long, aes(x = Individual, y = Proportion, fill = Cluster)) +
  geom_bar(stat = "identity", position = "stack", width = 1) +  # Set width to 1
  facet_grid(~ Region, scales = "free_x", space = "free_x") +
  labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot by Region") +
  theme_minimal() +
  theme(axis.text.x = element_blank(), 
        axis.ticks.x = element_blank(),
        strip.text = element_text(size = 10))

Change different color

library(RColorBrewer)
ggplot(Q_data_long, aes(x = Individual, y = Proportion, fill = Cluster)) +
  geom_bar(stat = "identity",position="stack", width = 1, alpha = 0.8) + #alpha is ratio of transparency
  facet_grid(~ Region, scales = "free_x", space = "free_x") +
  scale_fill_brewer(palette = "Set3") +  # use the Set 3 color schemes of RColorBrewer
  #OR# scale_fill_manual(values = c("#FF6B6B", "#4ECDC4", "#556270")) +  # assign color by yourself
  labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot (K=3)") +
  theme_minimal() +
  theme(axis.text.x = element_blank(),
        axis.ticks.x = element_blank())

Fst

Assessing genetic diversity almost always starts with an analysis of a parameter such as FST.

Install vcftools

conda install bcftools
# but when you encounter this kind of error
# bcftools: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory
Solution : Use Conda to Install GSL and bcftools
You can install both bcftools and its dependencies in an isolated environment.
conda create -n bcf -c bioconda -c conda-forge bcftools
conda activate bcf

Running vcftools

Fst is a statistical measure that tells us how different are two populations at the genetic level. It can be calculated for a single site, over a window (average across sites in the window), or across the entire genome.

Firstly, we randomly split the samples from a VCF file into two groups for Fst calculation.

(1) Extract the sample list from the VCF file and saves them to all_samples.txt
bcftools query -l random_50.vcf.gz > all_samples.txt

(2) Shuffle the sample list randomly
shuf all_samples.txt -o shuffled_samples.txt
#randomly shuffles the sample list from all_samples.txt and outputs the result to shuffled_samples.txt. This step ensures the samples are randomly ordered before splitting

(3) Split the shuffled sample list into two files
# The `split` command will divide the shuffled file into two parts (approximately equal)
split -n l/2 shuffled_samples.txt population_list

Now, you can use these two files (population_lista and population_listb) as input for vcftools to calculate Fst values between the two randomly generated groups.

(4) Fst Calculation

vcftools --gzvcf random_50.vcf.gz --weir-fst-pop population_listaa --weir-fst-pop population_listab --out popa_vs_popb_FST

## --weir-fst-pop specify a file that contain lists of individuals (one per line) that are members of a population. 
## The function will work with multiple populations if multiple --weir-fst-pop arguments are used.

Or we can calculate the fst in each genomic window

vcftools --gzvcf random_50.vcf.gz --weir-fst-pop population_listaa --weir-fst-pop population_listab --fst-window-size 10000 --out popa_vs_popb_FST_window

## --fst-window-size indicate the size of the window in base pairs.

The first line of the files says that: In this region there are 13 variants and the weighted avearge Fst is 0.00029

WEIR_AND_COCKERHAM_FST: Site-specific Fst using Weir and Cockerham’s method.

BIN_START and BIN_END: Define the start and end of each genomic window.

N_VARIANTS: Number of variants in each window.

WEIGHTED_FST: Weighted Fst value for each window.

MEAN_FST: Mean Fst value across all variants in each window.

Theoretically, Fst values range from 0 to 1, where 0 indicates no genetic differentiation between populations, and 1 indicates complete differentiation. However, in practice, it’s possible to observe negative Fst or -nan values due to several factors

For negative Fst values, you can often treat them as 0 in downstream analysis, indicating no significant differentiation.

For -nan values, these typically represent insufficient data or fixed sites with no polymorphism, so they can be ignored or filtered out.

(5) Select your interested site or region with high Fst

Look up for highly differentiated regions

We are interested in loci with high value of Fst because they can be indicative of population-specific genetics features, as the result of genetic drift of natural selection. Filter the files with the Fsts for WEIGHTED_FST and/or MEAN_FST greather than a threshold you like. We can use awk, a powerful programming language and command-line tool used in Unix and Unix-like systems for processing and analyzing text files, particularly useful for manipulating data, generating reports, and performing complex pattern matching.

For example, filter for WEIGHTED_FST> 0.2 or MEAN_FST>0.1:

sort -k5 -r popa_vs_popb_FST_window.windowed.weir.fst |less -S
awk 'NR==1 || $5 > 0.2' popa_vs_popb_FST_window.windowed.weir.fst > highWeightedFst.list
## NR==1 retain the header   
## $5 >0.2 filters the fifth column, i.e. the WEIGHTED_FST, for values greather than 0.2 
awk 'NR==1 || $6 > 0.1' popa_vs_popb_FST_window.windowed.weir.fst > highMeanFst.list
## $6 >0.1 filters the sixth column, i.e. the MEAN_FST, for values greather than 0.1