population_structure - BGIGPD/BestPractices4Pathogenomics GitHub Wiki
Welcome to the population_structure wiki
Workshop: Population structure
Background information
The aim is to explore the population structure of parasites/vectors across various geographic regions through genomic analysis, using several ways of assessing population structure (differentiation). You can think of population structure as the presence of clusters or groups of more closely related individuals within a larger sample. Examining population structure can give us a great deal of insight into the history and origin of populations. Whether populations are structured can be assessed with, for example, principal components analysis (PCA), clustering tools (e.g. ADMIXTURE), phylogenetic trees, or population differentiation summary statistics (e.g. Fst). These techniques answer questions about how genetic diversity is organized within and across populations, providing insights into evolutionary relationships and potential adaptation to local environmental pressures.
PCA
First of all we will investigate population structure using principal components analysis.
• A dimensionality reduction technique widely used in data analysis, machine learning, and statistics.
• It’s a powerful tool for visualizing and simplifying high-dimensional data while preserving essential information.
• PCA achieves this by transforming the original data into a new coordinate system whose axes are called principal components, with each axis uncorrelated with the others.
In the context of genetic data, PCA summarizes the major axes of variation in allele frequencies and then produces the coordinates of individuals along these axes. It is a model-free method that is typically simple to apply and relatively easy to interpret.
Admixture
An admixture model reveals how genetic material has moved between populations, which often correlates with geographic proximity. ADMIXTURE is clustering software that aims to infer populations and individual ancestries, and it is a popular tool for analysing SNP data. It performs unsupervised clustering of large numbers of samples and allows each individual to be a mixture of clusters.
Genetic admixture occurs when previously isolated populations interbreed, resulting in a population that is descended from multiple sources.
Fst
Some genes show high differentiation between populations, which can indicate adaptive change. Such genes are detected through fixation index (Fst) values and selection metrics across populations, highlighting the role of environmental pressures in shaping genetic diversity. Fst can be interpreted as measuring how much more closely related two individuals from the same subpopulation are than two individuals drawn from the total population. Fst ranges from 0 to 1. Populations with similar allele frequencies at their shared SNPs have low Fst (close to 0). Populations with large allele frequency differences between them have high Fst (close to 1).
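As a rough illustration of the 0-to-1 scale, the sketch below computes Fst for a single biallelic SNP in two equally sized populations, using the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T. This is not the Weir and Cockerham estimator that vcftools reports later in this workshop; the function name `simple_fst` and the allele frequencies are made up for the example.

```shell
# simple_fst P1 P2: Fst for one biallelic SNP, given the frequency of the same
# allele in two equally sized populations (hypothetical helper, illustration only).
simple_fst() {
  awk -v p1="$1" -v p2="$2" 'BEGIN {
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2   # mean within-population heterozygosity
    pbar = (p1 + p2) / 2                               # allele frequency in the pooled sample
    ht = 2 * pbar * (1 - pbar)                         # expected heterozygosity of the pooled sample
    if (ht == 0) { print "nan"; exit }                 # site fixed everywhere: Fst is undefined
    printf "%.2f\n", (ht - hs) / ht
  }'
}

simple_fst 0.5 0.5   # identical allele frequencies
simple_fst 0.2 0.8   # strongly different allele frequencies
simple_fst 0 1       # populations fixed for different alleles
```

The three calls print 0.00, 0.36 and 1.00. A site that is fixed for the same allele in both populations has no variation to partition, which is why the helper reports nan in that case.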
Objectives
Now that we have a fully filtered VCF, we can start doing some cool analyses with it.
PCA
Learn how to run a PCA of population structure from a VCF file with PLINK
Learn how to write R code to visualize PCA output
Admixture
Learn how to run an admixture analysis of population structure from a VCF file with ADMIXTURE
Learn how to write R code to visualize ADMIXTURE output
Fst
Learn how to estimate Fst from a VCF file with vcftools
Software and Databases
plink2
admixture
vcftools
bcftools
Rstudio
Steps
PCA
Install plink2
conda install bioconda::plink2
Running plink2
(1) Preparing files in plink. The first thing we need to do is make a bed file and the associated plink-format files using the following command:
plink2 --vcf random_50.vcf.gz --make-bed --allow-extra-chr --out A_alb
## the demo file **random_50.vcf.gz** is in /home/zhaohailong/population_structure2
We will be using the demo VCF file I have prepared, which contains the SNP calls from 50 individuals. The --make-bed option converts the VCF information into a bed file and the companion files plink needs to perform other functions.
You should now have files with the suffixes bed, bim and fam (plus a log file).
- A_alb.bed - a binary file necessary for admixture analysis. It holds the genotypes of the dataset in a compact binary encoding.
- A_alb.bim - a map file (i.e. information file) of the variants contained in the bed file.
- A_alb.fam - a map file for the individuals contained in the bed file.
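A quick sanity check after the conversion is to count the lines of the .bim and .fam files, since they hold one line per variant and one line per sample respectively. The sketch below does this on a tiny made-up dataset (the `demo` prefix, variant IDs and sample names are hypothetical); on the real data you would point the same commands at A_alb.bim and A_alb.fam.

```shell
workdir=$(mktemp -d)

# Made-up .bim: chromosome, variant ID, genetic distance, position, allele 1, allele 2
printf 'scaffold_1\tsnp1\t0\t1042\tA\tG\n' >  "$workdir/demo.bim"
printf 'scaffold_1\tsnp2\t0\t5310\tC\tT\n' >> "$workdir/demo.bim"
printf 'scaffold_2\tsnp3\t0\t877\tG\tA\n'  >> "$workdir/demo.bim"

# Made-up .fam: family ID, individual ID, father, mother, sex, phenotype
printf '0\tsampleA\t0\t0\t0\t-9\n' >  "$workdir/demo.fam"
printf '0\tsampleB\t0\t0\t0\t-9\n' >> "$workdir/demo.fam"

# One line per variant in .bim, one line per sample in .fam
n_variants=$(wc -l < "$workdir/demo.bim" | tr -d ' ')
n_samples=$(wc -l < "$workdir/demo.fam" | tr -d ' ')
# Distinct values in the first .bim column = chromosomes or contigs represented
n_chroms=$(cut -f1 "$workdir/demo.bim" | sort -u | wc -l | tr -d ' ')

echo "variants: $n_variants, samples: $n_samples, chromosomes/contigs: $n_chroms"
```

For the workshop data the sample count should be 50; a different number would suggest the conversion did not read the VCF as expected.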
(2)Generating eigenvalues
Using the files plink generated, we can now generate the eigenvalues
plink2 --bfile A_alb --pca --allow-extra-chr --out A_alb
## --bfile gives the prefix of the files generated in step (1)
## --pca computes the eigenvalues and eigenvectors
- A_alb.eigenval - the eigenvalues from our analysis
- A_alb.eigenvec - the eigenvectors from our analysis
The eigenval file lists one eigenvalue per PC, in order (PC1, PC2, ...). Dividing each eigenvalue by the sum of the eigenvalues gives the percentage of the variance that PC explains. The eigenvec file contains the coordinates of each sample along each PC. PC1 is the PC that explains the most variance in the data.
For instance, if PC1 explains 61%, all of the other PCs must together account for the remaining 39% of the variance observed in the data. PC2 has the next largest contribution to the genetic variance, for example 20%. The percentages add up to 100%. Note that plink reports only the top 10 PCs by default, so percentages computed this way are relative to the reported PCs rather than to the total genetic variance.
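The percentage calculation can be checked on the command line before moving to R. The sketch below uses made-up eigenvalues in the one-value-per-line layout of an .eigenval file (the `demo.eigenval` name and the numbers are assumptions chosen to reproduce the 61% and 20% example); awk reads the file twice, first to sum the eigenvalues and then to print each PC's share.

```shell
workdir=$(mktemp -d)

# Made-up eigenvalues, one per line, largest first
printf '%s\n' 6.1 2.0 1.0 0.9 > "$workdir/demo.eigenval"

# First pass (NR==FNR): accumulate the total.
# Second pass: print each eigenvalue as a percentage of that total.
awk 'NR == FNR { total += $1; next }
     { printf "PC%d\t%.1f\n", FNR, 100 * $1 / total }' \
    "$workdir/demo.eigenval" "$workdir/demo.eigenval" > "$workdir/variance.tsv"

cat "$workdir/variance.tsv"
```

With these numbers PC1 comes out at 61.0 and PC2 at 20.0, and the four percentages sum to 100.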
Transfer A_alb.eigenvec and A_alb.eigenval from the server to your local computer
See the previous course (https://github.com/BGIGPD/BestPractices4Pathogenomics/wiki/VirusGenesAnnotation) for a guide if you have forgotten how.
Plotting the PCA output
Next we turn to R to plot the analysis we have produced!
library(ggplot2)
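# plink2 starts the header line with "#", which read.table skips as a comment, so no header is read
eigenvec <- read.table("/Users/zhaohailong/Desktop/A_alb.eigenvec", header = FALSE)
# Assumes the first two columns are FID and IID; adjust the names if your file has no FID column
colnames(eigenvec) <- c("FID", "IID", paste0("PC", 1:(ncol(eigenvec) - 2)))
# Overwrite FID with a group label: two samples are labelled by hand and the rest go to group C (in your own data, the group is likely to be the geographic location)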
eigenvec$FID <- ifelse(eigenvec$IID == "M19SYYW1343", "A",
ifelse(eigenvec$IID == "21HNDZ8YW1003", "B", "C"))
eigenval <- scan("/Users/zhaohailong/Desktop/A_alb.eigenval")
variance_explained <- eigenval / sum(eigenval) * 100
variance_labels <- paste0("PC", 1:2, " (", round(variance_explained[1:2], 2), "%)")
# Plot PC1 vs PC2
ggplot(eigenvec, aes(x = PC1, y = PC2)) +
geom_point(aes(color = as.factor(FID))) + # Optional: color points by FID
labs(x = variance_labels[1], y = variance_labels[2]) +
ggtitle("PCA Plot of mosquito/parasite Samples") +
theme_minimal() +
theme(legend.title = element_blank())
Admixture
Install
conda install bioconda::admixture
Admixture analysis
(1)Generating the input file
We already generated the input files in plink format in the PCA section. However, admixture only accepts integer chromosome codes, so non-numeric chromosome or contig names cause an error. We will therefore simply replace the first column of the bim file with 0.
awk '{$1="0";print $0}' A_alb.bim > A_alb.bim.tmp
head A_alb.bim.tmp
mv A_alb.bim.tmp A_alb.bim
(2)Running Admixture
The basic syntax is dead-easy:
admixture --cv $INPUT.bed $K > log${K}.out
Here $K is a number indicating the number of clusters you want to infer.
admixture --cv A_alb.bed 3 >admixture.log3.out
It produces two files with endings .$K.Q and .$K.P:
.Q contains the ancestry proportions (cluster memberships) of each individual.
.P contains the allele frequencies of each SNP in each inferred population.
(3) How do we know what number of clusters K to use? We should run several values of K, e.g. all values between K=3 and K=12 for a start. admixture has a built-in method to evaluate a “cross-validation error” for each K; computing it simply requires the --cv flag right after the call to admixture. Let’s now run it in a for loop with K=3 to K=5 and direct the output into log files.
FILE=A_alb   # prefix of the plink files generated in the PCA section
for i in {3..5}
do
admixture --cv $FILE.bed $i > log${i}.out
done
If things run successfully, you should now have a .Q and a .P file in your output directory for every K that you ran.
To identify the best K, i.e. the one with the lowest cross-validation error, we need to collect the CV errors. Below are three different ways to extract K and the corresponding CV error. There are many ways to achieve the same thing in bioinformatics!
## the two cut-based versions pick fixed character positions, so they only work for single-digit K
## *out matches every file ending in "out" in the current directory, including logs from earlier test runs
awk '/CV/ {print $3,$4}' *out | cut -c 4,7-20 > $FILE.cv.error
grep "CV" *out | awk '{print $3,$4}' | sed -e 's/(//;s/)//;s/://;s/K=//' > $FILE.cv.error
grep "CV" *out | awk '{print $3,$4}' | cut -c 4,7-20 > $FILE.cv.error
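Once the CV errors are collected, the remaining step is to pick the K with the lowest error. The sketch below shows one way to do that on made-up log files that imitate the `CV error (K=3): 0.61230` line ADMIXTURE writes; the error values, the `demo.cv.error` name and the inclusion of K=10 (to show that the sed-based extraction also handles two-digit K) are assumptions for illustration.

```shell
workdir=$(mktemp -d)

# Made-up log files, each holding the CV line of one run
echo "CV error (K=3): 0.61230"  > "$workdir/log3.out"
echo "CV error (K=4): 0.58410"  > "$workdir/log4.out"
echo "CV error (K=10): 0.64020" > "$workdir/log10.out"

# Extract "K error" pairs (grep -h drops the file name prefix), then order them by K
grep -h "CV" "$workdir"/log*.out \
  | awk '{print $3, $4}' \
  | sed -e 's/(K=//; s/)://' \
  | LC_ALL=C sort -t ' ' -k1,1n > "$workdir/demo.cv.error"

cat "$workdir/demo.cv.error"

# The best K is the one with the smallest value in column 2
best_k=$(LC_ALL=C sort -t ' ' -k2,2n "$workdir/demo.cv.error" | head -n 1 | cut -d ' ' -f 1)
echo "lowest CV error at K=$best_k"
```

With these made-up values the lowest error is at K=4. On real data it is worth plotting the errors as well, because several neighbouring K values often have very similar CV errors.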
Visualize output in R
The demo VCF file we used has only 50 samples and its Fst values are too small to show clear structure. To visualize population structure more clearly, we will use another admixture output file here, one with more samples.
Here is the code for making the typical ADMIXTURE-barplot for K=3:
# Load packages: ggplot2 for plotting, reshape2 for melt()
library(ggplot2)
library(reshape2)
# Load data
Q_data <- read.table("/Users/zhaohailong/Desktop/admixture.Aedes_albopictus_output.3.Q", header = FALSE)
# Rename columns to represent clusters
colnames(Q_data) <- c("Cluster1", "Cluster2", "Cluster3")
# Add an individual identifier
Q_data$Individual <- 1:nrow(Q_data)
# Reshape data for ggplot2
Q_data_long <- melt(Q_data, id.vars = "Individual", variable.name = "Cluster", value.name = "Proportion")
# Create bar plot. By default, individuals are displayed in the order they appear in the .Q file.
ggplot(Q_data_long, aes(x = factor(Individual), y = Proportion, fill = Cluster)) +
geom_bar(stat = "identity",width = 1) +
labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot (K=3)") +
theme_minimal() +
theme(axis.text.x = element_blank(), # Remove x-axis labels for clarity
axis.ticks.x = element_blank())
Perform hierarchical clustering based on ancestry proportions
distance_matrix <- dist(Q_data[, 1:3]) # Only use Cluster columns for clustering
cluster_order <- hclust(distance_matrix)$order
Q_data <- Q_data[cluster_order, ]
Q_data$Individual <- factor(Q_data$Individual, levels = Q_data$Individual)
Q_data_long <- melt(Q_data, id.vars = "Individual", variable.name = "Cluster", value.name = "Proportion")
# Plot with clustering-based order
ggplot(Q_data_long, aes(x = Individual, y = Proportion, fill = Cluster)) +
geom_bar(stat = "identity", width = 1) +
labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot (Clustered Order)") +
theme_minimal() +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
Randomly assign each individual to one of five regions
regions <- c("Region1", "Region2", "Region3", "Region4", "Region5")
Q_data$Region <- sample(regions, nrow(Q_data), replace = TRUE)
# Reorder data by Region for grouping in the plot
Q_data <- Q_data[order(Q_data$Region), ]
Q_data$Individual <- factor(Q_data$Individual, levels = Q_data$Individual)
# Reshape data for plotting
Q_data_long <- melt(Q_data, id.vars = c("Individual", "Region"), variable.name = "Cluster", value.name = "Proportion")
# Plot the barplot grouped by Region
ggplot(Q_data_long, aes(x = Individual, y = Proportion, fill = Cluster)) +
geom_bar(stat = "identity", position = "stack", width = 1) + # Set width to 1
facet_grid(~ Region, scales = "free_x", space = "free_x") +
labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot by Region") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.text = element_text(size = 10))
Change the colour scheme
library(RColorBrewer)
ggplot(Q_data_long, aes(x = Individual, y = Proportion, fill = Cluster)) +
geom_bar(stat = "identity", position = "stack", width = 1, alpha = 0.8) + # alpha sets the opacity (1 = fully opaque)
facet_grid(~ Region, scales = "free_x", space = "free_x") +
scale_fill_brewer(palette = "Set3") + # use the Set3 colour scheme of RColorBrewer
#OR# scale_fill_manual(values = c("#FF6B6B", "#4ECDC4", "#556270")) + # assign color by yourself
labs(x = "Individuals", y = "Ancestry Proportion", title = "Admixture Plot (K=3)") +
theme_minimal() +
theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank())
Fst
Assessing genetic diversity almost always starts with an analysis of a parameter such as FST.
Install vcftools and bcftools
conda install -c bioconda -c conda-forge vcftools bcftools
# if bcftools then fails with an error like this:
# bcftools: error while loading shared libraries: libgsl.so.25: cannot open shared object file: No such file or directory
Solution: use conda to install the tools together with their dependencies (including GSL) in an isolated environment.
conda create -n bcf -c bioconda -c conda-forge bcftools vcftools
conda activate bcf
Running vcftools
Fst is a statistical measure that tells us how different two populations are at the genetic level. It can be calculated for a single site, over a window (averaged across the sites in the window), or across the entire genome.
First, we randomly split the samples from a VCF file into two groups for the Fst calculation.
(1) Extract the sample list from the VCF file and save it to all_samples.txt
bcftools query -l random_50.vcf.gz > all_samples.txt
(2) Shuffle the sample list randomly
shuf all_samples.txt -o shuffled_samples.txt
# Randomly shuffles the sample list from all_samples.txt and writes the result to shuffled_samples.txt. This step ensures the samples are randomly ordered before splitting
(3) Split the shuffled sample list into two files
# The `split` command will divide the shuffled file into two parts (approximately equal)
split -n l/2 shuffled_samples.txt population_list
Now you can use these two files (population_listaa and population_listab; split appends the suffixes aa and ab to the prefix by default) as input for vcftools to calculate Fst values between the two randomly generated groups.
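Because `split -n l/2` divides the file into pieces of roughly equal size in bytes without breaking lines, the two groups can end up with slightly different numbers of samples when sample names differ in length. If equal group sizes matter, one alternative is to cut the shuffled list by line count with head and tail. The sketch below assumes made-up sample names and hypothetical output names (`pop_a.txt`, `pop_b.txt`), and ends with a check that no sample landed in both groups.

```shell
workdir=$(mktemp -d)

# Made-up sample names standing in for the output of `bcftools query -l`
for i in $(seq 1 50); do printf 'sample%02d\n' "$i"; done > "$workdir/all_samples.txt"

# Shuffle, as in step (2)
shuf "$workdir/all_samples.txt" -o "$workdir/shuffled_samples.txt"

# First half of the lines goes to group a, the remainder to group b
n_total=$(wc -l < "$workdir/shuffled_samples.txt" | tr -d ' ')
n_half=$(( n_total / 2 ))
head -n "$n_half" "$workdir/shuffled_samples.txt" > "$workdir/pop_a.txt"
tail -n +"$(( n_half + 1 ))" "$workdir/shuffled_samples.txt" > "$workdir/pop_b.txt"

# Sanity checks: group sizes, and number of samples present in both groups
n_a=$(wc -l < "$workdir/pop_a.txt" | tr -d ' ')
n_b=$(wc -l < "$workdir/pop_b.txt" | tr -d ' ')
n_shared=$(sort "$workdir/pop_a.txt" "$workdir/pop_b.txt" | uniq -d | wc -l | tr -d ' ')
echo "group a: $n_a, group b: $n_b, shared: $n_shared"
```

The two files can be passed to --weir-fst-pop in the same way as the files produced by split.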
(4) Fst Calculation
vcftools --gzvcf random_50.vcf.gz --weir-fst-pop population_listaa --weir-fst-pop population_listab --out popa_vs_popb_FST
## --weir-fst-pop specifies a file containing a list of individuals (one per line) that are members of a population.
## vcftools will work with more than two populations if additional --weir-fst-pop arguments are given.
Or we can calculate Fst in each genomic window
vcftools --gzvcf random_50.vcf.gz --weir-fst-pop population_listaa --weir-fst-pop population_listab --fst-window-size 10000 --out popa_vs_popb_FST_window
## --fst-window-size indicates the size of the window in base pairs.
The first line of the windowed output says that in this region there are 13 variants and the weighted average Fst is 0.00029.
The columns of the output files are:
WEIR_AND_COCKERHAM_FST (per-site output): site-specific Fst using Weir and Cockerham’s method.
BIN_START and BIN_END (windowed output): the start and end of each genomic window.
N_VARIANTS: number of variants in each window.
WEIGHTED_FST: weighted Fst value for each window.
MEAN_FST: mean Fst value across all variants in each window.
Theoretically, Fst values range from 0 to 1, where 0 indicates no genetic differentiation between populations and 1 indicates complete differentiation. In practice, however, the estimator can return negative Fst or -nan values:
For negative Fst values, you can often treat them as 0 in downstream analysis, indicating no significant differentiation.
For -nan values, these typically represent insufficient data or fixed sites with no polymorphism, so they can be ignored or filtered out.
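The two rules above can be applied in a single awk pass. The sketch below is one possible way to do it, run on a small made-up table in the column layout of the windowed output (the file names and all values are hypothetical): rows containing nan are dropped, and negative WEIGHTED_FST or MEAN_FST values are set to 0.

```shell
workdir=$(mktemp -d)
in="$workdir/demo.windowed.weir.fst"
out="$workdir/clean.windowed.weir.fst"

# Made-up windowed Fst table, tab-separated
printf 'CHROM\tBIN_START\tBIN_END\tN_VARIANTS\tWEIGHTED_FST\tMEAN_FST\n' >  "$in"
printf 'chr1\t1\t10000\t13\t0.00029\t0.00031\n'                          >> "$in"
printf 'chr1\t10001\t20000\t8\t-0.0120\t-0.0095\n'                       >> "$in"
printf 'chr1\t20001\t30000\t2\t-nan\t-nan\n'                             >> "$in"
printf 'chr2\t1\t10000\t21\t0.2541\t0.1873\n'                            >> "$in"

awk 'BEGIN { FS = OFS = "\t" }
     NR == 1 { print; next }                 # keep the header
     $5 ~ /nan/ || $6 ~ /nan/ { next }       # drop windows without a usable estimate
     { if ($5 < 0) $5 = 0                    # treat negative estimates as 0
       if ($6 < 0) $6 = 0
       print }' "$in" > "$out"

cat "$out"
```

Matching nan as text rather than comparing it numerically avoids relying on how a particular awk implementation parses non-numeric fields.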
(5) Select sites or regions of interest with high Fst
Look for highly differentiated regions
We are interested in loci with high Fst values because they can indicate population-specific genetic features resulting from genetic drift or natural selection. Filter the file for WEIGHTED_FST and/or MEAN_FST values greater than a threshold of your choice. We can use awk, a command-line text-processing language available on Unix and Unix-like systems that is particularly useful for filtering and manipulating tabular data.
For example, filter for WEIGHTED_FST > 0.2 or MEAN_FST > 0.1:
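## show the windows sorted numerically by WEIGHTED_FST (column 5), highest first, keeping the header on top
(head -n 1 popa_vs_popb_FST_window.windowed.weir.fst; tail -n +2 popa_vs_popb_FST_window.windowed.weir.fst | sort -k5,5gr) | less -S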
awk 'NR==1 || $5 > 0.2' popa_vs_popb_FST_window.windowed.weir.fst > highWeightedFst.list
## NR==1 retains the header
## $5 > 0.2 filters the fifth column, i.e. WEIGHTED_FST, for values greater than 0.2
awk 'NR==1 || $6 > 0.1' popa_vs_popb_FST_window.windowed.weir.fst > highMeanFst.list
## $6 > 0.1 filters the sixth column, i.e. MEAN_FST, for values greater than 0.1