Population Level Analyses - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Once you have a multi-sample VCF (e.g. after bcftools merge
), you can characterize variation across your cohort by computing allele-frequency spectra and testing for linkage disequilibrium. Below are two common analyses using vcftools
.
What it is
The allele frequency spectrum (AFS) tallies how often each alternate allele appears in your samples. It’s the foundation for population-genetic summaries (e.g. Tajima’s D, Site Frequency Spectrum).
Tool
VCFtools
Install
# via conda (recommended)
conda install -c bioconda vcftools
# or system-wide on Ubuntu
sudo apt update && sudo apt install vcftools
Command
# compute per‐site allele frequencies
vcftools \
--gzvcf merged.vcf.gz \
--freq \
--out cohort_freq
# output files:
# cohort_freq.frq # allele counts + frequencies
# cohort_freq.frq.id # site IDs
cohort_freq.frq (tab‐delimited)
CHROM POS N_ALLELES N_CHR {ALLELE:COUNT:FREQ}...
1 NC_000913.3 1000 2 6 A:5:0.8333 G:1:0.1667
2 NC_000913.3 2000 2 6 T:3:0.5000 C:3:0.5000
...
> **Tip:** if you want the *folded* spectrum (minor‐allele only), run:
```bash
vcftools --gzvcf merged.vcf.gz --freq2 --out cohort_freq_folded
You can then plot the site‐frequency spectrum in R or Python, e.g.:
# R example
afs <- read.table("cohort_freq.frq", header=TRUE)
# extract minor‐allele freq
maf <- apply(afs[,5:ncol(afs)], 1, function(x){
fr <- as.numeric(sub(".*:(.*)$","\\1", x[grepl(":",x)]))
min(fr)
})
hist(maf, breaks=20, main="Minor Allele Frequency Spectrum", xlab="MAF")
What it is
Linkage disequilibrium (LD) measures non-random association between alleles at two sites (e.g. r²). High LD over short distances is expected in clonal or low-recombination populations.
Tool
VCFtools (computed pairwise r²)
Command
# compute pairwise r^2 between all SNPs within 50 kb sliding windows
vcftools \
--gzvcf merged.vcf.gz \
--geno-r2 \
--ld-window-bp 50000 \
--out cohort_ld
# output file:
# cohort_ld.ld # three‐column: POS1 POS2 R2
cohort_ld.ld
CHROM POS1 POS2 R2
1 NC_000913.3 1000 1500 0.24
2 NC_000913.3 1000 2000 0.03
3 NC_000913.3 1500 2000 0.12
...
## Plotting LD Decay
In R, you can aggregate r² by distance:
```r
ld <- read.table("cohort_ld.ld", header = TRUE)
ld$dist <- abs(ld$POS2 - ld$POS1)
# bin distances
library(dplyr)
binned <- ld %>%
group_by(bin = cut(dist, breaks = seq(0, 50000, by = 1000))) %>%
summarize(mean_r2 = mean(R2, na.rm = TRUE))
plot(
seq(500, 50000, by = 1000),
binned$mean_r2,
type = "b", pch = 16,
xlab = "Distance (bp)",
ylab = expression(bar(r^2)),
main = "LD Decay"
)
- You can limit SNPs (e.g.
--maf 0.05
) to focus on common variants. - For very large panels, sample down to speed up LD calculations.
- Alternative tools: PLINK for sliding‐window LD matrices.
With allele‐frequency spectra and LD curves in hand, you have a population‐genetic overview of diversity and recombination in your cohort—key for downstream demographic inference, association mapping, or evolutionary analyses.