Population Level Analyses - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

Population-Level Analyses

Once you have a multi-sample VCF (e.g. after bcftools merge), you can characterize variation across your cohort by computing allele-frequency spectra and testing for linkage disequilibrium. Below are two common analyses using vcftools.

1. Allele Frequency Spectrum

What it is
The allele frequency spectrum (AFS) tallies how often each alternate allele appears in your samples. It’s the foundation for population-genetic summaries (e.g. Tajima’s D, Site Frequency Spectrum).

Tool
VCFtools

Install

# via conda (recommended)
conda install -c bioconda vcftools

# or system-wide on Ubuntu
sudo apt update && sudo apt install vcftools

Command

# compute per‐site allele frequencies
vcftools \
  --gzvcf merged.vcf.gz \
  --freq \
  --out cohort_freq

# output files:
#   cohort_freq.frq        # allele counts + frequencies
#   cohort_freq.frq.id     # site IDs

cohort_freq.frq (tab‐delimited)

   CHROM       POS   N_ALLELES   N_CHR   {ALLELE:COUNT:FREQ}...
1  NC_000913.3 1000  2           6       A:5:0.8333  G:1:0.1667
2  NC_000913.3 2000  2           6       T:3:0.5000  C:3:0.5000
...
> **Tip:** if you want the *folded* spectrum (minor‐allele only), run:
```bash
vcftools --gzvcf merged.vcf.gz --freq2 --out cohort_freq_folded

You can then plot the site‐frequency spectrum in R or Python, e.g.:

# R example
afs <- read.table("cohort_freq.frq", header=TRUE)
# extract minor‐allele freq
maf <- apply(afs[,5:ncol(afs)], 1, function(x){
  fr <- as.numeric(sub(".*:(.*)$","\\1", x[grepl(":",x)]))
  min(fr)
})
hist(maf, breaks=20, main="Minor Allele Frequency Spectrum", xlab="MAF")

2. Linkage Disequilibrium

What it is
Linkage disequilibrium (LD) measures non-random association between alleles at two sites (e.g. r²). High LD over short distances is expected in clonal or low-recombination populations.

Tool
VCFtools (computed pairwise r²)

Command

# compute pairwise r^2 between all SNPs within 50 kb sliding windows
vcftools \
  --gzvcf merged.vcf.gz \
  --geno-r2 \
  --ld-window-bp 50000 \
  --out cohort_ld

# output file:
#   cohort_ld.ld   # three‐column: POS1  POS2  R2

cohort_ld.ld

   CHROM   POS1   POS2   R2
1  NC_000913.3 1000  1500  0.24
2  NC_000913.3 1000  2000  0.03
3  NC_000913.3 1500  2000  0.12
...
## Plotting LD Decay

In R, you can aggregate r² by distance:

```r
ld <- read.table("cohort_ld.ld", header = TRUE)
ld$dist <- abs(ld$POS2 - ld$POS1)
# bin distances
library(dplyr)
binned <- ld %>%
  group_by(bin = cut(dist, breaks = seq(0, 50000, by = 1000))) %>%
  summarize(mean_r2 = mean(R2, na.rm = TRUE))

plot(
  seq(500, 50000, by = 1000),
  binned$mean_r2,
  type = "b", pch = 16,
  xlab = "Distance (bp)",
  ylab = expression(bar(r^2)),
  main = "LD Decay"
)

Tips

  • You can limit SNPs (e.g. --maf 0.05) to focus on common variants.
  • For very large panels, sample down to speed up LD calculations.
  • Alternative tools: PLINK for sliding‐window LD matrices.

With allele‐frequency spectra and LD curves in hand, you have a population‐genetic overview of diversity and recombination in your cohort—key for downstream demographic inference, association mapping, or evolutionary analyses.

⚠️ **GitHub.com Fallback** ⚠️