Hands‐On Exercise‐VC - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

Hands-On Exercise

In this final exercise, you’ll take your assembled or simulated reads and:

  1. Call SNPs & indels
  2. Convert, filter & index the resulting VCF
  3. Annotate with SnpEff and summarise key variant metrics

Tools & Requirements

  • BWA-MEM (≥ 0.7.17) – read alignment
  • samtools (≥ 1.10) & bcftools (≥ 1.10) – BAM/VCF processing & variant calling
  • SnpEff (≥ 5.0) – effect prediction
  • tabix/bgzip – VCF compression & indexing
  • R or Python (optional) – downstream metrics & plots

1. Variant Calling

Assuming you have:

  • Reference genome: ref.fa
  • Paired-end reads: reads_R1.fq.gz & reads_R2.fq.gz
# 1.1 Index the reference
bwa index ref.fa
samtools faidx ref.fa

# 1.2 Align reads → BAM 
bwa mem -t 8 ref.fa reads_R1.fq.gz reads_R2.fq.gz \
  | samtools view -bS - \
  > aln.raw.bam

# 1.3 Sort & index
samtools sort -o aln.sorted.bam aln.raw.bam
samtools index aln.sorted.bam

# 1.4 Call raw variants (SNPs & indels)
bcftools mpileup -f ref.fa -Ou aln.sorted.bam \
  | bcftools call -mv -Oz -o raw.vcf.gz

# 1.5 Index raw VCF
tabix -p vcf raw.vcf.gz

Expected: a raw.vcf.gz containing all putative SNPs and indels.

2. Convert, Filter & Index

# 2.1 Hard‐filter: keep QUAL≥30, DP≥10
bcftools filter \
  -i 'QUAL>=30 && DP>=10' \
  raw.vcf.gz \
  -Oz -o filtered.vcf.gz

# 2.2 Index the filtered VCF
tabix -p vcf filtered.vcf.gz

# 2.3 Quick summary
bcftools stats filtered.vcf.gz > stats.txt

Expected:

  • filtered.vcf.gz with only high-confidence calls
  • stats.txt summarising counts, Ts/Tv ratio, etc.

3. Annotate with SnpEff & Summarize

# 3.1 Annotate effects
snpeff -v GRCh38.86 filtered.vcf.gz \
  | bgzip -c > annotated.vcf.gz

# 3.2 Index annotated VCF
tabix -p vcf annotated.vcf.gz

# 3.3 Extract key metrics
#   total variants, SNP vs INDEL, Ts/Tv, top effects
bcftools query \
  -f '%TYPE\n' annotated.vcf.gz \
| sort | uniq -c > variant_types.txt

bcftools stats annotated.vcf.gz \
  | grep ^TSTV > ts_tv_ratio.txt

bcftools query \
  -f '%INFO/ANN\n' annotated.vcf.gz \
  | cut -d'|' -f2 \
  | sort | uniq -c \
  > effect_counts.txt

Example Metrics Table

Metric Command snippet Example Value
Total variants wc -l filtered.vcf.gz 3,450
SNPs vs INDELs variant_types.txt SNP:3,200; INDEL:250
Ti/Tv ratio ts_tv_ratio.txt 2.1
Missense variants grep 'missense_variant' effect_counts.txt 620
Synonymous variants grep 'synonymous_variant' effect_counts.txt 410

🎉 You’ve now:

  • Turned raw alignments into a high-confidence, filtered VCF
  • Predicted functional effects with SnpEff
  • Generated summary reports on variant types, quality, and predicted impact

Feel free to load annotated.vcf.gz in IGV or import effect_counts.txt into R/Python for further visualisation!