Hands‐On Exercise‐VC - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
Hands-On Exercise
In this final exercise, you’ll take your assembled or simulated reads and:
- Call SNPs & indels
- Convert, filter & index the resulting VCF
- Annotate with SnpEff and summarise key variant metrics
Tools & Requirements
- BWA-MEM (≥ 0.7.17) – read alignment
- samtools (≥ 1.10) & bcftools (≥ 1.10) – BAM/VCF processing & variant calling
- SnpEff (≥ 5.0) – effect prediction
- tabix/bgzip – VCF compression & indexing
- R or Python (optional) – downstream metrics & plots
1. Variant Calling
Assuming you have:
- Reference genome:
ref.fa
- Paired-end reads:
reads_R1.fq.gz
&reads_R2.fq.gz
# 1.1 Index the reference
bwa index ref.fa
samtools faidx ref.fa
# 1.2 Align reads → BAM
bwa mem -t 8 ref.fa reads_R1.fq.gz reads_R2.fq.gz \
| samtools view -bS - \
> aln.raw.bam
# 1.3 Sort & index
samtools sort -o aln.sorted.bam aln.raw.bam
samtools index aln.sorted.bam
# 1.4 Call raw variants (SNPs & indels)
bcftools mpileup -f ref.fa -Ou aln.sorted.bam \
| bcftools call -mv -Oz -o raw.vcf.gz
# 1.5 Index raw VCF
tabix -p vcf raw.vcf.gz
Expected: a raw.vcf.gz
containing all putative SNPs and indels.
2. Convert, Filter & Index
# 2.1 Hard‐filter: keep QUAL≥30, DP≥10
bcftools filter \
-i 'QUAL>=30 && DP>=10' \
raw.vcf.gz \
-Oz -o filtered.vcf.gz
# 2.2 Index the filtered VCF
tabix -p vcf filtered.vcf.gz
# 2.3 Quick summary
bcftools stats filtered.vcf.gz > stats.txt
Expected:
filtered.vcf.gz
with only high-confidence callsstats.txt
summarising counts, Ts/Tv ratio, etc.
3. Annotate with SnpEff & Summarize
# 3.1 Annotate effects
snpeff -v GRCh38.86 filtered.vcf.gz \
| bgzip -c > annotated.vcf.gz
# 3.2 Index annotated VCF
tabix -p vcf annotated.vcf.gz
# 3.3 Extract key metrics
# total variants, SNP vs INDEL, Ts/Tv, top effects
bcftools query \
-f '%TYPE\n' annotated.vcf.gz \
| sort | uniq -c > variant_types.txt
bcftools stats annotated.vcf.gz \
| grep ^TSTV > ts_tv_ratio.txt
bcftools query \
-f '%INFO/ANN\n' annotated.vcf.gz \
| cut -d'|' -f2 \
| sort | uniq -c \
> effect_counts.txt
Example Metrics Table
Metric | Command snippet | Example Value |
---|---|---|
Total variants | wc -l filtered.vcf.gz |
3,450 |
SNPs vs INDELs | variant_types.txt |
SNP:3,200; INDEL:250 |
Ti/Tv ratio | ts_tv_ratio.txt |
2.1 |
Missense variants | grep 'missense_variant' effect_counts.txt |
620 |
Synonymous variants | grep 'synonymous_variant' effect_counts.txt |
410 |
🎉 You’ve now:
- Turned raw alignments into a high-confidence, filtered VCF
- Predicted functional effects with SnpEff
- Generated summary reports on variant types, quality, and predicted impact
Feel free to load annotated.vcf.gz
in IGV or import effect_counts.txt
into R/Python for further visualisation!