Calling Variants - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

Calling Variants

Pileup-based callers (bcftools mpileup + call)

  1. Exit your InSilicoSeq virtual environment (if youโ€™re still in it):
deactivate
  1. Install bcftools and tabix (needed to call and index VCFs):

    sudo apt update
    sudo apt install -y bcftools tabix
    
  2. Re-enter your project directory and (re)activate your ISS env:

    cd /mnt/c/Users/IAGhe/OneDrive/Documents/Bioinformatics
    source iss_env/bin/activate
    
  3. Run the pileup โ†’ call pipeline:

    bcftools mpileup \
      -f ref_genome.fa \
      -Ou \
      aln.filtered.bam \
    | bcftools call \
      -mv \
      -Oz \
      -o raw_snps_indels.vcf.gz
    
  4. Index the VCF (for fast querying):

    tabix -p vcf raw_snps_indels.vcf.gz
    
  5. Take a peek at your variants:

    bcftools view raw_snps_indels.vcf.gz | head -n 30
    
  6. (Optional) View in a nice column format:

    bcftools view raw_snps_indels.vcf.gz \
      | column -t \
      | less -S
    

All of the above is adapted from the InSilicoSeq hands-on demo.


Haplotype-aware callers

  • GATK HaplotypeCaller
# Requires GATK installed (via Conda or you can download the jar)
gatk HaplotypeCaller \
  -R ref_genome.fa \
  -I aln.filtered.bam \
  -O gatk_variants.vcf.gz
  • FreeBayes
# A lightweight, multiallelic haplotype caller
freebayes \
  -f ref_genome.fa \
  aln.filtered.bam \
  > freebayes_variants.vcf

(Commands to run these would follow a similar pattern: index the BAM, then invoke the caller with -R ref_genome.fa, -I aln.sorted.bam, and appropriate output flags.)


Long-read callers

  • Medaka (for ONT data)
# After mapping ONT reads with minimap2
medaka_variant \
  -i ont_reads.fastq.gz \
  -d ref_genome.fa \
  -b ont_aln.bam \
  -o medaka_variants
  • Clair3
# Clair3 Docker or conda install; specify platform 'ont' or 'hifi'
clair3.sh \
  --bam ont_aln.bam \
  --ref ref_genome.fa \
  --threads 8 \
  --platform ont \
  --output clair3_out

(These tools take long-read alignments (e.g. minimap2 โ†’ BAM), then call variants using neural-network models trained on long-read error profiles.)