Filtering and Cleanup - igheyas/Bioinformatics GitHub Wiki

Filtering & Cleanup

Before using your alignments for variant calling or other analyses, it’s best to clean up the BAM:

  • Remove PCR duplicates
    PCR/sequencing duplicates can inflate coverage and bias variant calls. You can either mark or remove them:
# 1) Sort by coordinate (you already did this)
samtools sort -@8 -o aln.bwa.sorted.bam aln.bwa.bam

# 2) Sort by read name (needed for fixmate)
samtools sort -n -@8 -o aln.nameSorted.bam aln.bwa.sorted.bam

# 3) Fix mate information (-m adds the ms mate-score tags that markdup uses to pick which read to keep)
samtools fixmate -m aln.nameSorted.bam aln.fixmate.bam
# 4) Re-sort by coordinate (markdup expects coordinate-sorted input)
samtools sort -@8 -o aln.fixmate.sorted.bam aln.fixmate.bam

# 5) Mark (or remove) duplicates
#    - To *mark* dups:
samtools markdup aln.fixmate.sorted.bam aln.markdup.bam

#    - Or to *remove* them directly:
samtools markdup -r aln.fixmate.sorted.bam aln.dedup.bam
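The intermediate files in steps 2 to 5 can be avoided by piping the commands together. The following is a minimal sketch, not a tested production script: the function name `markdup_pipeline` and its arguments are hypothetical, and it assumes a samtools version in which `sort` writes to standard output when `-o` is omitted and `fixmate` accepts `-` for standard input and output.

```shell
# Hypothetical helper: name-sort, fixmate, coordinate-sort and markdup in one pipeline.
# Usage: markdup_pipeline in.bam out.bam [threads]
markdup_pipeline() {
    in_bam=$1
    out_bam=$2
    threads=${3:-8}   # default thread count is an assumption, matching -@8 above

    # Each stage reads the previous stage's BAM stream from stdin ("-").
    samtools sort -n -@ "$threads" "$in_bam" \
        | samtools fixmate -m - - \
        | samtools sort -@ "$threads" - \
        | samtools markdup - "$out_bam"
}
```

For example, `markdup_pipeline aln.bwa.sorted.bam aln.markdup.bam` would produce the same marked file as steps 2 to 5, assuming the input is the coordinate-sorted BAM from step 1. To remove duplicates instead of marking them, add `-r` to the final `samtools markdup` call.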

  • Filter by mapping quality & FLAG bits
    Drop low-confidence or secondary/supplementary alignments to keep only your best, primary mappings:

# keep only reads with MAPQ ≥ 30; drop unmapped (0x4), secondary (0x100) and supplementary (0x800)
samtools view -b \
  -q 30 \
  -F 0x904 \
  aln.dedup.bam \
  > aln.filtered.bam

# index for fast access
samtools index aln.filtered.bam

  • -q 30: keep only reads with MAPQ ≥ 30
  • -F 0x904: exclude any read that has at least one of the bits 0x4 (unmapped), 0x100 (secondary), or 0x800 (supplementary) set
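The mask value is simply the bitwise OR of the individual FLAG bits. The sketch below shows the arithmetic in plain shell, with no samtools required. The variable names and the `passes_filter` helper are illustrative only; the helper mimics the `-q 30 -F 0x904` test for a single read's FLAG and MAPQ. If duplicates were only marked rather than removed, the duplicate bit (0x400) can be added to the mask as well.

```shell
# SAM FLAG bits (values from the SAM specification)
UNMAPPED=0x4
SECONDARY=0x100
DUPLICATE=0x400
SUPPLEMENTARY=0x800

# OR the bits together to build the value passed to -F
mask=$(printf '0x%X' $(( $UNMAPPED | $SECONDARY | $SUPPLEMENTARY )))
mask_with_dups=$(printf '0x%X' $(( $mask | $DUPLICATE )))

echo "$mask"            # 0x904
echo "$mask_with_dups"  # 0xD04

# Hypothetical helper: succeeds if a read with FLAG $1 and MAPQ $2
# would survive `samtools view -q 30 -F 0x904`.
passes_filter() {
    [ $(( $1 & $mask )) -eq 0 ] && [ "$2" -ge 30 ]
}
```

For instance, `passes_filter 99 60` succeeds (a properly paired primary alignment with high MAPQ), while `passes_filter 355 60` fails because FLAG 355 has the secondary bit (0x100) set.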

After these steps, aln.filtered.bam contains only mapped, primary alignments with MAPQ ≥ 30 and with duplicates removed (because the filter was run on aln.dedup.bam), ready for variant calling, coverage analysis, or visualization.