Filtering and Cleanup - igheyas/Bioinformatics GitHub Wiki
Filtering & Cleanup
Before using your alignments for variant calling or other analyses, it’s best to clean up the BAM:
- Remove PCR duplicates
PCR/sequencing duplicates can inflate coverage and bias variant calls. You can either mark or remove them:
# 1) Sort by coordinate (you already did this)
samtools sort -@8 -o aln.bwa.sorted.bam aln.bwa.bam
# 2) Sort by read name (needed for fixmate)
samtools sort -n -@8 -o aln.nameSorted.bam aln.bwa.sorted.bam
# 3) Fix mate information (-m adds the ms mate-score tag; markdup uses it together with the MC tag)
samtools fixmate -m aln.nameSorted.bam aln.fixmate.bam
# 4) Re-sort by coordinate (markdup expects coordinate-sorted input)
samtools sort -@8 -o aln.fixmate.sorted.bam aln.fixmate.bam
# 5) Mark (or remove) duplicates
# - To *mark* dups:
samtools markdup aln.fixmate.sorted.bam aln.markdup.bam
# - Or to *remove* them directly:
samtools markdup -r aln.fixmate.sorted.bam aln.dedup.bam
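The intermediate files in steps 2 to 5 can be avoided by piping the commands together, following the pattern shown in the samtools markdup manual. The sketch below wraps that pipeline in a function; the name dedup_pipeline and its arguments are hypothetical, and it assumes a reasonably recent samtools release in which collate, fixmate and sort accept -u (uncompressed output) and read from or write to a pipe.

```shell
#!/bin/sh
# dedup_pipeline INPUT.bam OUTPUT.bam [THREADS]
# Hypothetical helper: collate by read name, fix mates, coordinate-sort,
# then mark duplicates, all in one pipe without intermediate files.
dedup_pipeline() {
    in_bam=$1
    out_bam=$2
    threads=${3:-8}

    samtools collate -@ "$threads" -O -u "$in_bam" \
        | samtools fixmate -@ "$threads" -m -u - - \
        | samtools sort -@ "$threads" -u - \
        | samtools markdup -@ "$threads" - "$out_bam"
}
```

Calling dedup_pipeline aln.bwa.bam aln.markdup.bam would mark duplicates; adding -r to the markdup call removes them instead, as in step 5 above.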
- Filter by mapping quality & FLAG bits
Drop low-confidence or secondary/supplementary alignments to keep only your best, primary mappings:
# keep only reads with MAPQ ≥ 30; remove unmapped (0x4), secondary (0x100) and supplementary (0x800)
samtools view -b \
  -q 30 \
  -F 0x904 \
  aln.dedup.bam \
  > aln.filtered.bam
# index for fast access
samtools index aln.filtered.bam
-q 30: require MAPQ ≥ 30
-F 0x904: mask out the bits 0x4 (unmapped), 0x100 (secondary), and 0x800 (supplementary)
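To see where 0x904 comes from, the FLAG bits can be OR-ed together with plain shell arithmetic. The sketch below also shows a mask that adds 0x400, the duplicate bit in the SAM specification, which would be the one to use if duplicates were only marked (aln.markdup.bam) rather than removed. The helper name classify_flag is hypothetical and only mimics how -F decides to keep or drop a read.

```shell
#!/bin/sh
# FLAG bits as defined in the SAM specification.
MASK=$(( 0x4 | 0x100 | 0x800 ))      # unmapped | secondary | supplementary
MASK_WITH_DUPS=$(( MASK | 0x400 ))   # additionally exclude marked duplicates

printf 'mask 0x%X\n' "$MASK"                      # prints: mask 0x904
printf 'mask with dups 0x%X\n' "$MASK_WITH_DUPS"  # prints: mask with dups 0xD04

# classify_flag FLAG [MASK]: a read is kept only if none of the masked bits are set.
classify_flag() {
    if [ $(( $1 & ${2:-$MASK} )) -eq 0 ]; then
        echo keep
    else
        echo drop
    fi
}

classify_flag 99                       # properly paired primary read: keep
classify_flag 355                      # 99 + 0x100 (secondary): drop
classify_flag 1123                     # 99 + 0x400 (duplicate): keep under 0x904
classify_flag 1123 "$MASK_WITH_DUPS"   # same read under 0xD04: drop
```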
After these steps, aln.filtered.bam contains a clean, duplicate-free, high-quality alignment ready for variant calling, coverage analysis, or visualization.
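One possible sanity check is to count the records in the filtered file twice, once without filters and once with the same -q and -F options: if the filtering worked, the two counts should match. The function name check_filtered is hypothetical; it relies only on samtools view -c, which prints the number of matching records.

```shell
#!/bin/sh
# check_filtered BAM [MIN_MAPQ]
# Hypothetical helper: succeeds when every record in BAM already passes
# the MAPQ and FLAG filter used above.
check_filtered() {
    bam=$1
    min_mapq=${2:-30}

    total=$(samtools view -c "$bam")
    passing=$(samtools view -c -q "$min_mapq" -F 0x904 "$bam")

    if [ "$total" -eq "$passing" ]; then
        echo "OK: all $total records pass"
    else
        echo "MISMATCH: $passing of $total records pass" >&2
        return 1
    fi
}
```

For example, check_filtered aln.filtered.bam should report OK, while running it on aln.dedup.bam would normally report a mismatch.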