SAM Format Deep Dive - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

SAM Format Deep Dive

Once you have your reads aligned (e.g. with bwa mem or bowtie2), your aligner will output a SAM file. SAM (Sequence Alignment/Map) is a TAB-delimited text format: after optional header lines that begin with "@", each line describes one alignment. Below we break down the key parts.


1. Mandatory Fields

Every alignment line in SAM must have at least these 11 columns:

| Column | Name | Description |
| --- | --- | --- |
| 1 | QNAME | Query (read) name / identifier |
| 2 | FLAG | Bitwise flag encoding orientation and status |
| 3 | RNAME | Reference sequence name (e.g. chromosome, contig); "*" if unmapped |
| 4 | POS | 1-based leftmost mapping position on reference (0 if unavailable) |
| 5 | MAPQ | Mapping quality (Phred-scaled; 255 means unavailable) |
| 6 | CIGAR | Describes alignment (M/I/D/N/S/H/P/=/X operations) |
| 7 | RNEXT | Reference name of the mate/next read (“=” if same as RNAME) |
| 8 | PNEXT | 1-based position of the mate/next read |
| 9 | TLEN | Signed observed template length (insert size) |
| 10 | SEQ | Read sequence |
| 11 | QUAL | ASCII-encoded base quality scores (Phred+33) |
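
To make the column layout concrete, here is a minimal Python sketch that splits one alignment line into the 11 mandatory fields plus any trailing optional fields. The names `SamRecord`, `parse_sam_line`, `phred_scores` and `mapq_error_probability` are made up for this illustration, and the example read is invented. For real work a library such as pysam is the usual choice.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus the raw optional fields."""
    qname: str
    flag: int
    rname: str
    pos: int        # 1-based, 0 if unavailable
    mapq: int       # 255 means "not available"
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]  # unparsed TAG:TYPE:VALUE strings


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a '@' header line) on TABs."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        rnext=fields[6], pnext=int(fields[7]), tlen=int(fields[8]),
        seq=fields[9], qual=fields[10], tags=fields[11:],
    )


def phred_scores(qual: str) -> List[int]:
    """Decode a Phred+33 QUAL string; '*' means qualities were not stored."""
    if qual == "*":
        return []
    return [ord(char) - 33 for char in qual]


def mapq_error_probability(mapq: int) -> float:
    """MAPQ = -10 * log10(P(mapping position is wrong))."""
    return 10 ** (-mapq / 10)


# Invented example record, for illustration only.
example_line = "read1\t99\tchr1\t100\t60\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
record = parse_sam_line(example_line)
print(record.rname, record.pos, record.cigar)   # chr1 100 10M
print(phred_scores(record.qual)[:3])            # [40, 40, 40]
```

Numeric columns are converted to `int` so that later steps (flag decoding, coordinate arithmetic) do not have to re-parse strings.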

2. SAM FLAG Bits

The FLAG field is an integer bit field: each bit describes one property of the alignment or read, and the stored value is the sum of the bits that are set:

| Bit (hex) | Bit (dec) | Meaning |
| --- | --- | --- |
| 0x1 | 1 | template has multiple segments (paired) |
| 0x2 | 2 | each segment properly aligned according to the aligner (proper pair) |
| 0x4 | 4 | this segment is unmapped |
| 0x8 | 8 | next segment in template is unmapped |
| 0x10 | 16 | SEQ is reverse complemented |
| 0x20 | 32 | SEQ of the next segment is reverse complemented |
| 0x40 | 64 | this is the first segment in the template |
| 0x80 | 128 | this is the last segment in the template |
| 0x100 | 256 | secondary alignment |
| 0x200 | 512 | read fails platform/vendor quality checks |
| 0x400 | 1024 | PCR or optical duplicate |
| 0x800 | 2048 | supplementary alignment |

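Tip: To decode a FLAG, convert it to binary and note which bits are set. For example, 99 = 64 + 32 + 2 + 1: paired, properly aligned, mate on the reverse strand, first segment in the template.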
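
The same decoding can be sketched in a few lines of Python. The short labels in `FLAG_BITS` and the function name `decode_flag` are chosen for this example only; they are not official names from the SAM specification.

```python
from typing import List

# Bit value -> short label (labels are illustrative, not official names).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_template",
    0x80: "last_in_template",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse', 'first_in_template']
print(decode_flag(147))
# ['paired', 'proper_pair', 'reverse', 'last_in_template']
```

Flags 99 and 147 are a typical pair: the first read maps to the forward strand and its mate to the reverse strand.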


3. CIGAR Strings

The CIGAR column encodes how the read aligns to the reference:

| CIGAR Op | Meaning |
| --- | --- |
| M | alignment match (can be a match or mismatch) |
| I | insertion to the reference |
| D | deletion from the reference |
| N | skipped region from the reference (e.g. intron) |
| S | soft clipping (clipped sequences present in SEQ) |
| H | hard clipping (clipped sequences not present in SEQ) |
| P | padding (silent deletion from padded reference) |
| = | sequence match |
| X | sequence mismatch |

Examples:

  • 100M → 100 bp aligned (each base may be a match or a mismatch)
  • 50M10I40M → 50 bp aligned, 10 bp inserted relative to the reference, 40 bp aligned
  • 10S90M → first 10 bp soft-clipped (still present in SEQ), 90 bp aligned
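
As a rough sketch of how these strings can be processed, the Python below splits a CIGAR into (length, operation) pairs and sums how many bases it consumes on the read and on the reference. The function names are hypothetical. Which operations consume the query (M, I, S, =, X) and which consume the reference (M, D, N, =, X) follows the SAM specification.

```python
import re
from typing import List, Tuple

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")

CONSUMES_QUERY = frozenset("MIS=X")       # advance along SEQ
CONSUMES_REFERENCE = frozenset("MDN=X")   # advance along the reference


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, op) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases covered (equals len(SEQ) when SEQ is stored)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned, starting at POS."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


print(parse_cigar("50M10I40M"))       # [(50, 'M'), (10, 'I'), (40, 'M')]
print(query_length("50M10I40M"))      # 100
print(reference_length("50M10I40M"))  # 90
```

With POS and the reference length you can compute the rightmost aligned position as POS + reference_length - 1.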

4. Optional Tags

After the 11 mandatory columns, you may see arbitrary TAG:TYPE:VALUE fields. Common ones include:

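| TAG | TYPE | Meaning |
| --- | --- | --- |
| NM | i | Edit distance to the reference (mismatched plus inserted and deleted bases) |
| MD | Z | String encoding mismatched and deleted reference bases (with CIGAR and SEQ, recovers the reference bases) |
| AS | i | Alignment score (aligner-specific) |
| XS | A | Transcript strand (+ or -) for spliced alignments, written by splice-aware mappers; bwa and bowtie2 instead write XS:i, the suboptimal alignment score |
| RG | Z | Read group identifier (for multi-sample pipelines) |
| BC | Z | Sample barcode sequence (for multiplexed data); single-cell pipelines commonly use CB for the cell barcode |
| SA | Z | Other canonical alignments of a chimeric / split read |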

Tip: Use samtools view -d TAG[:VALUE] (long form --tag, available in recent samtools releases) or grep to filter by specific tags.
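
If you are already handling records in a script, the optional fields can be turned into a dictionary. The sketch below assumes the fields have been split on TABs beforehand; `parse_tag` and `parse_tags` are names invented for this example, and the sample values are made up.

```python
from typing import Any, Dict, Iterable, Tuple


def parse_tag(field: str) -> Tuple[str, Any]:
    """Convert one TAG:TYPE:VALUE string into (tag, typed value)."""
    # maxsplit=2 because Z-type values may themselves contain ':'
    tag, value_type, value = field.split(":", 2)
    if value_type == "i":
        return tag, int(value)
    if value_type == "f":
        return tag, float(value)
    if value_type == "B":
        # Numeric array: first item is the element type, e.g. "c,1,-2,3"
        subtype, *items = value.split(",")
        convert = float if subtype == "f" else int
        return tag, [convert(item) for item in items]
    # A (character), Z (string) and H (hex string) are kept as text
    return tag, value


def parse_tags(fields: Iterable[str]) -> Dict[str, Any]:
    """Parse all optional fields of one record into a dict keyed by tag."""
    return dict(parse_tag(field) for field in fields)


# Invented optional fields, for illustration only.
optional_fields = ["NM:i:2", "MD:Z:10A5^AC6", "AS:i:95", "XS:A:+"]
tags = parse_tags(optional_fields)
print(tags["NM"], tags["MD"], tags["XS"])  # 2 10A5^AC6 +

# Simple tag-based filter: keep the record only if it has at most 2 edits.
keep = tags.get("NM", 0) <= 2
```

Converting `i` and `f` values to numbers makes threshold filters such as the `NM` check straightforward.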


With these pieces in hand, you can fully interpret any SAM record:

  • Who? (QNAME)
  • Where & how? (RNAME, POS, CIGAR)
  • Orientation/status? (FLAG)
  • Quality metrics? (MAPQ, NM, AS)

Understanding SAM’s structure is key to downstream processing (conversion to BAM, sorting, indexing, variant calling, etc.).