SAM Format Deep Dive

Once you have your reads aligned (e.g. with bwa mem or bowtie2), your aligner will output a SAM file. SAM (Sequence Alignment/Map) is a TAB-delimited text format: after an optional header section (lines starting with @), each line describes one alignment. Below we break down the key parts of an alignment line.

1. Mandatory Fields

Every alignment line in SAM must have at least these 11 columns:

Column	Name	Description
1	QNAME	Query (read) name / identifier
2	FLAG	Bitwise flag encoding orientation + status
3	RNAME	Reference sequence name (e.g. chromosome, contig)
4	POS	1-based leftmost mapping position on reference
5	MAPQ	Mapping quality (Phred-scaled; 255 means unavailable)
6	CIGAR	Describes alignment (M/I/D/N/S/H/P/=/X operations)
7	RNEXT	Reference name of the mate/next read (“=” if same as RNAME, “*” if unavailable)
8	PNEXT	1-based position of the mate/next read (0 if unavailable)
9	TLEN	Observed template length (insert size)
10	SEQ	Read sequence
11	QUAL	ASCII-encoded base quality scores (Phred+33)
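
As a minimal sketch of how the 11 columns map onto a record, the snippet below splits one alignment line on TABs and converts the numeric columns to integers. The `SamRecord` type, the `parse_sam_line` helper, and the example line are illustrative assumptions, not part of any library; for real data a dedicated parser such as pysam is the safer choice.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus any raw optional fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line) into its columns."""
    if line.startswith("@"):
        raise ValueError("header lines are not alignment records")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],  # optional TAG:TYPE:VALUE fields, left unparsed
    )


# Made-up record for illustration only.
example = "read1\t99\tchr1\t100\t60\t10M\t=\t250\t160\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.cigar)  # prints: chr1 100 10M
```

Keeping the optional fields as raw strings means the parser works regardless of which tags a given aligner emits.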

2. SAM FLAG Bits

The FLAG field is a bitwise integer where each bit describes something about the alignment or read:

Bit (hex)	Bit (dec)	Meaning
0x1	1	template has multiple segments (paired)
0x2	2	each segment properly aligned according to the aligner
0x4	4	this segment is unmapped
0x8	8	next segment in template is unmapped
0x10	16	SEQ is reverse complemented
0x20	32	SEQ of the next segment is reverse comp.
0x40	64	this is the first segment in the template
0x80	128	this is the last segment in the template
0x100	256	secondary alignment
0x200	512	read fails platform/vendor quality checks
0x400	1024	PCR or optical duplicate
0x800	2048	supplementary alignment

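Tip: To decode a FLAG, convert it to binary and note which bits are set. For example, 99 = 64 + 32 + 2 + 1: first segment, mate on the reverse strand, properly aligned, paired.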
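
The decoding step can be sketched in a few lines of Python: test each bit from the table with a bitwise AND and collect the ones that are set. The short labels in `FLAG_BITS` and the `decode_flag` helper are names chosen for this example, not official identifiers.

```python
from typing import List

# Bit value -> short label (labels are this example's own shorthand).
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_segment",
    0x80: "last_segment",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    if flag < 0:
        raise ValueError("FLAG must be a non-negative integer")
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))
# ['paired', 'proper_pair', 'mate_reverse', 'first_segment']
print(decode_flag(147))
# ['paired', 'proper_pair', 'reverse', 'last_segment']
```

Flags 99 and 147 are a typical pairing for the two mates of a properly aligned read pair.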

3. CIGAR Strings

The CIGAR column encodes how the read aligns to the reference:

CIGAR Op	Meaning
M	alignment match (can be a match or mismatch)
I	insertion to the reference
D	deletion from the reference
N	skipped region from the reference (e.g. intron)
S	soft clipping (clipped sequences present in SEQ)
H	hard clipping (clipped sequences not present in SEQ)
P	padding (silent deletion from padded reference)
=	sequence match
X	sequence mismatch

Examples:

100M → 100 bp aligned (match/mismatch)
50M10I40M → 50 bp aligned, 10 bp inserted relative to the reference, 40 bp aligned
10S90M → first 10 bp soft-clipped, 90 bp aligned
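
To make the examples above concrete, here is a small sketch that splits a CIGAR string into (length, operation) pairs and counts how many bases it spans on the reference and in the read. M, D, N, = and X consume reference positions, while M, I, S, = and X consume read bases. The helper names `parse_cigar`, `reference_length`, and `query_length` are assumptions made for this example.

```python
import re
from typing import List, Tuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")

CONSUMES_REFERENCE = set("MDN=X")
CONSUMES_QUERY = set("MIS=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, op) pairs; '*' means unavailable."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


def query_length(cigar: str) -> int:
    """Number of read bases covered, which should equal len(SEQ)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


for cigar in ("100M", "50M10I40M", "10S90M"):
    print(cigar, reference_length(cigar), query_length(cigar))
# 100M 100 100
# 50M10I40M 90 100
# 10S90M 90 100
```

The rightmost reference position of an alignment is then POS + reference_length - 1.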

4. Optional Tags

After the 11 mandatory columns, you may see arbitrary TAG:TYPE:VALUE fields. Common ones include:

TAG	TYPE	Meaning
NM	`i`	Edit distance (# mismatches + indels) between read & reference
MD	`Z`	String of mismatching positions and deleted reference bases (with CIGAR and SEQ, lets you reconstruct the reference sequence)
AS	`i`	Alignment score (aligner-specific)
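XS	`A`	Transcript strand (+/-) of a spliced alignment, as written by splice-aware mappers; note that BWA and Bowtie 2 instead write XS:i, the suboptimal alignment score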
RG	`Z`	Read group identifier (for multi-sample pipelines)
BC	`Z`	Barcode sequence (for single-cell / multiplexed data)
SA	`Z`	Other alignments of the same read in a chimeric alignment (split reads)

Tip: To filter by a specific tag, use samtools view -d TAG:VALUE (long form --tag, available in recent samtools releases) or grep for the tag string.
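
For scripted filtering, the TAG:TYPE:VALUE layout is simple enough to parse by hand. The sketch below converts `i` and `f` values to numbers and leaves the other types as strings; `parse_tags` is a hypothetical helper and the tag values are made up for illustration.

```python
from typing import Dict, Iterable, Union

TagValue = Union[int, float, str]


def parse_tags(fields: Iterable[str]) -> Dict[str, TagValue]:
    """Turn optional TAG:TYPE:VALUE fields into a {TAG: value} dict."""
    tags: Dict[str, TagValue] = {}
    for field in fields:
        # Split at most twice: the VALUE part may itself contain ':'.
        parts = field.split(":", 2)
        if len(parts) != 3:
            raise ValueError(f"malformed optional field: {field!r}")
        tag, typ, value = parts
        if typ == "i":
            tags[tag] = int(value)
        elif typ == "f":
            tags[tag] = float(value)
        else:
            # A, Z, H and B values are kept as raw strings in this sketch.
            tags[tag] = value
    return tags


# Made-up optional fields for illustration only.
tags = parse_tags(["NM:i:2", "MD:Z:10A5^AC6", "AS:i:95", "RG:Z:sample1"])
print(tags["NM"], tags["MD"], tags["RG"])  # prints: 2 10A5^AC6 sample1

# Example filter: keep a record only if its edit distance is at most 2.
keep = tags.get("NM", 0) <= 2
```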

With these pieces in hand, you can fully interpret any SAM record:

Who? (QNAME)
Where & how? (RNAME, POS, CIGAR)
Orientation/status? (FLAG)
Quality metrics? (MAPQ, NM, AS)

Understanding SAM’s structure is key to downstream processing (conversion to BAM, sorting, indexing, variant calling, etc.).