SAM Format Deep Dive - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
SAM Format Deep Dive
Once you have your reads aligned (e.g. with `bwa mem` or `bowtie2`), your aligner will output a SAM file. SAM (Sequence Alignment/Map) is a TAB-delimited text format: optional header lines start with `@`, and every other line describes one alignment. Below we break down the key parts.
1. Mandatory Fields
Every alignment line in SAM must have at least these 11 columns:
Column | Name | Description |
---|---|---|
1 | QNAME | Query (read) name / identifier |
2 | FLAG | Bitwise flag encoding orientation + status |
3 | RNAME | Reference sequence name (e.g. chromosome, contig) |
4 | POS | 1-based leftmost mapping position on reference |
5 | MAPQ | Mapping quality (phred-scaled) |
6 | CIGAR | Describes alignment (M/I/D/N/S/H/P/=/X operations) |
7 | RNEXT | Reference name of the mate/next read (“=” if same as RNAME) |
8 | PNEXT | Position of the mate/next read |
9 | TLEN | Observed template length (insert size) |
10 | SEQ | Read sequence |
11 | QUAL | ASCII-encoded base quality scores (Phred+33) |
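As a minimal sketch of how the 11 columns map onto a record, the snippet below splits one alignment line on TABs and converts the numeric fields. The names `SamRecord`, `parse_sam_line`, and `phred_scores` are hypothetical helpers made up for illustration, and the example line is invented data, not output from a real aligner.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns plus any raw optional fields."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: List[str]


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not an '@' header line) into its columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, got {len(fields)}")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        rnext=fields[6],
        pnext=int(fields[7]),
        tlen=int(fields[8]),
        seq=fields[9],
        qual=fields[10],
        tags=fields[11:],
    )


def phred_scores(qual: str) -> List[int]:
    """Decode a QUAL string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in qual]


# Invented example line for illustration only.
example = "read1\t99\tchr1\t100\t60\t10M\t=\t300\t210\tACGTACGTAC\tIIIIIIIIII\tNM:i:0"
rec = parse_sam_line(example)
print(rec.qname, rec.rname, rec.pos, rec.cigar)
```

When reading a whole file, lines starting with `@` would need to be skipped before calling the parser. For real pipelines a dedicated library is usually a better choice than hand-rolled parsing.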
2. SAM FLAG Bits
The FLAG field is a bitwise integer where each bit describes something about the alignment or read:
Bit (hex) | Bit (dec) | Meaning |
---|---|---|
0x1 | 1 | template has multiple segments (paired) |
0x2 | 2 | each segment properly aligned (paired) |
0x4 | 4 | this segment is unmapped |
0x8 | 8 | next segment in template is unmapped |
0x10 | 16 | SEQ is reverse complemented |
0x20 | 32 | SEQ of the next segment is reverse comp. |
0x40 | 64 | this is the first segment in the template |
0x80 | 128 | this is the last segment in the template |
0x100 | 256 | secondary alignment |
0x200 | 512 | read fails platform/vendor quality checks |
0x400 | 1024 | PCR or optical duplicate |
0x800 | 2048 | supplementary alignment |
Tip: To decode a FLAG, convert it to binary and note which bits are set.
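The decoding step can be sketched in a few lines of Python: test each bit from the table with a bitwise AND and collect the ones that are set. The function name `decode_flag` and the short labels in `FLAG_BITS` are hypothetical names chosen for this example.

```python
# Bit values from the FLAG table, with short illustrative labels.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_template",
    0x80: "last_in_template",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# 99 = 64 + 32 + 2 + 1
print(decode_flag(99))
# 147 = 128 + 16 + 2 + 1
print(decode_flag(147))
```

FLAG values 99 and 147 often appear together as the two mates of a properly paired read: one is the first segment on the forward strand, the other the last segment on the reverse strand.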
3. CIGAR Strings
The CIGAR column encodes how the read aligns to the reference:
CIGAR Op | Meaning |
---|---|
M | alignment match (can be a match or mismatch) |
I | insertion to the reference |
D | deletion from the reference |
N | skipped region from the reference (e.g. intron) |
S | soft clipping (clipped sequences present in SEQ) |
H | hard clipping (clipped sequences not present in SEQ) |
P | padding (silent deletion from padded reference) |
= | sequence match |
X | sequence mismatch |
Examples:
- `100M` → 100 bp aligned (match/mismatch)
- `50M10I40M` → 50 bp aligned, 10 bp insertion, 40 bp aligned
- `10S90M` → first 10 bp soft-clipped, 90 bp aligned
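To make the operations concrete, here is a small sketch that splits a CIGAR string into (length, op) pairs and sums how many bases each alignment consumes on the read and on the reference. Per the SAM specification, M, I, S, = and X consume the query, while M, D, N, = and X consume the reference. The helper names (`parse_cigar`, `query_length`, `reference_length`) are hypothetical.

```python
import re

CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> list:
    """Split a CIGAR string into (length, op) tuples; '*' means unavailable."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def query_length(ops: list) -> int:
    """Number of read bases covered (should equal len(SEQ) when SEQ is present)."""
    return sum(n for n, op in ops if op in CONSUMES_QUERY)


def reference_length(ops: list) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(n for n, op in ops if op in CONSUMES_REFERENCE)


for cigar in ("100M", "50M10I40M", "10S90M"):
    ops = parse_cigar(cigar)
    print(cigar, ops, query_length(ops), reference_length(ops))
```

Adding `reference_length` minus one to POS gives the rightmost reference position covered by the alignment, which is handy when checking for overlap with a region of interest.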
4. Optional Tags
After the 11 mandatory columns, you may see arbitrary `TAG:TYPE:VALUE` fields. Common ones include:
TAG | TYPE | Meaning |
---|---|---|
NM | i | Edit distance (# mismatches + indels) between read & reference |
MD | Z | String encoding mismatched positions (reconstruct alignment) |
AS | i | Alignment score (aligner-specific) |
XS | A | Transcript strand (`+`/`-`) for spliced alignments, set by some splice-aware mappers; note that `bwa mem` and `bowtie2` instead emit `XS:i`, a suboptimal alignment score |
RG | Z | Read group identifier (for multi-sample pipelines) |
BC | Z | Barcode sequence (for single-cell / multiplexed data) |
SA | Z | Other supplementary alignments (chimeric / split reads) |
Tip: Use `samtools view -d TAG[:VALUE]` (long form `--tag`, available in recent samtools releases) or `grep` to filter by specific tags.
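For quick inspection in a script, the `TAG:TYPE:VALUE` layout can be unpacked with a simple split. The sketch below converts `i` and `f` values to numbers and leaves every other type (`A`, `Z`, `H`, `B`) as a string. `parse_tags` is a hypothetical helper, and the tag values shown are invented for illustration.

```python
def parse_tags(fields: list) -> dict:
    """Turn optional SAM fields like 'NM:i:2' into a {tag: value} dict."""
    tags = {}
    for field in fields:
        # maxsplit=2 keeps any ':' inside the value intact
        tag, typ, value = field.split(":", 2)
        if typ == "i":
            tags[tag] = int(value)
        elif typ == "f":
            tags[tag] = float(value)
        else:
            # A, Z, H and B types are kept as raw strings in this sketch
            tags[tag] = value
    return tags


# Invented optional fields for illustration only.
optional = ["NM:i:2", "MD:Z:10A5", "AS:i:95", "RG:Z:sample1"]
tags = parse_tags(optional)
print(tags)

# Example filter: keep a record only if its edit distance is small.
keep = tags.get("NM", 0) <= 3
print(keep)
```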
With these pieces in hand, you can fully interpret any SAM record:
- Who? (`QNAME`)
- Where & how? (`RNAME`, `POS`, `CIGAR`)
- Orientation/status? (`FLAG`)
- Quality metrics? (`MAPQ`, `NM`, `AS`)
Understanding SAM’s structure is key to downstream processing (conversion to BAM, sorting, indexing, variant calling, etc.).