4.4 Read Alignment & Pseudo-alignment
Mapping your trimmed reads to a reference (genome or transcriptome) is a key step. You have two main approaches:
- Genome-based alignment (full spliced mapping → BAM)
  - STAR or HISAT2
  - produces BAM files for QC, visualisation, and gene-level counting
- Alignment-free quantification (quasi-mapping → transcript counts)
  - Salmon or Kallisto
  - directly estimates transcript abundances (fast, low RAM)
4.4.1 Genome-based: STAR
Installation
# Install
conda install -c bioconda star
# 1. Index the genome (do once)
mkdir -p ref/star_index
STAR --runThreadN 8 \
--runMode genomeGenerate \
--genomeDir ref/star_index \
--genomeFastaFiles ref/genome.fa \
--sjdbGTFfile ref/annotations.gtf \
--sjdbOverhang 100   # ideally (read length - 1); 100 works well for most data
# 2. Align paired-end reads
mkdir -p align/star
STAR --runThreadN 8 \
--genomeDir ref/star_index \
--readFilesIn trimmed/SampleA_R1.fastq.gz trimmed/SampleA_R2.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix align/star/SampleA. \
--outSAMtype BAM SortedByCoordinate \
--quantMode GeneCounts
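# Outputs:
# - align/star/SampleA.Aligned.sortedByCoord.out.bam (coordinate-sorted BAM)
# - align/star/SampleA.ReadsPerGene.out.tab (raw gene-level counts, from --quantMode GeneCounts)
# - align/star/SampleA.Log.final.out (mapping summary)
# STAR does not index the BAM; create the .bai index with samtools (conda install -c bioconda samtools):
samtools index align/star/SampleA.Aligned.sortedByCoord.out.bam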
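With `--quantMode GeneCounts`, STAR writes a four-column, tab-separated `ReadsPerGene.out.tab` file per sample: gene ID, unstranded counts, forward-stranded counts, and reverse-stranded counts, preceded by four summary rows (`N_unmapped`, `N_multimapping`, `N_noFeature`, `N_ambiguous`). Which column to use depends on your library prep. The helper below is a minimal sketch for pulling out one column; the function name `star_gene_counts` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "gene<TAB>count" from a STAR ReadsPerGene.out.tab file.
#   $1 = path to ReadsPerGene.out.tab
#   $2 = column to keep: 2 = unstranded (default), 3 = forward-stranded, 4 = reverse-stranded
star_gene_counts() {
    local counts_file=$1
    local column=${2:-2}
    # Rows 1-4 are the N_* summary rows, so gene rows start at line 5.
    awk -F'\t' -v col="$column" 'NR > 4 { print $1 "\t" $col }' "$counts_file"
}
```

For example, `star_gene_counts align/star/SampleA.ReadsPerGene.out.tab 4 > SampleA.counts.tsv` would keep the reverse-stranded column, which is the usual choice for dUTP-based stranded libraries. If you are unsure, compare the column totals: the correct strand column normally holds far more reads than the opposite one.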
HISAT2
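# Install (samtools is needed for sorting and indexing below)
conda install -c bioconda hisat2 samtools
# 1. Build the index (the output directory must exist first)
mkdir -p ref/hisat2_index
hisat2-build ref/genome.fa ref/hisat2_index/genome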
# 2. Align reads & sort
mkdir -p align/hisat2
# --dta tailors the reported alignments for transcript assemblers such as StringTie
hisat2 -p 8 \
--dta \
-x ref/hisat2_index/genome \
-1 trimmed/SampleA_R1.fastq.gz \
-2 trimmed/SampleA_R2.fastq.gz \
| samtools sort -@4 -o align/hisat2/SampleA.bam -
samtools index align/hisat2/SampleA.bam
# Outputs:
# - align/hisat2/SampleA.bam (sorted BAM)
# - align/hisat2/SampleA.bam.bai (index)
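The commands above handle a single sample. To process every sample in `trimmed/`, one option is to generate the per-sample command lines first and inspect them before running anything. The sketch below assumes files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, as in this tutorial; the function names `list_samples` and `print_hisat2_commands` are made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list sample names for files named <sample>_R1.fastq.gz in a directory.
list_samples() {
    local dir=$1 r1
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue            # the glob matched nothing
        basename "$r1" _R1.fastq.gz
    done
}

# Hypothetical helper: print (do not run) one HISAT2 + samtools pipeline per sample.
#   $1 = directory with trimmed reads, $2 = HISAT2 index prefix, $3 = output directory
print_hisat2_commands() {
    local dir=$1 index=$2 outdir=$3 sample
    list_samples "$dir" | while read -r sample; do
        printf 'hisat2 -p 8 --dta -x %s -1 %s/%s_R1.fastq.gz -2 %s/%s_R2.fastq.gz | samtools sort -@ 4 -o %s/%s.bam -\n' \
            "$index" "$dir" "$sample" "$dir" "$sample" "$outdir" "$sample"
    done
}
```

A possible usage: `print_hisat2_commands trimmed ref/hisat2_index/genome align/hisat2 > run_hisat2.sh`, then review the file and run it with `bash run_hisat2.sh`. Printing the commands first is a deliberate choice here, since a wrong path is cheaper to spot before an alignment run than after.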
4.4.2 Alignment-free Quantification
Salmon
# Install
conda install -c bioconda salmon
# 1. Index the transcriptome
#    (Salmon ≥ 1.0 builds a single index type, so the old --type quasi option is omitted)
salmon index \
-t ref/transcripts.fa \
-i ref/salmon_index \
--kmerLen 31
# 2. Quantify abundances
mkdir -p quant/salmon/SampleA
salmon quant \
-i ref/salmon_index \
-l A \
-1 trimmed/SampleA_R1.fastq.gz \
-2 trimmed/SampleA_R2.fastq.gz \
-p 8 \
--validateMappings \
-o quant/salmon/SampleA
# Outputs in quant/salmon/SampleA/:
# - quant.sf (per-transcript length, effective length, TPM, estimated read count)
# - lib_format_counts.json (detected library type and strand compatibility)
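`quant.sf` is a tab-separated table with the columns `Name`, `Length`, `EffectiveLength`, `TPM`, and `NumReads`, one row per transcript. For a quick gene-level look you can sum transcripts per gene. The sketch below assumes a two-column, tab-separated transcript-to-gene map (called `tx2gene.tsv` here, a hypothetical file you would derive from your annotation); the function name `sum_to_genes` is also made up. It is a simplified stand-in, not a replacement for tximport, which additionally handles length correction for differential expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum Salmon TPM and NumReads per gene.
#   $1 = tx2gene map (transcript_id <TAB> gene_id, no header; an assumed input)
#   $2 = Salmon quant.sf
# Output: gene_id <TAB> summed TPM <TAB> summed NumReads, sorted by gene_id
sum_to_genes() {
    local tx2gene=$1 quant=$2
    awk -F'\t' '
        NR == FNR { gene[$1] = $2; next }   # first file: load transcript -> gene map
        FNR == 1  { next }                  # second file: skip the quant.sf header
        ($1 in gene) {                      # transcripts missing from the map are ignored
            tpm[gene[$1]]   += $4
            reads[gene[$1]] += $5
        }
        END { for (g in tpm) printf "%s\t%.2f\t%.2f\n", g, tpm[g], reads[g] }
    ' "$tx2gene" "$quant" | sort
}
```

Usage would look like `sum_to_genes ref/tx2gene.tsv quant/salmon/SampleA/quant.sf > SampleA.genes.tsv`. Transcript IDs in the map must match the `Name` column exactly, including any version suffix.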
Kallisto
# Install
conda install -c bioconda kallisto
# 1. Build the index
kallisto index \
-i ref/kallisto.idx \
ref/transcripts.fa
# 2. Quantify (with 100 bootstraps)
mkdir -p quant/kallisto/SampleA
kallisto quant \
-i ref/kallisto.idx \
-o quant/kallisto/SampleA \
-b 100 \
trimmed/SampleA_R1.fastq.gz \
trimmed/SampleA_R2.fastq.gz
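# Outputs in quant/kallisto/SampleA/:
# - abundance.tsv (length, effective length, estimated counts, TPM)
# - abundance.h5 (HDF5: the same estimates plus the bootstrap replicates)
# - run_info.json (run parameters and pseudo-alignment summary)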
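Each Kallisto run writes its own `abundance.tsv` with the columns `target_id`, `length`, `eff_length`, `est_counts`, and `tpm`. To eyeball several samples side by side, the estimated counts can be joined into one matrix. The sketch below assumes the `quant/kallisto/<sample>/abundance.tsv` layout used in this tutorial and takes the sample name from the parent directory; the function name `kallisto_count_matrix` is made up for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a transcript x sample matrix of est_counts.
#   $1 = directory containing one sub-directory per sample, each with abundance.tsv
kallisto_count_matrix() {
    local quant_dir=$1
    local files=("$quant_dir"/*/abundance.tsv)
    [ -e "${files[0]}" ] || { echo "no abundance.tsv under $quant_dir" >&2; return 1; }
    awk -F'\t' '
        FNR == 1 {                              # header row of each file = new sample
            n++
            np = split(FILENAME, parts, "/")
            sample[n] = parts[np - 1]           # parent directory name is the sample
            next
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++m] = $1 }
            counts[$1 SUBSEP n] = $4            # column 4 = est_counts
        }
        END {
            printf "target_id"
            for (j = 1; j <= n; j++) printf "\t%s", sample[j]
            printf "\n"
            for (i = 1; i <= m; i++) {
                printf "%s", order[i]
                for (j = 1; j <= n; j++) {
                    key = order[i] SUBSEP j
                    value = (key in counts) ? counts[key] : 0
                    printf "\t%s", value
                }
                printf "\n"
            }
        }
    ' "${files[@]}"
}
```

For example, `kallisto_count_matrix quant/kallisto > kallisto_est_counts.tsv`. This is meant for inspection only; for differential expression, importing with tximport keeps the effective lengths alongside the counts.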
Comparison
Feature | STAR / HISAT2 | Salmon / Kallisto |
---|---|---|
Splice-aware | Yes | Implicit via transcriptome |
Output | BAM (+ ReadsPerGene.out.tab from STAR) | quant.sf / abundance.tsv |
Speed | Moderate (10–30 min/sample) | Fast (seconds–minutes) |
Memory | High (≥ 16 GB; roughly 32 GB for STAR on a human genome) | Low (4–8 GB) |
Downstream | featureCounts, HTSeq | tximport, tximeta |
Choose STAR/HISAT2 if you need BAMs for QC, visualisation or novel-splice discovery. Opt for Salmon/Kallisto if you only need fast transcript/gene-level quantification.