
4.4 Read Alignment & Pseudo-alignment

Mapping your trimmed reads to a reference (genome or transcriptome) is a key step. You have two main approaches:

  1. Genome-based alignment (full spliced mapping → BAM)
    – STAR or HISAT2
    – produces BAM files for QC, visualisation, and gene-level counting

  2. Alignment-free quantification (pseudo-alignment or selective alignment → transcript abundances)
    – Salmon or Kallisto
    – estimates transcript abundances directly against a transcriptome, without writing a BAM (fast, low RAM)


4.4.1 Genome-based: STAR

Installation

# Install
conda install -c bioconda star

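# 1. Index the genome (do once)
#    --sjdbOverhang should be (read length - 1); 100 is the STAR default
#    and works well for most read lengths
mkdir -p ref/star_index
STAR --runThreadN 8 \
     --runMode genomeGenerate \
     --genomeDir ref/star_index \
     --genomeFastaFiles ref/genome.fa \
     --sjdbGTFfile ref/annotations.gtf \
     --sjdbOverhang 100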

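# 2. Align paired-end reads
#    (on macOS, replace zcat with "gunzip -c")
mkdir -p align/star
STAR --runThreadN 8 \
     --genomeDir ref/star_index \
     --readFilesIn trimmed/SampleA_R1.fastq.gz trimmed/SampleA_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix align/star/SampleA. \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode GeneCounts

# 3. Index the BAM (STAR sorts the BAM but does not index it)
samtools index align/star/SampleA.Aligned.sortedByCoord.out.bam

# Outputs:
# - align/star/SampleA.Aligned.sortedByCoord.out.bam       (coordinate-sorted BAM)
# - align/star/SampleA.Aligned.sortedByCoord.out.bam.bai   (index, written by samtools index)
# - align/star/SampleA.ReadsPerGene.out.tab                (raw gene-level counts)
# - align/star/SampleA.Log.final.out                       (mapping summary)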
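
With --quantMode GeneCounts, STAR writes its counts to <prefix>ReadsPerGene.out.tab. The file starts with four summary rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous), followed by one row per gene with four columns: gene ID, unstranded counts, counts for forward-stranded libraries, and counts for reverse-stranded libraries. Which column to use depends on the library prep. The sketch below is one way to pull out a single column; star_counts is a hypothetical helper name, and the numbers in the toy file are invented purely to show the layout.

```shell
# Hypothetical helper: extract gene ID plus one count column from a STAR
# ReadsPerGene.out.tab file.
#   $1 = path to ReadsPerGene.out.tab
#   $2 = column: 2 (unstranded), 3 (forward-stranded), 4 (reverse-stranded)
star_counts() {
  # NR > 4 skips the four N_* summary rows at the top of the file
  awk -F '\t' -v col="$2" 'NR > 4 { print $1 "\t" $col }' "$1"
}

# Toy input with invented numbers, only to illustrate the file layout.
demo_tab=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
  N_unmapped 120 120 120 \
  N_multimapping 45 45 45 \
  N_noFeature 30 300 35 \
  N_ambiguous 5 1 4 \
  geneA 100 2 98 \
  geneB 50 1 49 > "$demo_tab"

# Reverse-stranded library (e.g. dUTP-based kits): take column 4.
star_counts "$demo_tab" 4
```

If the N_noFeature value is much larger in one stranded column than in the other, as in the toy data, that is usually a sign the other column matches the library's strandedness.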

HISAT2

# Install
conda install -c bioconda hisat2

# 1. Build the index (the output directory must already exist)
mkdir -p ref/hisat2_index
hisat2-build ref/genome.fa ref/hisat2_index/genome

# 2. Align reads & sort
#    --dta reports alignments tailored for transcript assemblers such as StringTie
mkdir -p align/hisat2
hisat2 -p 8 \
       --dta \
       -x ref/hisat2_index/genome \
       -1 trimmed/SampleA_R1.fastq.gz \
       -2 trimmed/SampleA_R2.fastq.gz \
  | samtools sort -@ 4 -o align/hisat2/SampleA.bam -

samtools index align/hisat2/SampleA.bam

# Outputs:
# - align/hisat2/SampleA.bam      (sorted BAM)
# - align/hisat2/SampleA.bam.bai  (index)
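
The commands above process a single sample. For a whole experiment, one common pattern is to derive sample names from the FASTQ file names and loop over them. The sketch below assumes the <sample>_R1.fastq.gz / <sample>_R2.fastq.gz naming used in this tutorial; list_samples is a hypothetical helper, and the loop is a dry run that only prints the command it would use for each sample.

```shell
# Hypothetical helper: print the sample names found in a directory of
# paired FASTQ files named <sample>_R1.fastq.gz / <sample>_R2.fastq.gz.
list_samples() {
  for r1 in "$1"/*_R1.fastq.gz; do
    [ -e "$r1" ] || continue              # no matches: skip the literal glob
    r2="${r1%_R1.fastq.gz}_R2.fastq.gz"
    if [ -e "$r2" ]; then
      basename "$r1" _R1.fastq.gz
    else
      echo "warning: no R2 mate for $r1" >&2
    fi
  done
}

# Dry run on empty placeholder files; a real run would point at trimmed/.
demo_dir=$(mktemp -d)
touch "$demo_dir/SampleA_R1.fastq.gz" "$demo_dir/SampleA_R2.fastq.gz" \
      "$demo_dir/SampleB_R1.fastq.gz" "$demo_dir/SampleB_R2.fastq.gz"

for s in $(list_samples "$demo_dir"); do
  # Only prints the command; to execute it, write the hisat2 | samtools
  # pipeline directly in the loop body instead of inside echo.
  echo "hisat2 -p 8 --dta -x ref/hisat2_index/genome" \
       "-1 trimmed/${s}_R1.fastq.gz -2 trimmed/${s}_R2.fastq.gz" \
       "| samtools sort -@ 4 -o align/hisat2/${s}.bam -"
done
```

Printing the commands first makes it easy to check paths and sample names before committing to a long run.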

4.4.2 Alignment-free Quantification

Salmon

# Install
conda install -c bioconda salmon

# 1. Index the transcriptome
#    (Salmon >= 1.0 has a single index type, so the old --type quasi option is not needed)
salmon index \
  -t ref/transcripts.fa \
  -i ref/salmon_index \
  --kmerLen 31

# 2. Quantify abundances
mkdir -p quant/salmon/SampleA
salmon quant \
  -i ref/salmon_index \
  -l A \
  -1 trimmed/SampleA_R1.fastq.gz \
  -2 trimmed/SampleA_R2.fastq.gz \
  -p 8 \
  --validateMappings \
  -o quant/salmon/SampleA

# Outputs in quant/salmon/SampleA/:
# - quant.sf               (per transcript: Length, EffectiveLength, TPM, NumReads = estimated read counts)
# - lib_format_counts.json (detected library type and strand-compatibility counts)
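
quant.sf is a tab-separated table with the header Name, Length, EffectiveLength, TPM, NumReads, so it is easy to inspect with standard tools. As a minimal sketch, the hypothetical helper expressed_transcripts below lists transcripts at or above a chosen TPM; the threshold of 1 and the toy values are illustrative assumptions, not recommendations.

```shell
# Hypothetical helper: print transcripts from a Salmon quant.sf whose TPM
# is at least a given threshold.
#   $1 = path to quant.sf, $2 = minimum TPM
expressed_transcripts() {
  awk -F '\t' -v min="$2" '
    # Map header names to column positions so column order does not matter
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    $(col["TPM"]) + 0 >= min + 0 { print $(col["Name"]) "\t" $(col["TPM"]) }
  ' "$1"
}

# Toy input with invented values, only to illustrate the file layout.
demo_sf=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  Name Length EffectiveLength TPM NumReads \
  tx1 1500 1320.5 12.4 230 \
  tx2 900 720.5 0.3 4 \
  tx3 2100 1920.5 55.0 1500 > "$demo_sf"

expressed_transcripts "$demo_sf" 1
```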

Kallisto

# Install
conda install -c bioconda kallisto

# 1. Build the index
kallisto index \
  -i ref/kallisto.idx \
  ref/transcripts.fa

# 2. Quantify (with 100 bootstraps)
mkdir -p quant/kallisto/SampleA
kallisto quant \
  -i ref/kallisto.idx \
  -o quant/kallisto/SampleA \
  -b 100 \
  trimmed/SampleA_R1.fastq.gz \
  trimmed/SampleA_R2.fastq.gz

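# Outputs in quant/kallisto/SampleA/:
# - abundance.tsv          (estimated counts, TPM, effective lengths)
# - abundance.h5           (same estimates plus the 100 bootstrap replicates)
# - run_info.json          (run parameters and pseudo-alignment summary)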
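
abundance.tsv has the columns target_id, length, eff_length, est_counts and tpm. A quick sanity check after a run is to count how many transcripts received any reads. The sketch below is one way to do that with awk; kallisto_summary is a hypothetical helper name and the toy values are invented.

```shell
# Hypothetical helper: summarise a kallisto abundance.tsv.
#   $1 = path to abundance.tsv
kallisto_summary() {
  awk -F '\t' '
    NR == 1 { next }                                   # skip the header row
    { n++; total += $4; if ($4 > 0) detected++ }       # column 4 = est_counts
    END { printf "transcripts=%d detected=%d est_counts=%.1f\n", n, detected, total }
  ' "$1"
}

# Toy input with invented values, only to illustrate the file layout.
demo_abund=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
  target_id length eff_length est_counts tpm \
  tx1 1500 1320.5 230.0 12.4 \
  tx2 900 720.5 0 0 \
  tx3 2100 1920.5 1500.5 55.0 > "$demo_abund"

kallisto_summary "$demo_abund"
```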

Comparison

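| Feature | STAR / HISAT2 | Salmon / Kallisto |
|---|---|---|
| Splice-aware | Yes | Implicit via transcriptome |
| Output | BAM (STAR also writes ReadsPerGene.out.tab) | quant.sf / abundance.tsv |
| Speed | Moderate (10–30 min/sample) | Fast (seconds–minutes) |
| Memory | High for STAR (about 32 GB for a human genome); moderate for HISAT2 (about 8 GB) | Low (4–8 GB) |
| Downstream | featureCounts, HTSeq | tximport, tximeta |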

Choose STAR/HISAT2 if you need BAMs for QC, visualisation or novel-splice discovery. Opt for Salmon/Kallisto if you only need fast transcript/gene-level quantification.
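
To summarise Salmon or Kallisto output to gene level, tximport expects a two-column transcript-to-gene table. One way to build it is from the same GTF used for the genome index. The sketch below assumes the GTF contains "transcript" feature rows with gene_id and transcript_id attributes, as Ensembl and GENCODE files do; gtf_to_tx2gene is a hypothetical helper name, and the toy GTF is invented. Transcript IDs must match those in transcripts.fa exactly (including any version suffix), which is worth checking on real data.

```shell
# Hypothetical helper: build a transcript -> gene table from a GTF file.
#   $1 = path to the GTF
gtf_to_tx2gene() {
  awk -F '\t' '
    $3 == "transcript" {
      tx = ""; gene = ""
      # Pull the quoted values out of the attribute column (column 9)
      if (match($9, /transcript_id "[^"]+"/)) tx = substr($9, RSTART + 15, RLENGTH - 16)
      if (match($9, /gene_id "[^"]+"/))       gene = substr($9, RSTART + 9, RLENGTH - 10)
      if (tx != "" && gene != "") print tx "\t" gene
    }
  ' "$1"
}

# Toy GTF with invented records, only to illustrate the attribute format.
demo_gtf=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  chr1 demo gene 1 1000 . + . 'gene_id "G1";' \
  chr1 demo transcript 1 1000 . + . 'gene_id "G1"; transcript_id "T1";' \
  chr1 demo exon 1 500 . + . 'gene_id "G1"; transcript_id "T1";' \
  chr1 demo transcript 200 1000 . + . 'gene_id "G1"; transcript_id "T2";' > "$demo_gtf"

gtf_to_tx2gene "$demo_gtf"
```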