Quantification and Count Matrix Generation - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki
4.6 Quantification & Count Matrix Generation
After QC and (pseudo-)alignment, the next step is to turn your BAMs or transcript-quant files into a gene-by-sample count matrix for differential expression.
4.6.1 Using featureCounts (from Subread)
# Install
conda install -c bioconda subread
# Run featureCounts on all your sorted BAMs
mkdir -p counts/featureCounts
featureCounts \
-T 8 \ # threads
-a ref/annotations.gtf \ # GTF/GFF annotation
-o counts/featureCounts/gene_counts.txt \
align/star/SampleA.Aligned.sortedByCoord.out.bam \
align/star/SampleB.Aligned.sortedByCoord.out.bam \
align/star/SampleC.Aligned.sortedByCoord.out.bam
Outputs
gene_counts.txt
(tab-delimited)
#Geneid Chr Start End Strand Length SampleA SampleB SampleC
ENSG000001 1 11869 14409 + 2530 123 98 110
ENSG000002 1 14404 29570 – 15166 456 512 490
- A summary line reporting total reads, assigned, unassigned, etc.
4.6.2 Using HTSeq-count
# Install
conda install -c bioconda htseq
# One sample at a time (stranded “no” here; adjust -s yes/reverse as needed)
mkdir -p counts/htseq
htseq-count \
--format=bam \
--stranded=no \
--type=exon \
--idattr=gene_id \
align/hisat2/SampleA.bam \
ref/annotations.gtf \
> counts/htseq/SampleA.counts.txt
# Combine per‐sample files into a single matrix in R or with paste/join
- Output per sample:
ENSG000001 130
ENSG000002 478
…
__no_feature 2345
__ambiguous 123
__alignment_not_unique 87
4.6.3 Importing Transcript-Level Quantifications with tximport
If you used Salmon or Kallisto, you’ll get transcript-level estimates. To summarize at the gene level in R:
# In R:
# install.packages("tximport"); BiocManager::install("readr")
library(tximport)
library(readr)
# 1. Define your sample names & quant file paths
samples <- c("SampleA","SampleB","SampleC")
tx2gene <- read_tsv("ref/tx2gene.tsv")
# tx2gene.tsv: two columns: transcript_id gene_id
files <- file.path("quant/salmon", samples, "quant.sf")
names(files) <- samples
# 2. Import & summarize
txi <- tximport(files,
type = "salmon",
tx2gene = tx2gene,
countsFromAbundance = "lengthScaledTPM")
# 3. Extract your count matrix
gene_counts <- txi$counts
# a matrix of genes (rows) × samples (columns)
- tx2gene.tsv example:
transcript_id gene_id
ENST000001 ENSG000001
ENST000002 ENSG000001
ENST000003 ENSG000002
- txi$counts:
SampleA SampleB SampleC
ENSG000001 142.3 130.7 138.1
ENSG000002 512.0 489.4 498.2
4.6.4 Next Steps
- Combine your count matrix with
colData
(metadata) for DE analysis. - Normalize and explore (see next section on TPM/CPM, PCA).
- Store your matrix under
counts/
and version-control the metadata script that generated it.