Quantification and Count Matrix Generation - iffatAGheyas/bioinformatics-tutorial-wiki GitHub Wiki

4.6 Quantification & Count Matrix Generation

After QC and (pseudo-)alignment, the next step is to turn your BAMs or transcript-quant files into a gene-by-sample count matrix for differential expression.


4.6.1 Using featureCounts (from Subread)

# Install
conda install -c bioconda subread

# Run featureCounts on all your sorted BAMs
mkdir -p counts/featureCounts

featureCounts \
  -T 8 \                               # threads
  -a ref/annotations.gtf \             # GTF/GFF annotation
  -o counts/featureCounts/gene_counts.txt \
  align/star/SampleA.Aligned.sortedByCoord.out.bam \
  align/star/SampleB.Aligned.sortedByCoord.out.bam \
  align/star/SampleC.Aligned.sortedByCoord.out.bam

Outputs

  • gene_counts.txt (tab-delimited)
#Geneid      Chr   Start   End     Strand  Length  SampleA  SampleB  SampleC
ENSG000001   1     11869   14409   +       2530    123      98       110
ENSG000002   1     14404   29570   –       15166   456      512      490
  • A summary line reporting total reads, assigned, unassigned, etc.

4.6.2 Using HTSeq-count

# Install
conda install -c bioconda htseq

# One sample at a time (stranded “no” here; adjust -s yes/reverse as needed)
mkdir -p counts/htseq

htseq-count \
  --format=bam \
  --stranded=no \
  --type=exon \
  --idattr=gene_id \
  align/hisat2/SampleA.bam \
  ref/annotations.gtf \
  > counts/htseq/SampleA.counts.txt

# Combine per‐sample files into a single matrix in R or with paste/join

  • Output per sample:
ENSG000001   130
ENSG000002   478
…
__no_feature  2345
__ambiguous    123
__alignment_not_unique   87

4.6.3 Importing Transcript-Level Quantifications with tximport

If you used Salmon or Kallisto, you’ll get transcript-level estimates. To summarize at the gene level in R:

# In R:
# install.packages("tximport"); BiocManager::install("readr")
library(tximport)
library(readr)

# 1. Define your sample names & quant file paths
samples <- c("SampleA","SampleB","SampleC")
tx2gene <- read_tsv("ref/tx2gene.tsv")  
# tx2gene.tsv: two columns: transcript_id   gene_id

files <- file.path("quant/salmon", samples, "quant.sf")
names(files) <- samples

# 2. Import & summarize
txi <- tximport(files, 
                type = "salmon", 
                tx2gene = tx2gene, 
                countsFromAbundance = "lengthScaledTPM")

# 3. Extract your count matrix
gene_counts <- txi$counts
# a matrix of genes (rows) × samples (columns)

  • tx2gene.tsv example:
transcript_id      gene_id
ENST000001         ENSG000001
ENST000002         ENSG000001
ENST000003         ENSG000002

  • txi$counts:
            SampleA   SampleB   SampleC
ENSG000001    142.3     130.7     138.1
ENSG000002    512.0     489.4     498.2

4.6.4 Next Steps

  1. Combine your count matrix with colData (metadata) for DE analysis.
  2. Normalize and explore (see next section on TPM/CPM, PCA).
  3. Store your matrix under counts/ and version-control the metadata script that generated it.