Differential Expression

Once you have a raw count matrix and your sample metadata in hand, the goal of differential expression (DE) analysis is to identify genes whose expression differs significantly between conditions (e.g. treated vs. control) while accounting for replicate variability, batch effects, and dispersion. The tools below normalize internally, so supply raw (un-normalized) integer counts.

1. Overview

Objective: test each gene for a significant change in expression
Key challenges: small counts, overdispersion, multiple testing
Popular methods:
- DESeq2 (negative-binomial GLM)
- edgeR (negative-binomial GLM with empirical Bayes dispersion estimates)
- limma-voom (linear models on log-CPM, with precision weights from the estimated mean-variance trend)
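
To make the overdispersion challenge concrete, the sketch below simulates counts for a single gene under a Poisson model and under a negative-binomial model with the same mean. The mean (100), the dispersion (0.2) and the number of replicates are arbitrary values chosen for illustration, not estimates from real data.

```r
# Illustrative simulation; all parameter values are assumptions
set.seed(1)
n_rep      <- 1000   # many replicates so the sample variances are stable
mu         <- 100    # mean count
dispersion <- 0.2    # NB dispersion (alpha): variance = mu + alpha * mu^2

pois_counts <- rpois(n_rep, lambda = mu)
nb_counts   <- rnbinom(n_rep, mu = mu, size = 1 / dispersion)

# Theoretical variances under each model
var_pois_theory <- mu
var_nb_theory   <- mu + dispersion * mu^2

data.frame(
  model      = c("Poisson", "Negative binomial"),
  theory_var = c(var_pois_theory, var_nb_theory),
  sample_var = c(var(pois_counts), var(nb_counts))
)
```

With the same mean, the negative-binomial counts vary far more than the Poisson counts, which is why count-based DE tools estimate a per-gene dispersion instead of assuming Poisson noise.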

2. Input Data

Count matrix (counts): genes × samples (raw integer counts)
Metadata (coldata): samples × covariates (Condition, Batch, etc.)

# counts.tsv
GeneID   SampleA   SampleB   SampleC   SampleD
Gene1      120        95       130       110
Gene2      450       512       478       500
…

# coldata.tsv (each batch contains both conditions, so Batch is not confounded with Condition)
Sample    Condition   Batch
SampleA   Control     1
SampleB   Control     2
SampleC   Treated     1
SampleD   Treated     2

Load in R:

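counts  <- read.table("counts/counts.tsv", header=TRUE, row.names=1)
coldata <- read.table("metadata/coldata.tsv", header=TRUE, row.names=1)
coldata$Condition <- factor(coldata$Condition, levels=c("Control", "Treated"))  # Control is the reference level
coldata$Batch     <- factor(coldata$Batch)  # treat batch as categorical, not numeric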
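
The count columns and the metadata rows should list the same samples in the same order before any analysis object is built. A minimal sketch of that check, using small in-memory stand-ins for the two tables (the `_toy` names and the deliberately shuffled column order are made up for illustration):

```r
# Toy stand-ins for the tables read above; SampleB and SampleA are swapped on purpose
counts_toy <- data.frame(
  SampleB = c(95, 512),
  SampleA = c(120, 450),
  SampleC = c(130, 478),
  SampleD = c(110, 500),
  row.names = c("Gene1", "Gene2")
)
coldata_toy <- data.frame(
  Condition = c("Control", "Control", "Treated", "Treated"),
  row.names = c("SampleA", "SampleB", "SampleC", "SampleD")
)

# Same set of samples in both tables?
stopifnot(setequal(colnames(counts_toy), rownames(coldata_toy)))

# Reorder the count columns to follow the metadata rows
counts_toy <- counts_toy[, rownames(coldata_toy)]
stopifnot(identical(colnames(counts_toy), rownames(coldata_toy)))
```

Reordering by sample name rather than by position keeps each column attached to the right metadata row even if the files were exported in different orders.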

3. Model Design & Contrasts

Define a design formula. For a simple treated vs. control comparison that adjusts for batch, put the variable of interest last:

design <- ~ Batch + Condition

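Contrasts specify which comparison you want. In DESeq2 a contrast is a character vector giving the factor name, the numerator level, and the denominator level: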

contrast <- c("Condition", "Treated", "Control")
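
A design such as `~ Batch + Condition` can only be fitted when batch and condition are not completely confounded. The sketch below, in base R with hypothetical metadata (`meta_ok`, `meta_confounded` and `is_full_rank` are illustrative names, not part of any package), expands the formula into a model matrix and compares its rank with its number of columns.

```r
# Hypothetical metadata: each batch contains both conditions
meta_ok <- data.frame(
  Condition = factor(c("Control", "Control", "Treated", "Treated"),
                     levels = c("Control", "Treated")),
  Batch     = factor(c(1, 2, 1, 2))
)
# Hypothetical metadata: each condition was processed in its own batch
meta_confounded <- transform(meta_ok, Batch = factor(c(1, 1, 2, 2)))

# TRUE when every coefficient of the design can be estimated
is_full_rank <- function(formula, data) {
  mm <- model.matrix(formula, data = data)
  qr(mm)$rank == ncol(mm)
}

mm_ok <- model.matrix(~ Batch + Condition, data = meta_ok)
colnames(mm_ok)   # "(Intercept)" "Batch2" "ConditionTreated"

is_full_rank(~ Batch + Condition, meta_ok)          # TRUE
is_full_rank(~ Batch + Condition, meta_confounded)  # FALSE
```

The column names of the model matrix are also the coefficient names that matrix-based tools such as edgeR and limma expect.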

4. DESeq2 Workflow

# 4.1 Install / load
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)

# 4.2 Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(
  countData = counts,
  colData   = coldata,
  design    = design
)

# 4.3 Prefilter low counts (optional)
keep <- rowSums(counts(dds)) >= 10
dds  <- dds[keep,]

# 4.4 Run DESeq model
dds <- DESeq(dds)

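# 4.5 Extract results for Treated vs Control
res <- results(dds, contrast=contrast)
# optional shrinkage: the default type="apeglm" takes coef (not contrast) and needs the apeglm package;
# check resultsNames(dds) for the exact coefficient name
res <- lfcShrink(dds, coef="Condition_Treated_vs_Control", type="apeglm", res=res)

# 4.6 Inspect top genes (sorted by adjusted p-value)
head(res[order(res$padj), ])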

Sample output (selected columns only):

GeneID	log2FoldChange	pvalue	padj
Gene45	2.35	1.2e-05	3.4e-05
Gene102	-1.80	2.5e-04	5.1e-04
Gene7	1.10	3.2e-03	8.0e-03
…	…	…	…

5. edgeR Alternative

# Install / load
BiocManager::install("edgeR")
library(edgeR)

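# 5.1 Create DGEList
y <- DGEList(counts=counts, samples=coldata, group=coldata$Condition)

# 5.2 Build a design matrix (edgeR needs a matrix, not a formula), then filter & normalize
design_mat <- model.matrix(design, data=coldata)
keep <- filterByExpr(y, design_mat)
y    <- y[keep, , keep.lib.sizes=FALSE]
y    <- calcNormFactors(y)

# 5.3 Estimate dispersion
y <- estimateDisp(y, design_mat)

# 5.4 Fit GLM & test the Condition coefficient (name as in colnames(design_mat))
fit <- glmFit(y, design_mat)
lrt <- glmLRT(fit, coef="ConditionTreated")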

# 5.5 Top tags
topTags(lrt)

6. limma-voom Option

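# Install / load (edgeR supplies DGEList, filtering and normalization factors)
BiocManager::install(c("limma", "edgeR"))
library(limma)

# 6.1 Calculate log‐CPM and precision weights
#     voom needs a design matrix and uses the normalization factors stored in a DGEList
design_mat <- model.matrix(design, data=coldata)
dge <- edgeR::DGEList(counts=counts)
dge <- dge[edgeR::filterByExpr(dge, design_mat), , keep.lib.sizes=FALSE]
dge <- edgeR::calcNormFactors(dge)
v   <- voom(dge, design_mat, plot=TRUE)

# 6.2 Fit linear model
fit <- lmFit(v, design_mat)
fit <- eBayes(fit)

# 6.3 Extract DE results (coefficient name as in colnames(design_mat))
topTable(fit, coef="ConditionTreated")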
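
The log-CPM scale that voom works on can be approximated by hand. The base-R sketch below applies the transformation log2((count + 0.5) / (library size + 1) × 10^6) to the two example genes from section 2. It ignores normalization factors and precision weights, and the library sizes are just the totals of these two genes, so treat it as an illustration of the scale, not a substitute for `voom()`.

```r
# Toy counts copied from the example table in section 2
toy_counts <- matrix(
  c(120,  95, 130, 110,
    450, 512, 478, 500),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Gene1", "Gene2"),
                  c("SampleA", "SampleB", "SampleC", "SampleD"))
)

lib_size <- colSums(toy_counts)   # here: total of just these two genes

# Offsets of 0.5 and 1 avoid taking the log of zero;
# transposing lets each sample be divided by its own library size
log_cpm <- log2(t((t(toy_counts) + 0.5) / (lib_size + 1)) * 1e6)
round(log_cpm, 2)
```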

7. Result Filtering & Export

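# DESeq2 example: significant calls (padj < 0.05 and at least a two-fold change)
sig <- subset(res, padj < 0.05 & abs(log2FoldChange) >= 1)
dir.create("results", showWarnings=FALSE)  # write.csv does not create the directory
write.csv(as.data.frame(sig), "results/DE_genes.csv")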
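
The `padj` column reported by DESeq2 holds p-values adjusted with the Benjamini-Hochberg procedure by default, and `abs(log2FoldChange) >= 1` corresponds to at least a two-fold change in either direction. A base-R sketch of both ideas, using made-up p-values and fold changes (the `toy` data frame is purely illustrative):

```r
toy <- data.frame(
  gene           = paste0("g", 1:6),
  log2FoldChange = c(2.4, -1.8, 0.3, 1.1, -0.2, 0.9),
  pvalue         = c(0.0001, 0.004, 0.03, 0.01, 0.6, 0.02)
)

# Benjamini-Hochberg adjustment
toy$padj <- p.adjust(toy$pvalue, method = "BH")

# The same adjustment by hand: p * n / rank, then enforce monotonicity
n      <- nrow(toy)
ord    <- order(toy$pvalue, decreasing = TRUE)
manual <- numeric(n)
manual[ord] <- pmin(1, cummin(toy$pvalue[ord] * n / (n:1)))
stopifnot(isTRUE(all.equal(manual, toy$padj)))

# Fold change on the linear scale
toy$fold_change <- 2^toy$log2FoldChange

sig_toy <- subset(toy, padj < 0.05 & abs(log2FoldChange) >= 1)
sig_toy$gene   # "g1" "g2" "g4"
```

Gene `g6` passes the adjusted p-value cutoff but not the fold-change cutoff, which shows how the two filters act independently.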

8. Visualization

MA‐plot:

plotMA(res, ylim=c(-5,5), main="DESeq2 MA-plot")

Volcano‐plot (using EnhancedVolcano):

BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)

EnhancedVolcano(res,
                lab = rownames(res),
                x = 'log2FoldChange',
                y = 'padj',
                pCutoff = 0.05,
                FCcutoff = 1)

Heatmap of top DE genes:

library(pheatmap)
vsd <- vst(dds, blind=FALSE)
topGenes <- head(order(res$padj), 30)
mat <- assay(vsd)[topGenes, ]
pheatmap(mat,
         annotation_col=coldata,
         scale="row",
         show_rownames=TRUE)
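
`scale="row"` converts each gene's values to z-scores across samples, so the heatmap colors show relative rather than absolute expression. The equivalent computation in base R, on a toy matrix with made-up values:

```r
# Illustrative matrix of transformed expression values (genes x samples)
toy_mat <- matrix(
  c(5.0, 5.2, 8.1, 7.9,
    9.5, 9.1, 6.0, 6.4),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("GeneA", "GeneB"),
                  c("SampleA", "SampleB", "SampleC", "SampleD"))
)

# scale() standardizes columns, so transpose, scale, and transpose back
z <- t(scale(t(toy_mat)))

round(rowMeans(z), 10)   # each row is centered on 0
apply(z, 1, sd)          # and has unit standard deviation
```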

Next:

Proceed to Functional Enrichment to interpret your DE gene list in terms of pathways or GO terms.
Always validate a few top hits experimentally if possible.