Differential Expression
Once you have a raw count matrix and your sample metadata in hand, the goal of differential expression (DE) analysis is to identify genes whose expression differs significantly between conditions (e.g. treated vs. control), while accounting for replicates, batch effects, and dispersion. DESeq2 and edgeR expect raw (un-normalized) integer counts and handle library-size normalization internally.
1. Overview
- Objective: test each gene for a significant change in expression
- Key challenges: small counts, overdispersion, multiple testing
- Popular methods:
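  - DESeq2 (negative-binomial GLM with shrunken dispersion estimates)
  - edgeR (negative-binomial models with empirical Bayes dispersion estimates)
  - limma-voom (linear models on log-CPM, with precision weights derived from the mean-variance trend)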
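The multiple-testing challenge listed above can be illustrated with base R alone. The sketch below applies the Benjamini-Hochberg (BH) adjustment, which DESeq2, edgeR, and limma all use by default for their adjusted p-values. The five p-values are hypothetical and chosen only to show how the adjustment changes the calls.

```r
# Hypothetical raw p-values for five genes (illustrative, not real data)
pvals <- c(0.001, 0.008, 0.039, 0.041, 0.20)

# Benjamini-Hochberg adjustment controls the false discovery rate
padj <- p.adjust(pvals, method = "BH")

# Compare raw and adjusted calls at a 0.05 threshold
data.frame(pvalue = pvals, padj = padj,
           raw_sig = pvals < 0.05, fdr_sig = padj < 0.05)
```

In this toy case four genes pass the raw threshold but only two survive the adjustment, which is why the later filtering steps use `padj` rather than `pvalue`.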
2. Input Data
- Count matrix (`counts`): genes × samples (raw integer counts)
- Metadata (`coldata`): samples × covariates (Condition, Batch, etc.)
# counts.tsv
GeneID SampleA SampleB SampleC SampleD
Gene1 120 95 130 110
Gene2 450 512 478 500
…
# coldata.tsv
Sample Condition Batch
SampleA Control 1
SampleB Control 2
SampleC Treated 1
SampleD Treated 2
Load in R:
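counts <- as.matrix(read.table("counts/counts.tsv", header=TRUE, row.names=1, check.names=FALSE))
coldata <- read.table("metadata/coldata.tsv", header=TRUE, row.names=1)
coldata$Batch <- factor(coldata$Batch) # batch is a category, not a number
coldata$Condition <- factor(coldata$Condition, levels=c("Control", "Treated")) # Control is the reference level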
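Before building any model it helps to confirm that the two tables describe the same samples in the same order, since DESeq2 expects the columns of the count matrix to line up with the rows of the metadata. The following is a minimal sketch using in-memory stand-ins for the two files; the counts and conditions are copied from the example tables, and the metadata rows are deliberately shuffled to show the reordering step.

```r
# Toy stand-ins for the two tables (in practice these come from read.table)
counts <- matrix(c(120,  95, 130, 110,
                   450, 512, 478, 500),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Gene1", "Gene2"),
                                 c("SampleA", "SampleB", "SampleC", "SampleD")))
coldata <- data.frame(Condition = c("Treated", "Control", "Control", "Treated"),
                      row.names = c("SampleC", "SampleA", "SampleB", "SampleD"))

# Every sample must appear in both tables
stopifnot(setequal(colnames(counts), rownames(coldata)))

# Reorder the metadata rows to match the count columns
coldata <- coldata[colnames(counts), , drop = FALSE]
stopifnot(identical(rownames(coldata), colnames(counts)))

# Counts should be non-negative whole numbers
stopifnot(all(counts >= 0), all(counts == round(counts)))
```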
3. Model Design & Contrasts
Define a design formula. For a simple treated vs. control with batch:
design <- ~ Batch + Condition
Contrasts specify which comparison you want:
contrast <- c("Condition", "Treated", "Control")
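A design such as `~ Batch + Condition` can only be fitted when batch and condition are not perfectly confounded, that is, when each condition is represented in more than one batch. The sketch below uses base R's `model.matrix` to show the coefficients the formula produces and to test whether the design is full rank. The helper `check_full_rank` and both sample tables are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: TRUE when every coefficient in the design is estimable
check_full_rank <- function(coldata) {
  mm <- model.matrix(~ Batch + Condition, data = coldata)
  qr(mm)$rank == ncol(mm)
}

# Each condition appears in both batches
balanced <- data.frame(
  Condition = factor(c("Control", "Control", "Treated", "Treated")),
  Batch     = factor(c(1, 2, 1, 2))
)

# Batch and condition coincide, so their effects cannot be separated
confounded <- transform(balanced, Batch = factor(c(1, 1, 2, 2)))

# Coefficient names produced by the formula
design_cols <- colnames(model.matrix(~ Batch + Condition, data = balanced))
design_cols

check_full_rank(balanced)    # TRUE
check_full_rank(confounded)  # FALSE
```

The coefficient named `ConditionTreated` is the Treated vs. Control effect after adjusting for batch; it exists under that name because `Control` sorts first and therefore becomes the reference level.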
4. DESeq2 Workflow
# 4.1 Install / load
if (!requireNamespace("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)
# 4.2 Create DESeqDataSet
dds <- DESeqDataSetFromMatrix(
countData = counts,
colData = coldata,
design = design
)
# 4.3 Prefilter low counts (optional)
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
# 4.4 Run DESeq model
dds <- DESeq(dds)
# 4.5 Extract results for Treated vs Control
res <- results(dds, contrast=contrast)
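res <- lfcShrink(dds, contrast=contrast, res=res, type="ashr") # optional shrinkage; the default type="apeglm" does not accept `contrast`, type="ashr" needs the ashr package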
# 4.6 Inspect top genes (sorted by adjusted p-value)
head(res[order(res$padj), ])
Sample output (selected columns):
GeneID | log2FoldChange | pvalue | padj |
---|---|---|---|
Gene45 | 2.35 | 1.2e-05 | 3.4e-05 |
Gene102 | -1.80 | 2.5e-04 | 5.1e-04 |
Gene7 | 1.10 | 3.2e-03 | 8.0e-03 |
… | … | … | … |
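The `log2FoldChange` column is on a log2 scale, so a value of 1 means a doubling and a value of -1 a halving. As a small worked example, the sketch below converts the three values from the sample table to linear fold changes. The `signed_fold` column is one common reporting convention, not something DESeq2 outputs.

```r
# log2 fold changes taken from the sample table above
log2fc <- c(Gene45 = 2.35, Gene102 = -1.80, Gene7 = 1.10)

# Linear fold change, Treated relative to Control
fold <- 2^log2fc

# Signed convention: report down-regulation as a negative fold change
signed_fold <- ifelse(log2fc >= 0, fold, -1 / fold)

round(data.frame(log2fc, fold, signed_fold), 2)
```

Gene45 is therefore roughly 5-fold higher in treated samples, while Gene102 is roughly 3.5-fold lower.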
5. edgeR Alternative
# Install / load
BiocManager::install("edgeR")
library(edgeR)
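# 5.1 Create DGEList and a design matrix (edgeR needs a matrix, not a formula)
y <- DGEList(counts=counts, samples=coldata, group=coldata$Condition)
design_mat <- model.matrix(design, data=coldata)
# 5.2 Filter & normalize
keep <- filterByExpr(y, design_mat)
y <- y[keep, , keep.lib.sizes=FALSE]
y <- calcNormFactors(y)
# 5.3 Estimate dispersion
y <- estimateDisp(y, design_mat)
# 5.4 Fit GLM & test the Treated vs. Control coefficient
fit <- glmFit(y, design_mat)
lrt <- glmLRT(fit, coef="ConditionTreated") # column name from model.matrix, with Control as reference
# 5.5 Top tags
topTags(lrt)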
6. limma-voom Option
# Install / load
BiocManager::install("limma")
library(limma)
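# 6.1 Calculate log-CPM and precision weights (voom needs a design matrix, not a formula)
# Filtering low-count genes first, as in section 5, is advisable
design_mat <- model.matrix(design, data=coldata)
v <- voom(counts, design_mat, plot=TRUE)
# 6.2 Fit linear model
fit <- lmFit(v, design_mat)
fit <- eBayes(fit)
# 6.3 Extract DE results for the Treated vs. Control coefficient
topTable(fit, coef="ConditionTreated") # column name from model.matrix, with Control as reference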
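To make the log-CPM step concrete, the sketch below reproduces in base R the transform that voom documents, log2((count + 0.5) / (library size + 1) × 10^6), on the two example genes from the Input Data section. It covers only this transform; the mean-variance trend and the precision weights that voom also computes are not reproduced, and the library sizes here are simply column sums of the toy matrix.

```r
# Toy count matrix (values from the example counts.tsv)
counts <- matrix(c(120,  95, 130, 110,
                   450, 512, 478, 500),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Gene1", "Gene2"),
                                 c("SampleA", "SampleB", "SampleC", "SampleD")))

# Library size per sample (no normalization factors in this sketch)
lib_size <- colSums(counts)

# Offsets of 0.5 and 1 avoid taking the log of zero
log_cpm <- log2(t((t(counts) + 0.5) / (lib_size + 1)) * 1e6)

round(log_cpm, 2)
```

With only two genes the library sizes are unrealistically small, so the absolute values are far higher than in a real experiment; the point is the shape of the calculation.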
7. Result Filtering & Export
# DESeq2 example: significant calls
sig <- subset(res, padj < 0.05 & abs(log2FoldChange) >= 1)
write.csv(as.data.frame(sig), "results/DE_genes.csv")
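DESeq2 can report `NA` in the `padj` column, for example for genes removed by its independent filtering, and a plain logical index would carry those rows along as `NA` rows. The sketch below shows an NA-safe version of the same threshold on an ordinary data frame. The table `res_df` is a hypothetical stand-in for `as.data.frame(res)`, and `GeneX` and `GeneY` are made-up rows added for illustration.

```r
# Hypothetical results table; GeneY has no adjusted p-value
res_df <- data.frame(
  log2FoldChange = c(2.35, -1.80, 0.40, 3.10),
  padj           = c(3.4e-05, 5.1e-04, 0.20, NA),
  row.names      = c("Gene45", "Gene102", "GeneX", "GeneY")
)

# The comparison yields NA wherever padj is NA
hit <- res_df$padj < 0.05 & abs(res_df$log2FoldChange) >= 1
hit

# which() keeps only the rows where the condition is TRUE
sig <- res_df[which(hit), ]
sig
```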
8. Visualization
MA‐plot:
plotMA(res, ylim=c(-5,5), main="DESeq2 MA-plot")
Volcano‐plot (using EnhancedVolcano):
BiocManager::install("EnhancedVolcano")
library(EnhancedVolcano)
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'padj',
pCutoff = 0.05,
FCcutoff = 1)
Heatmap of top DE genes:
library(pheatmap)
vsd <- vst(dds, blind=FALSE)
topGenes <- head(order(res$padj), 30)
mat <- assay(vsd)[topGenes, ]
pheatmap(mat,
annotation_col=coldata,
scale="row",
show_rownames=TRUE)
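The `scale="row"` argument converts each gene to z-scores across samples, so the colours show relative rather than absolute expression. A base R sketch of that row-wise scaling follows; the matrix is hypothetical and stands in for the variance-stabilized values in `mat`.

```r
# Hypothetical transformed expression values for two genes
mat <- matrix(c( 5.0,  6.0,  9.0, 10.0,
                12.0, 12.5, 11.0, 11.5),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Gene45", "Gene102"),
                              c("SampleA", "SampleB", "SampleC", "SampleD")))

# scale() works on columns, so transpose, scale, and transpose back
z <- t(scale(t(mat)))

round(z, 2)

# Each row now has mean 0 and standard deviation 1
rowMeans(z)
apply(z, 1, sd)
```

Because every row is rescaled independently, a highly expressed gene and a weakly expressed gene with the same pattern across samples look identical in the heatmap.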
Next:
- Proceed to Functional Enrichment to interpret your DE gene list in terms of pathways or GO terms.
- Always validate a few top hits experimentally if possible.