Week 10: GSEA - bcb420-2025/Izumi_Ando GitHub Wiki

⏰ expected - 2 hours : actual 2.75 hours

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles

this is the original gsea paper: it lays out why single-gene stats can miss the forest for the trees and how gene sets tell a better story

Citation

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545–15550. https://doi.org/10.1073/pnas.0506580102

Notes

why gsea

single-gene DE tests can miss subtle but coordinated changes
sometimes nothing passes FDR cutoff, or you just get a long, unstructured list of genes
same system studied by two groups = different top genes with little overlap, but the underlying biology might still match
ex: a ~20% increase across all genes in a pathway might be more biologically important than a 20-fold increase in one gene

the method itself

genes are ranked by correlation to phenotype (or any other suitable ranking metric)
enrichment score (ES) is a running sum down the ranked list: step up when you hit a gene in the set (step size weighted by that gene's correlation), step down otherwise
max deviation from zero = ES
does a permutation test (shuffling class labels, which keeps gene-gene correlations intact) to assess significance of the ES
adjusts for multiple testing across gene sets using FDR rather than FWER, since FWER is too conservative here
also normalizes ES for gene set size to get NES
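
rough python sketch of the steps above for a single gene set, mostly to make the running sum concrete. this is not the official implementation: the function names, the signal-to-noise ranking metric, the 0/1 label coding, and the weighting exponent p=1 are my assumptions for illustration. FDR is left out because it needs the NES values from many gene sets at once.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """ES and running sum for one gene set over a ranked list (top-ranked first)."""
    ranked_genes = np.asarray(ranked_genes)
    scores = np.asarray(scores, dtype=float)
    in_set = np.isin(ranked_genes, list(gene_set))
    n_hit = int(in_set.sum())
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    # hits step up in proportion to |score|**p; misses step down by a constant
    weights = np.abs(scores) ** p * in_set
    running = np.cumsum(weights / weights.sum() - (~in_set) / n_miss)
    es = running[np.argmax(np.abs(running))]  # max deviation from zero, sign kept
    return float(es), running

def signal_to_noise(expr, labels):
    """Assumed ranking metric: (mean1 - mean0) / (sd1 + sd0), per gene (rows)."""
    a, b = expr[:, labels == 1], expr[:, labels == 0]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

def gsea_one_set(expr, labels, genes, gene_set, n_perm=1000, seed=0):
    """Observed ES, NES, and nominal p-value from class-label permutations."""
    rng = np.random.default_rng(seed)

    def es_for(lab):
        s = signal_to_noise(expr, lab)
        order = np.argsort(-s)  # most positively correlated genes first
        return enrichment_score(genes[order], s[order], gene_set)[0]

    observed = es_for(labels)
    null = np.array([es_for(rng.permutation(labels)) for _ in range(n_perm)])
    # compare only against the null ES values with the same sign as the observed ES
    same_sign = np.abs(null[np.sign(null) == np.sign(observed)])
    if same_sign.size == 0:
        return observed, float("nan"), float("nan")
    nes = observed / same_sign.mean()
    p_value = float((same_sign >= abs(observed)).mean())
    return observed, float(nes), p_value
```

here `expr` is assumed to be a genes x samples numpy array, `labels` a 0/1 numpy array over samples, and `genes` a numpy array of gene ids in row order. dividing by the mean of the same-sign null ES values is what makes scores comparable across gene sets of different sizes.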

example of GSEA result visualization, from figure 1 of paper

things i liked

concept of the leading-edge subset: the core genes in a set that drive the enrichment (for a positive ES, the set members ranked at or before the point where the running sum peaks)
these leading-edge genes often turn out to be biologically meaningful on their own
introduced MSigDB with cytogenetic, functional, motif, and neighborhood-based gene sets
still holds up 20 yrs later
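
small python sketch of how the leading-edge subset mentioned above could be pulled out once you have the running sum for a gene set. the function name and arguments are hypothetical; it just assumes the ranked gene list and a running sum of the same length, and it mirrors the positive-ES rule for a negative ES by taking the set members after the trough.

```python
import numpy as np

def leading_edge(ranked_genes, running_sum, gene_set):
    """Set members that contribute to the ES peak (or trough, for a negative ES)."""
    running_sum = np.asarray(running_sum, dtype=float)
    members = set(gene_set)
    peak = int(np.argmax(np.abs(running_sum)))  # position of max deviation from zero
    if running_sum[peak] >= 0:
        window = ranked_genes[: peak + 1]  # top of the list through the peak
    else:
        window = ranked_genes[peak:]       # from the trough to the bottom of the list
    return [g for g in window if g in members]
```

if the running sum is tied at two positions, `np.argmax` takes the first one, so the subset comes out as small as possible in that case.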