Week 10: GSEA - bcb420-2025/Izumi_Ando GitHub Wiki
⏰ expected - 2 hours : actual 2.75 hours
Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles
this is the original GSEA paper. it lays out why single-gene stats can miss the forest for the trees and how looking at sets of genes tells a better story
Citation
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545–15550. https://doi.org/10.1073/pnas.0506580102
Notes
why gsea
- single-gene DE tests can miss subtle but coordinated changes
- sometimes nothing passes FDR cutoff, or you just get a long, unstructured list of genes
- two groups studying the same system can end up with top-gene lists that barely overlap, even when the underlying biology matches
- ex from the paper: a 20% increase across all genes in a metabolic pathway may matter more biologically than a 20-fold increase in a single gene
the method itself
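- genes are ranked by correlation of their expression with the phenotype (any suitable ranking metric works)
- enrichment score (ES) is a running sum down the ranked list: it steps up when you hit a gene in the set (step size weighted by that gene's correlation) and steps down otherwise
- ES = the maximum deviation of the running sum from zero (can be positive or negative)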
- does a permutation test (by shuffling class labels) to assess significance
- normalizes ES for gene set size to get NES, so scores are comparable across gene sets
- adjusts for multiple testing across gene sets using FDR on the NES, not FWER (FWER is too conservative here)
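quick sketch of the running sum and the normalization step from the list above, in plain Python. assumptions (mine, not taken from the paper's software): the function names `enrichment_score` and `normalize_es` are made up, the toy genes and correlation values are invented, and the null ES values are hard-coded where a real run would recompute ES after each class-label shuffle. the step weighting (hits step up by |r|^p over the set's total, misses step down by 1/(N - N_H), with p = 1) follows the paper's appendix as i read it.

```python
from statistics import mean


def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set.

    ranked_genes: gene names sorted by decreasing correlation with phenotype
    scores: the correlation values, in the same order
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == len(ranked_genes):
        raise ValueError("gene set must cover some, but not all, ranked genes")
    hit_total = sum(abs(r) ** p for r, hit in zip(scores, in_set) if hit)
    if hit_total == 0:
        raise ValueError("all set members have zero weight")
    miss_step = 1.0 / (len(ranked_genes) - n_hits)

    running, curve = 0.0, []
    for r, hit in zip(scores, in_set):
        # step up (weighted) on a hit, step down (uniform) on a miss
        running += abs(r) ** p / hit_total if hit else -miss_step
        curve.append(running)
    # ES is the point of maximum deviation from zero, sign kept
    return max(curve, key=abs), curve


def normalize_es(es, null_es):
    """Return (NES, nominal p) using only the null values with the same sign as ES."""
    same_sign = [x for x in null_es if (x >= 0) == (es >= 0)]
    if not same_sign:
        return None, None
    scale = mean([abs(x) for x in same_sign])
    nes = es / scale if scale else None
    p_value = sum(abs(x) >= abs(es) for x in same_sign) / len(same_sign)
    return nes, p_value


# toy data (invented): six genes, already ranked, set = the top two
genes = ["A", "B", "C", "D", "E", "F"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es, curve = enrichment_score(genes, corr, {"A", "B"})

# placeholder null distribution; in practice, one ES per label permutation
null_es = [0.5, -0.4, 0.25, 0.25, -0.6]
nes, p_value = normalize_es(es, null_es)
print(round(es, 3), round(nes, 3), p_value)
```

setting `p=0` makes every hit count equally, which is the unweighted Kolmogorov-Smirnov-style version. a p of exactly 0 from a handful of permutations just means "smaller than 1 / number of same-sign permutations", not literally zero.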
example of GSEA result visualization, from figure 1 of paper
things i liked
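- concept of the leading-edge subset: the set members that appear in the ranked list at or before the point where the running sum hits its maximum deviation, i.e. the core genes in a pathway that drive the enrichment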
- these leading-edge genes often turn out to be biologically meaningful on their own
- introduced MSigDB: an initial 1,325 gene sets in four collections (cytogenetic, functional, regulatory-motif, and neighborhood-based)
- still holds up 20 yrs later
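the leading-edge idea from this list can be sketched as a small self-contained function. assumptions: `leading_edge` is a hypothetical name and the toy genes and scores are invented. the paper defines the subset as members at or before the peak of the running sum; for a negative ES i'm assuming the mirrored rule (members at or after the trough), which is my reading and not something the notes above state.

```python
def leading_edge(ranked_genes, scores, gene_set, p=1.0):
    """Return the set members that sit on the 'enriched' side of the ES peak."""
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    if n_hits == 0 or n_hits == len(ranked_genes):
        raise ValueError("gene set must cover some, but not all, ranked genes")
    hit_total = sum(abs(r) ** p for r, hit in zip(scores, in_set) if hit)
    if hit_total == 0:
        raise ValueError("all set members have zero weight")
    miss_step = 1.0 / (len(ranked_genes) - n_hits)

    # same weighted running sum used for the enrichment score
    running, curve = 0.0, []
    for r, hit in zip(scores, in_set):
        running += abs(r) ** p / hit_total if hit else -miss_step
        curve.append(running)

    # index where the running sum deviates most from zero
    peak = max(range(len(curve)), key=lambda i: abs(curve[i]))
    if curve[peak] >= 0:
        window = range(0, peak + 1)               # at or before the peak
    else:
        window = range(peak, len(ranked_genes))   # assumed rule for negative ES
    return [ranked_genes[i] for i in window if in_set[i]]


# toy data (invented): G is in the set but falls after the peak
genes = ["A", "B", "C", "D", "E", "F", "G", "H"]
corr = [4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0]
core = leading_edge(genes, corr, {"A", "C", "G"})
print(core)
```

in this toy case the running sum peaks at C, so G is a set member that does not make it into the leading edge.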