Week 10: GSEA - bcb420-2025/Izumi_Ando GitHub Wiki

⏰ expected - 2 hours : actual 2.75 hours

Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles

this is the original gsea paper—lays out why single-gene stats can miss the forest for the trees and how sets of genes tell a better story

Citation

Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., ... & Mesirov, J. P. (2005). Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. PNAS, 102(43), 15545–15550. https://doi.org/10.1073/pnas.0506580102

Notes

why gsea

  • single-gene DE tests can miss subtle but coordinated changes
  • sometimes nothing passes FDR cutoff, or you just get a long, unstructured list of genes
  • same system studied by two groups = diff top genes, but the underlying biology might still match
  • ex from the paper: a 20% increase across all genes in a metabolic pathway may matter more biologically than a 20-fold increase in a single gene

the method itself

  • genes are ranked by correlation with the phenotype (or any other suitable ranking metric)
  • enrichment score (ES) is a running sum down the ranked list: it steps up when you hit a gene in the set (weighted by that gene's correlation), down otherwise
  • ES = the maximum deviation of the running sum from zero
  • significance comes from a permutation test that shuffles the class (phenotype) labels, which keeps the gene-gene correlations intact
  • adjusts for multiple testing using FDR rather than FWER (FWER was too conservative)
  • also normalizes ES for gene set size to get NES
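
to make the steps above concrete for myself, here is a minimal sketch in Python/NumPy. it is not the official GSEA implementation, and it handles one gene set only, so the FDR step across many sets is left out. the function names (`enrichment_score`, `signal_to_noise`, `gsea_one_set`), the 0/1 label encoding, and the default of 1000 permutations are my own assumptions for illustration.

```python
import numpy as np

def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum ES for one gene set (sketch).

    ranked_genes: gene ids sorted by ranking metric, highest first
    scores: ranking-metric values in the same order
    p: weight exponent; p=0 reduces to the unweighted Kolmogorov-Smirnov-style walk
    """
    ranked_genes = np.asarray(ranked_genes)
    in_set = np.isin(ranked_genes, list(gene_set))
    n_miss = len(ranked_genes) - in_set.sum()
    if in_set.sum() == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list without covering it")
    # hits step up in proportion to |score|^p, misses step down uniformly
    hit_weight = np.where(in_set, np.abs(np.asarray(scores, dtype=float)) ** p, 0.0)
    running = np.cumsum(hit_weight / hit_weight.sum() - (~in_set) / n_miss)
    es = running[np.argmax(np.abs(running))]  # max deviation from zero, sign kept
    return es, running

def signal_to_noise(expr, labels):
    """Per-gene ranking metric; expr is genes x samples, labels is a 0/1 array."""
    a, b = expr[:, labels == 1], expr[:, labels == 0]
    return (a.mean(axis=1) - b.mean(axis=1)) / (a.std(axis=1, ddof=1) + b.std(axis=1, ddof=1))

def gsea_one_set(expr, labels, genes, gene_set, n_perm=1000, seed=0):
    """Observed ES, NES, and nominal p-value from label permutations (sketch)."""
    rng = np.random.default_rng(seed)
    genes = np.asarray(genes)

    def es_for(lab):
        s = signal_to_noise(expr, lab)
        order = np.argsort(-s)  # re-rank genes for this labelling
        return enrichment_score(genes[order], s[order], gene_set)[0]

    observed = es_for(labels)
    null = np.array([es_for(rng.permutation(labels)) for _ in range(n_perm)])
    # compare only against the part of the null with the same sign as the observed ES
    same_sign = null[np.sign(null) == np.sign(observed)]
    if same_sign.size == 0:
        return observed, float("nan"), float("nan")
    p_value = float(np.mean(np.abs(same_sign) >= abs(observed)))
    nes = observed / np.mean(np.abs(same_sign))  # normalize by mean same-sign null ES
    return observed, nes, p_value
```

note that the labels are shuffled and the genes re-ranked on every permutation, rather than shuffling the genes themselves, which is what keeps the correlation structure between genes in the null.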

[image: example of GSEA result visualization, from figure 1 of the paper]

things i liked

  • concept of the leading-edge subset — the core set of genes in a pathway that drives enrichment
  • these leading-edge genes often turn out to be biologically meaningful on their own
  • introduced MSigDB with cytogenetic, functional, motif, and neighborhood-based gene sets
  • still holds up 20 yrs later
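
since the leading-edge subset was my favourite part, here is a rough sketch of how it could be pulled out of a running sum. the paper defines it as the genes in the set that appear at or before the point of maximum deviation; treating a negative ES symmetrically (set members after the trough) is my assumption, as are the function names and the toy gene ids. the running sum here is unweighted to keep the example short.

```python
import numpy as np

def unweighted_running_sum(ranked_genes, gene_set):
    """Simplified running sum: equal step up per hit, equal step down per miss."""
    in_set = np.isin(np.asarray(ranked_genes), list(gene_set))
    steps = np.where(in_set, 1.0 / in_set.sum(), -1.0 / (~in_set).sum())
    return np.cumsum(steps)

def leading_edge(ranked_genes, gene_set, running):
    """Set members that contribute to the ES peak (sketch)."""
    ranked_genes = np.asarray(ranked_genes)
    in_set = np.isin(ranked_genes, list(gene_set))
    peak = int(np.argmax(np.abs(running)))
    if running[peak] >= 0:
        # positive ES: set members at or before the peak
        return ranked_genes[: peak + 1][in_set[: peak + 1]].tolist()
    # negative ES (assumed symmetric): set members after the trough
    return ranked_genes[peak:][in_set[peak:]].tolist()

ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]  # toy ranked list, best first
top_set = {"g1", "g2", "g5"}
core = leading_edge(ranked, top_set, unweighted_running_sum(ranked, top_set))
print(core)  # ['g1', 'g2']
```

in the toy example g5 is in the set but sits past the peak, so it is left out of the core.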