Notes on GSEA paper - bcb420-2024/Dien_Nguyen GitHub Wiki
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43):15545-50. doi: 10.1073/pnas.0506580102. Epub 2005 Sep 30. PMID: 16199517; PMCID: PMC1239896.
- GSEA considers experiments with genome-wide expression profiles from samples belonging to 2 classes, labeled 1 or 2. Genes are ranked by the correlation between their expression and the class distinction, which produces a ranked list L
- Goal of GSEA: For each gene set S, determine whether members of S are randomly distributed throughout L or primarily found at the top or bottom
-
Calculation of enrichment score (ES)
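- ES reflects the degree to which set S is overrepresented at the extremes (top or bottom) of the ranked list L
- Walk down L with a running-sum statistic: increase it when a gene is in S, decrease it when a gene is not in S. The size of each increase depends on the gene's correlation with the phenotype
- ES is the maximum deviation from zero encountered along the walk; it corresponds to a weighted Kolmogorov-Smirnov-like (KS-like) statistic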
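
The running sum described above can be sketched in Python with NumPy. This is a minimal illustration, not the GSEA software itself: the function name `enrichment_score` and its arguments are hypothetical, and the sketch assumes the input genes are already sorted from most positively to most negatively correlated. Hits are weighted by `|correlation| ** p`, with `p = 1` as the assumed default, and misses are weighted equally.

```python
import numpy as np

def enrichment_score(ranked_genes, correlations, gene_set, p=1.0):
    """Return (ES, running_sum) for one gene set over an already ranked list.

    ranked_genes : gene names, sorted by correlation with the phenotype
    correlations : correlation of each gene with the phenotype, same order
    gene_set     : set of gene names (the set S)
    p            : exponent that controls the weight of each hit (assumed default 1)
    """
    in_set = np.array([g in gene_set for g in ranked_genes])
    weights = np.abs(np.asarray(correlations, dtype=float)) ** p

    hit_weights = np.where(in_set, weights, 0.0)
    n_miss = int((~in_set).sum())
    if hit_weights.sum() == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    # Each hit raises the sum in proportion to its weight; each miss lowers it
    # by the same fixed amount, so the walk always returns to zero at the end.
    step_up = hit_weights / hit_weights.sum()
    step_down = (~in_set) / n_miss
    running = np.cumsum(step_up - step_down)

    # ES is the signed value at the point of maximum deviation from zero.
    es = running[np.argmax(np.abs(running))]
    return float(es), running

genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
corr = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es_top, walk_top = enrichment_score(genes, corr, {"g1", "g2"})
es_bottom, walk_bottom = enrichment_score(genes, corr, {"g5", "g6"})
```

A set concentrated at the top of the list gives a positive ES and a set concentrated at the bottom gives a negative ES; a set scattered through the list keeps the walk close to zero.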
-
Estimate significance level of ES
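- Use an empirical phenotype-based permutation test: permute the class labels, recompute the ES of the gene set for each permutation to build a null distribution, and compare the observed ES against it. The resulting significance level is known as the nominal P value.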
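
A label-permutation estimate of the nominal P value might look like the sketch below. Several pieces are assumptions made for illustration: the ranking metric `mean_difference` (difference of class means) stands in for whatever correlation metric is actually used, all function names are hypothetical, and the P value is taken from the portion of the null distribution that has the same sign as the observed ES.

```python
import numpy as np

def es_from_scores(scores, in_set, p=1.0):
    """ES of one gene set, given per-gene scores and a boolean membership mask."""
    order = np.argsort(-scores)                 # rank genes from highest to lowest score
    hits = in_set[order]
    w = np.where(hits, np.abs(scores[order]) ** p, 0.0)
    running = np.cumsum(w / w.sum() - (~hits) / (~hits).sum())
    return float(running[np.argmax(np.abs(running))])

def mean_difference(expr, labels):
    """Hypothetical ranking metric: per-gene mean of class 1 minus mean of class 2."""
    return expr[:, labels == 1].mean(axis=1) - expr[:, labels == 2].mean(axis=1)

def nominal_p_value(expr, labels, in_set, n_perm=1000, seed=0):
    """Return (observed ES, nominal P) using permutations of the class labels.

    expr   : genes x samples expression matrix
    labels : array of 1s and 2s, one entry per sample
    in_set : boolean array, True for genes that belong to the set S
    """
    rng = np.random.default_rng(seed)
    observed = es_from_scores(mean_difference(expr, labels), in_set)

    # Shuffling labels (not genes) keeps the gene-gene correlation structure intact.
    null = np.array([
        es_from_scores(mean_difference(expr, rng.permutation(labels)), in_set)
        for _ in range(n_perm)
    ])

    # Use only the part of the null distribution with the same sign as the observed ES.
    same_sign = null[null >= 0] if observed >= 0 else null[null < 0]
    if same_sign.size == 0:
        return observed, float("nan")
    return observed, float(np.mean(np.abs(same_sign) >= abs(observed)))

# Synthetic data: 50 genes, 10 samples, first 5 genes raised in class 1.
data_rng = np.random.default_rng(42)
labels = np.array([1] * 5 + [2] * 5)
expr = data_rng.normal(size=(50, 10))
expr[:5, labels == 1] += 3.0
in_set = np.zeros(50, dtype=bool)
in_set[:5] = True

observed_es, p_nominal = nominal_p_value(expr, labels, in_set, n_perm=200, seed=1)
```

With few samples the number of distinct label permutations is small, so the smallest achievable P value is limited; the synthetic data here only exercises the code path and says nothing about real expression data.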
-
Adjustment for multiple hypothesis testing
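- Normalize the ES for each gene set to account for the size of the set, resulting in a normalized enrichment score (NES)
- Calculate the false discovery rate (FDR) for each NES: the estimated probability that a gene set with that NES represents a false positive finding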
- Usually, not all members of a gene set participate in a biological process
- The leading-edge subset contains the genes in gene set S that appear in list L at or before the point where the running sum reaches its maximum deviation from 0. These genes form the core that accounts for the enrichment signal
- Examination of the leading-edge subset can reveal a biologically important subset within a gene set
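
The leading-edge definition above can be illustrated with a short Python sketch. The function name `leading_edge_subset` is hypothetical and the input list is assumed to be already ranked. The notes only define the subset for a peak reached from the top of the list; for a negative ES this sketch assumes the mirror image (set members after the peak, toward the bottom of the list).

```python
import numpy as np

def leading_edge_subset(ranked_genes, correlations, gene_set, p=1.0):
    """Return the members of gene_set that drive the enrichment signal."""
    genes = np.asarray(ranked_genes)
    in_set = np.isin(genes, list(gene_set))
    w = np.where(in_set, np.abs(np.asarray(correlations, dtype=float)) ** p, 0.0)

    # Same running sum as in the ES calculation: weighted steps up on hits,
    # equal steps down on misses.
    running = np.cumsum(w / w.sum() - (~in_set) / (~in_set).sum())
    peak = int(np.argmax(np.abs(running)))

    if running[peak] >= 0:
        # Positive ES: set members at or before the peak.
        members = genes[: peak + 1][in_set[: peak + 1]]
    else:
        # Negative ES (assumption): set members after the peak, toward the bottom.
        members = genes[peak + 1:][in_set[peak + 1:]]
    return members.tolist()

genes = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
corr = [4.0, 3.0, 2.0, 1.0, -1.0, -2.0, -3.0, -4.0]
top_edge = leading_edge_subset(genes, corr, {"g1", "g2", "g7"})
bottom_edge = leading_edge_subset(genes, corr, {"g7", "g8"})
```

In the first call the set member `g7` sits far from the peak and is left out of the leading edge, which matches the idea that only part of a gene set may contribute to the signal.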