Entry 10.1: GSEA Assignment - bcb420-2025/Chloe_Calica GitHub Wiki
Objective: Perform a GSEA preranked analysis given the ranked list comparing mesenchymal and immunoreactive ovarian cancer subtypes
Expected Time: 2 hrs
Actual Time: 1.5 hrs
GSEA Parameters:
- mesenchymal vs immuno rank file
- gene sets from the Bader Lab gene set collection from January 1, 2025 containing GO biological process, no IEA, and pathways.
- Used the most recent release from March 1, 2025 instead. See the error below.
- maximum geneset size of 200
- minimum geneset size of 15
- gene set permutation (?)
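To make the size limits above concrete, here is a minimal Python sketch of how a minimum of 15 and a maximum of 200 could be applied to a GMT file. It is not GSEA's own code. The function names (`parse_gmt`, `filter_by_size`) and the toy gene and set names are made up for illustration. It assumes the standard GMT layout (name, description, then genes, tab-separated) and counts only genes that also appear in the ranked list, which is how GSEA applies its size filters.

```python
def parse_gmt(lines):
    """Parse GMT text lines into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


def filter_by_size(gene_sets, ranked_genes, min_size=15, max_size=200):
    """Keep sets whose overlap with the ranked list is within the limits."""
    ranked = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        overlap = genes & ranked  # only genes present in the rank file count
        if min_size <= len(overlap) <= max_size:
            kept[name] = overlap
    return kept


def make_row(name, genes):
    """Build one hypothetical GMT row for the toy example."""
    return "\t".join([name, "toy description", *genes]) + "\n"


# Toy data: gene names G1..G300 are in the ranked list, X1..X20 are not.
ranked_genes = [f"G{i}" for i in range(1, 301)]
toy_gmt = [
    make_row("SMALL_SET", ["G1", "G2"]),
    make_row("OK_SET", [f"G{i}" for i in range(1, 21)]),
    make_row("BIG_SET", [f"G{i}" for i in range(1, 251)]),
    make_row("OFFLIST_SET", [f"X{i}" for i in range(1, 21)]),
]

kept = filter_by_size(parse_gmt(toy_gmt), ranked_genes)
print(sorted(kept))
```

Only `OK_SET` survives here: the small set has 2 genes, the big set has 250, and the off-list set has no genes in the ranked list at all.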
Assignment Questions
Reasoning For Parameters
- Maximum geneset size of 200
- Definition: Exclude larger sets. Default in GSEA is 500.
- Using a value of 200 reduces the number of large sets in our enrichment analysis. Large sets tend to dominate the results and mask the smaller, more specific pathways.
- By choosing 200, we avoid broad, less informative sets and redundant sets that overlap with multiple pathways.
- Minimum geneset size of 15
- Definition: Exclude smaller sets. Default in GSEA is 15.
- Very small gene sets are more susceptible to random noise. With fewer genes, the enrichment score becomes unstable and can be inflated.
- A minimum of 15 ensures that there are enough genes in the set to generate a stable score while also ensuring that the pathways we get are meaningful and not fragmented i.e. partial/incomplete pathways.
- Gene set permutation, number of permutations = 2000
- The number was not provided in the assignment. The GSEA tutorial says to use 100 and the lecture/paper on GSEA says 1000, but we were told to use 2000 when running our own dataset.
- Definition: This parameter is the number of times the gene sets are randomized to create the null distribution used to calculate the FDR.
- Picked 2000 since it is neither too large nor too small.
- Too few permutations can result in a poorly estimated null distribution.
- More permutations can improve the null distribution slightly, but the gain may not justify the added computational cost.
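One way to see why the permutation count matters is the resolution of an empirical p-value: with n permutations, the smallest nonzero value it can take is 1/n. The Python sketch below is a simplification for illustration only. The function names are hypothetical, and GSEA itself compares the observed ES only against the part of the null distribution with the same sign.

```python
def empirical_p(observed_score, null_scores):
    """Fraction of permuted scores at least as large as the observed score."""
    hits = sum(1 for s in null_scores if s >= observed_score)
    return hits / len(null_scores)


def smallest_nonzero_p(n_permutations):
    """Smallest nonzero empirical p-value that n permutations can produce."""
    return 1 / n_permutations


# Resolution for the three permutation counts mentioned in the notes above.
resolution = {n: smallest_nonzero_p(n) for n in (100, 1000, 2000)}
print(resolution)

# Toy null distribution: two of the four permuted scores reach the observed 3.0.
toy_p = empirical_p(3.0, [1.0, 2.0, 3.0, 4.0])
print(toy_p)
```

With 2000 permutations, a reported p-value of 0.0 means that no permuted score reached the observed one, so the true value is below 1/2000.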
Top Gene Sets in Ranked Lists
- Mesenchymal is na_pos (the first result) and Immunoreactive is na_neg (the second result).
Metric | Mesenchymal subtype | Immunoreactive subtype |
---|---|---|
Top Gene Set | HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION%MSIGDBHALLMARK%HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION | HALLMARK_INTERFERON_ALPHA_RESPONSE%MSIGDBHALLMARK%HALLMARK_INTERFERON_ALPHA_RESPONSE |
P-value | Nominal: 0.0, FWER: 0.0 | Nominal: 0.0, FWER: 0.0 |
ES | 0.86477774 | -0.8557666 |
NES | 2.5517595 | -2.9741802 |
FDR | 0.0 | 0.0 |
Genes in Leading Edge | 56% | 73% |
Top Gene | FBN1, Rank in List: 4, Rank Metric Score: 32.4, Running ES: 0.0234 | PROCR, Rank in List: 1960, Rank Metric Score: 2.513, Running ES: -0.1249 |
Running the GSEA Software
- I initially got this error when I loaded the files.
- Apparently there are repeated entries in the first column. I was not sure how to fix this, so I opted to use the most recent gene set file from the Bader Lab.
- Human_GOBP_AllPathways_noPFOCR_no_GO_iea_March_01_2025_symbol.gmt
- This one worked on the first try, so I went with it.
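For future reference, repeated entries in the first column of a tab-separated file (a GMT file or a rank file) could be checked and removed with a short Python sketch like the one below. It was not used for this assignment. The function names and toy rows are hypothetical, and keeping the first occurrence of each name is an assumption about the right fix, not something confirmed by the error message.

```python
from collections import Counter


def first_column_duplicates(lines):
    """Return {name: count} for first-column values that appear more than once."""
    names = [line.split("\t", 1)[0] for line in lines if line.strip()]
    return {name: count for name, count in Counter(names).items() if count > 1}


def drop_repeated_rows(lines):
    """Keep only the first row for each first-column value (assumed fix)."""
    seen = set()
    kept = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        key = line.split("\t", 1)[0]
        if key in seen:
            continue  # a later repeat of a name already kept
        seen.add(key)
        kept.append(line)
    return kept


# Toy rows in GMT layout: name, description, genes (all names made up).
toy_rows = [
    "SET_A\tdesc\tG1\tG2\n",
    "SET_B\tdesc\tG3\n",
    "SET_A\tdesc\tG4\n",
]

dups = first_column_duplicates(toy_rows)
deduped = drop_repeated_rows(toy_rows)
print(dups)
print(len(deduped))
```

To run this on a real file, the lines could be read with `open(path).readlines()` and the kept rows written back out with `writelines`. Whether dropping the later repeats is safe depends on whether the repeated rows really carry the same content.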