Entry 10.1: GSEA Assignment - bcb420-2025/Chloe_Calica GitHub Wiki

Objective: Perform a GSEA preranked analysis given the ranked list comparing mesenchymal and immunoreactive ovarian cancer subtypes

Expected Time: 2 hrs

Actual Time: 1.5 hrs

GSEA Parameters:

mesenchymal vs immuno rank file
genesets from the baderlab geneset collection from January 1, 2025 containing GO biological process, no IEA and pathways.
- Used the most recent release instead, from March 1, 2025. See error below.
maximum geneset size of 200
minimum geneset size of 15
gene set permutation (number of permutations not specified)

Maximum geneset size of 200
- Definition: Exclude gene sets larger than this size. The GSEA default is 500.
- Using 200 reduces the number of large sets in the enrichment analysis. Large sets tend to dominate the results and mask smaller, more specific pathways.
- Choosing 200 avoids broad, less informative sets and redundant sets that overlap with multiple pathways.
Minimum geneset size of 15
- Definition: Exclude gene sets smaller than this size. The GSEA default is 15.
- Very small gene sets are more susceptible to random noise. With fewer genes, the enrichment score becomes unstable and can be inflated.
- A minimum of 15 ensures that there are enough genes in the set to generate a stable score and that the pathways returned are meaningful rather than fragmented, i.e. partial or incomplete pathways.
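The size limits above can be sketched in a few lines of Python. This is only an illustration, not the GSEA implementation: the function name `filter_gene_sets` and the toy data are made up here, and it assumes the size is counted after restricting each set to genes present in the ranked list.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size=15, max_size=200):
    """Keep gene sets whose overlap with the ranked list has min_size..max_size genes.

    gene_sets: dict mapping set name -> iterable of gene symbols (hypothetical input)
    ranked_genes: iterable of gene symbols from the rank file
    """
    universe = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        overlap = set(genes) & universe  # only genes that actually appear in the ranked list
        if min_size <= len(overlap) <= max_size:
            kept[name] = overlap
    return kept


# Toy data: 300 ranked genes and four made-up gene sets.
ranked = [f"G{i}" for i in range(300)]
toy_sets = {
    "tiny": ranked[:5],                                            # below the minimum
    "ok": ranked[:50],                                             # within 15..200
    "huge": ranked[:250],                                          # above the maximum
    "mostly_missing": ranked[:10] + [f"X{i}" for i in range(40)],  # only 10 genes overlap
}
kept = filter_gene_sets(toy_sets, ranked)
print(sorted(kept))
```

Only the set named "ok" survives here. The "mostly_missing" set lists 50 genes but is dropped because only 10 of them are in the ranked list.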
Gene set permutation = 2000:
- The number was not provided in the assignment. The GSEA tutorial says to use 100 and the lecture/paper on GSEA says 1000, but 2000 was recommended when running our own dataset.
- Definition: This parameter is the number of times the gene sets are randomized to create the null distribution used to calculate the FDR.
- Picked 2000 as a middle ground between too few and too many permutations.
  - Too few permutations can result in a poorly estimated null distribution.
  - More permutations can improve the null distribution slightly, but the gain may not justify the added computational cost.
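To make the role of the permutation count concrete, here is a minimal sketch of a gene set permutation test on a ranked list. It is not the GSEA code: the function names, the toy ranked list, and the choice of weight 1 for the running sum are assumptions for illustration. The null distribution is built by drawing random gene sets of the same size as the real one.

```python
import random


def enrichment_score(ranked, scores, gene_set):
    """Weighted running-sum enrichment score (weight = 1) for one gene set."""
    hits = [g in gene_set for g in ranked]
    hit_total = sum(abs(s) for s, h in zip(scores, hits) if h)
    n_miss = len(ranked) - sum(hits)
    running, best = 0.0, 0.0
    for s, h in zip(scores, hits):
        # Step up on a hit (scaled by the rank metric), step down on a miss.
        running += abs(s) / hit_total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running  # ES is the largest deviation from zero
    return best


def gene_set_permutation_p(ranked, scores, gene_set, n_perm=2000, seed=0):
    """Nominal p-value from a null of random gene sets of the same size."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked, scores, gene_set)
    null = [
        enrichment_score(ranked, scores, set(rng.sample(ranked, len(gene_set))))
        for _ in range(n_perm)
    ]
    # Compare only against null scores with the same sign as the observed ES.
    same_sign = [es for es in null if (es >= 0) == (observed >= 0)]
    extreme = sum(1 for es in same_sign if abs(es) >= abs(observed))
    p = extreme / len(same_sign) if same_sign else float("nan")
    return observed, p


# Toy ranked list: 200 made-up genes with scores running from 10 down to -9.9.
ranked = [f"G{i}" for i in range(200)]
scores = [10 - 0.1 * i for i in range(200)]
top_set = set(ranked[:15])  # a set made entirely of the top-ranked genes

observed, p_value = gene_set_permutation_p(ranked, scores, top_set, n_perm=2000)
print(observed, p_value)
```

With `n_perm` permutations the smallest nonzero p-value that can be observed is roughly 1 / `n_perm`, so with 2000 permutations a reported 0.0 is better read as below about 1/2000 = 0.0005 than as exactly zero.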

Mesenchymal was treated as na_pos (the first result) and Immunoreactive as na_neg (the second result).

| | Mesenchymal subtype | Immunoreactive subtype |
|---|---|---|
| Top Gene Set | HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION% MSIGDBHALLMARK% HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION | HALLMARK_INTERFERON_ALPHA_RESPONSE% MSIGDBHALLMARK% HALLMARK_INTERFERON_ALPHA_RESPONSE |
| P-value | Nominal: 0.0, FWER: 0.0 | Nominal: 0.0, FWER: 0.0 |
| ES | 0.86477774 | -0.8557666 |
| NES | 2.5517595 | -2.9741802 |
| FDR | 0.0 | 0.0 |
| Genes in Leading Edge | 56% | 73% |
| Top Gene | FBN1 (Rank in List: 4, Rank Metric Score: 32.4, Running ES: 0.0234) | PROCR (Rank in List: 1960, Rank Metric Score: 2.513, Running ES: -0.1249) |

I initially got this error when I loaded the files.
Apparently there are repeats in the first column. I wasn't sure how to fix it, so I opted to use the most recent gene set file from the Bader Lab.
- Human_GOBP_AllPathways_noPFOCR_no_GO_iea_March_01_2025_symbol.gmt
- This one worked on the first try, so I went with it.
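For reference, one way the repeats could be checked is sketched below. The error message itself is not reproduced here, so this assumes that "repeats in the first column" means duplicate gene set names in the first tab-separated column of the .gmt file. The helper names and the toy lines are hypothetical.

```python
from collections import Counter


def duplicate_set_names(gmt_lines):
    """Return names that appear more than once in the first tab-separated column."""
    names = [line.split("\t", 1)[0] for line in gmt_lines if line.strip()]
    return sorted(name for name, count in Counter(names).items() if count > 1)


def drop_repeated_names(gmt_lines):
    """Keep only the first line seen for each gene set name."""
    seen, kept = set(), []
    for line in gmt_lines:
        if not line.strip():
            continue  # skip blank lines
        name = line.split("\t", 1)[0]
        if name not in seen:
            seen.add(name)
            kept.append(line)
    return kept


# Toy GMT content (name, description, then genes), with one repeated name.
toy_gmt = [
    "SET_A\tfirst description\tGENE1\tGENE2",
    "SET_B\tanother set\tGENE3\tGENE4",
    "SET_A\trepeated name\tGENE5",
]
print(duplicate_set_names(toy_gmt))
deduplicated = drop_repeated_names(toy_gmt)
print(len(deduplicated))
```

To run this on a real file, the lines could be read with `open(path).read().splitlines()`. Keeping the first occurrence is an arbitrary choice. Whether the repeated entries should be dropped or merged depends on what they contain.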