GSEA - bcb420-2023/Angela_Uzelac GitHub Wiki

Objective

become familiar with using GSEA (one of the methods for Gene Set Enrichment Analysis)

Duration

Expected: 1hr

Actual: 3.5 hrs

Procedure

To install GSEA:

went to GSEA download site
downloaded for windows
kept telling me the download failed because virus detected
fixed this by setting ScanWithAntiVirus to 1. Instructions found here
restarted laptop
successfully downloaded
set back to 3 in the same way and restart laptop again

To perform GSEA pre-ranked analysis:

in GSEA, under tools there is "Run GSEA Preranked", clicked that
take the ranked gene list comparing mesenchymal and immunoreactive ovarian cancer subtypes (mesenchymal genes: positive scores, immunoreactive genes: negative score):
- mesenchymal vs immuno rank file
- went into the student wiki repo and downloaded files as zip, then extracted MesenvsImmuno_RNASeq_ranks.rnk
get genesets from baderlab geneset collection from current release containing GO biological process, all pathways, no IEA
- ran the following code in R to get this file

# setwd("C:/Users/angel/BCB420")

install.packages("RCurl")
library("RCurl")

gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred
# from electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(getwd(), gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)

* now have file called Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt

load this file into GSEA
- nav bar on left side click Load Data
- browse files and select the file above
go back to GSEA preranked
click dots on the side of gene sets database field
go to Local GMT tab
select the file that was loaded
go to Load Data again and choose the mesenchymal vs immuno rank file
go back to GSEA preranked
choose this file as Ranked List
instructions say to select gene set permutation, but in user it says that In GSEAPreranked, permutations are always done by gene set. So was left as is.
select collapse field to be No_Collapse, otherwise will require a chip platform
expand basic fields
- set max geneset size to 200
- set min geneset size to 15
click run
clicked on the results in GSEA and took me to a summary file of the results
clicked on detailed results in HTML to see the top gene sets for the sub types
clicked on details in the second column first row of this table to get the rest of the answers for question 2
looked at number of genes where core enrichment is yes to see how many genes in leading edge

Results

Top of results

Enrichment in phenotype: na

3248 / 5749 gene sets are upregulated in phenotype na_pos

1507 gene sets are significant at FDR < 25%

854 gene sets are significantly enriched at nominal pvalue < 1%

1196 gene sets are significantly enriched at nominal pvalue < 5%

Snapshot of enrichment results

Detailed enrichment results in html format

Detailed enrichment results in TSV format (tab delimited text)

Guide to interpret results

Enrichment in phenotype: na

2501 / 5749 gene sets are upregulated in phenotype na_neg

1397 gene sets are significantly enriched at FDR < 25%

753 gene sets are significantly enriched at nominal pvalue < 1%

1079 gene sets are significantly enriched at nominal pvalue < 5%

Snapshot of enrichment results

Detailed enrichment results in html format

Detailed enrichment results in TSV format (tab delimited text)

Guide to interpret results

1. Explain the reasons for using each of the above parameters.

used our own gene set database because it grabs from many diff sources, including GO which updates monthly so this compiles monthly too. This is important because annotations are changing constantly. Also uses other pathway databases like WikiPathways. It's better to use this because it has a lot more pathways and is more robust, updated frequently.
want to use most recent annotations so took current_release
chose GO bp because we are interested in pathways
don't want IEA because the results are not as strong, don't want what's inferred
decreased gene set size: a lot of pathways that are huge and don't offer much for annotation, we want more specific gene sets, smaller and more meaningful results
gene set permutation because we're not using their phenotype permutation because it's not optimized for RNASeq data

2. What is the top gene set returned for the Mesenchymal sub type? What is the top gene set returned for the Immunoreactive subtype? For each of the genesets answer the below questions:

a) What is its pvalue, ES, NES and FDR associated with it. b) How many genes in its leading edge? c) What is the top gene associated with this geneset.

We had Mesenchymal as positive and Immunoreactive as negative

Mesenchymal

top gene set: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION

p-value: 0.0

ES: 0.86580086

NES: 2.5675566

FDR: 0.0

Number of genes in leading edge: 83

top gene associated with the gene set: FBN1?

Immunoreactive

top gene set: HALLMARK_INTERFERON_ALPHA_RESPONSE

pvalue: 0.0

ES: -0.8560622

NES: -2.8804908

FDR: 0.0

Number of genes in leading edge: 57

top gene associated with the gene set: PROCR?

Conclusion and Outlook

Question: how do you know what the top gene associated with the gene set is?
GSEA is great tool for non-thresholded ORA