GSEA - bcb420-2023/Angela_Uzelac GitHub Wiki

Objective

  • become familiar with using GSEA (one of the methods for Gene Set Enrichment Analysis)

Duration

Expected: 1hr

Actual: 3.5 hrs

Procedure

To install GSEA:

  • went to GSEA download site
  • downloaded for windows
  • kept telling me the download failed because virus detected
  • fixed this by setting ScanWithAntiVirus to 1. Instructions found here
  • restarted laptop
  • successfully downloaded
  • set back to 3 in the same way and restart laptop again

To perform GSEA pre-ranked analysis:

  • in GSEA, under tools there is "Run GSEA Preranked", clicked that
  • take the ranked gene list comparing mesenchymal and immunoreactive ovarian cancer subtypes (mesenchymal genes: positive scores, immunoreactive genes: negative score):
  • get genesets from baderlab geneset collection from current release containing GO biological process, all pathways, no IEA
    • ran the following code in R to get this file
# setwd("C:/Users/angel/BCB420")

install.packages("RCurl")
library("RCurl")

gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred
# from electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(getwd(), gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)
* now have file called Human_GOBP_AllPathways_no_GO_iea_March_02_2023_symbol.gmt
  • load this file into GSEA
    • nav bar on left side click Load Data
    • browse files and select the file above
  • go back to GSEA preranked
  • click dots on the side of gene sets database field
  • go to Local GMT tab
  • select the file that was loaded
  • go to Load Data again and choose the mesenchymal vs immuno rank file
  • go back to GSEA preranked
  • choose this file as Ranked List
  • instructions say to select gene set permutation, but in user it says that In GSEAPreranked, permutations are always done by gene set. So was left as is.
  • select collapse field to be No_Collapse, otherwise will require a chip platform
  • expand basic fields
    • set max geneset size to 200
    • set min geneset size to 15
  • click run
  • clicked on the results in GSEA and took me to a summary file of the results
  • clicked on detailed results in HTML to see the top gene sets for the sub types
  • clicked on details in the second column first row of this table to get the rest of the answers for question 2
  • looked at number of genes where core enrichment is yes to see how many genes in leading edge

Results

Top of results

Enrichment in phenotype: na

3248 / 5749 gene sets are upregulated in phenotype na_pos

1507 gene sets are significant at FDR < 25%

854 gene sets are significantly enriched at nominal pvalue < 1%

1196 gene sets are significantly enriched at nominal pvalue < 5%

Snapshot of enrichment results

Detailed enrichment results in html format

Detailed enrichment results in TSV format (tab delimited text)

Guide to interpret results

Enrichment in phenotype: na

2501 / 5749 gene sets are upregulated in phenotype na_neg

1397 gene sets are significantly enriched at FDR < 25%

753 gene sets are significantly enriched at nominal pvalue < 1%

1079 gene sets are significantly enriched at nominal pvalue < 5%

Snapshot of enrichment results

Detailed enrichment results in html format

Detailed enrichment results in TSV format (tab delimited text)

Guide to interpret results

1. Explain the reasons for using each of the above parameters.

  • used our own gene set database because it grabs from many diff sources, including GO which updates monthly so this compiles monthly too. This is important because annotations are changing constantly. Also uses other pathway databases like WikiPathways. It's better to use this because it has a lot more pathways and is more robust, updated frequently.
  • want to use most recent annotations so took current_release
  • chose GO bp because we are interested in pathways
  • don't want IEA because the results are not as strong, don't want what's inferred
  • decreased gene set size: a lot of pathways that are huge and don't offer much for annotation, we want more specific gene sets, smaller and more meaningful results
  • gene set permutation because we're not using their phenotype permutation because it's not optimized for RNASeq data

2. What is the top gene set returned for the Mesenchymal sub type? What is the top gene set returned for the Immunoreactive subtype? For each of the genesets answer the below questions:

a) What is its pvalue, ES, NES and FDR associated with it. b) How many genes in its leading edge? c) What is the top gene associated with this geneset.

We had Mesenchymal as positive and Immunoreactive as negative

Mesenchymal

top gene set: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION

p-value: 0.0

ES: 0.86580086

NES: 2.5675566

FDR: 0.0

Number of genes in leading edge: 83

top gene associated with the gene set: FBN1?

Immunoreactive

top gene set: HALLMARK_INTERFERON_ALPHA_RESPONSE

pvalue: 0.0

ES: -0.8560622

NES: -2.8804908

FDR: 0.0

Number of genes in leading edge: 57

top gene associated with the gene set: PROCR?

Conclusion and Outlook

  • Question: how do you know what the top gene associated with the gene set is?
  • GSEA is great tool for non-thresholded ORA