6. GSEA - bcb420-2022/Yuzi_Li GitHub Wiki

Objective

Practice using GSEA on an example data set.

Duration

Expected duration: 2h Actual duration: 3h

Progress

Tasks

Run GSEA on designated ranked list of genes
Answer questions

Running GSEA on genes

First I installed required packages:

#install required R and bioconductor packages
tryCatch(expr = { library("RCurl")}, 
         error = function(e) {  install.packages("RCurl")}, 
         finally = library("RCurl"))
#use library
#make sure biocManager is installed
tryCatch(expr = { library("BiocManager")}, 
         error = function(e) { 
           install.packages("BiocManager")}, 
         finally = library("BiocManager"))
tryCatch(expr = { library("ggplot2")}, 
         error = function(e) { install.packages("ggplot2")}, 
         finally = library("ggplot2"))
#use easy cyRest library to communicate with cytoscape.
tryCatch(expr = { library("RCy3")}, 
         error = function(e) { BiocManager::install("RCy3")}, 
         finally = library("RCy3"))

Then I ran GSEA using the following code:

gsea_jar <- '/home/rstudio/GSEA_4.2.3/gsea-cli.sh'
java_version <- '11'
#Gsea takes a long time to run.  If you have already run GSEA manually or previously there is no need to re-run GSEA.  Make sure the 
# gsea results are in the current directory and the notebook will be able to find them and use them.
#navigate to the directory where you put the downloaded protocol files.
working_dir <- getwd()
# leave blank if you want the notebook to discover the gsea directory for itself
#gsea_directory = paste(working_dir,"Mesen_vs_Immuno.GseaPreranked.1497635459262",sep="/") 
analysis_name <- 'mesen_vs_immuno'
rnk_file <- "mesen_vs_immuno_rank_file.rnk"
dest_gmt_file <- 'Human_GOBP_AllPathways_no_GO_iea_March_01_2021_symbol.gmt'

command <- paste("", gsea_jar,  "GSEAPreRanked -gmx", dest_gmt_file, "-rnk", 
                 file.path(working_dir, rnk_file), "-collapse false -nperm 1000 -scoring_scheme weighted -rpt_label ",
                 analysis_name,"  -plot_top_x 20 -rnd_seed 12345  -set_max 200 -set_min 15 -zip_report false -out" ,
                 working_dir, " > gsea_output.txt",sep=" ")
system(command)

I ran into the following problem: permission denied

> system(command)
sh: 1: /home/rstudio/GSEA_4.2.3/gsea-cli.sh: Permission denied

To fix this, I changed the file permission:

system('chmod 777 /home/rstudio/GSEA_4.2.3/gsea-cli.sh')

Answering questions

Explain the reasons for using each of the above parameters.

Using a geneset containing GO biological process, no IEA and pathways: this gene set is updated on a monthly basis, so it should contain more up-to-date info compared to the GSEA gene sets. GO BP is most suitable for our purpose because we want to know which biological pathways are enhanced/suppressed by the different cell types. We do not want to use IEA unless we do not get enough hits, because including IEA could add noise to our analysis by including potentially inaccurate annotations.
Maximum geneset size of 200: this makes sure that the genesets returned are not overly general
Minimum geneset size of 15: this makes sure that the genesets returned are not overly specific
Gene set permutation: 1000 gene set permutation is used to calculate how significant the NES value is from the random distribution of NES values

What is the top gene set returned for the Mesenchymal sub type?

HALLMARK EPITHELIAL MESENCHYMAL TRANSITION MSIGDB C2 HALLMARK EPITHELIAL MESENCHYMAL TRANSITION

What is its pvalue, ES, NES and FDR associated with it.

p-value: 0.0
ES: 0.8635254
NES: 2.5444539
FDR: 0.0

How many genes in its leading edge?

What is the top gene associated with this geneset.

FBN1

What is the top gene set returned for the Immunoreactive subtype?

HALLMARK INTERFERON ALPHA RESPONSE MSIGDB C2 HALLMARK INTERFERON ALPHA RESPONSE

What is its pvalue, ES, NES and FDR associated with it.

p-value: 0.0
ES: -0.85694104
NES: -2.8643172
FDR: 0.0

How many genes in its leading edge?

What is the top gene associated with this geneset.

GBP4