Assignment 3 Journal - bcb420-2023/Angela_Uzelac GitHub Wiki

Assignment 3 - Data set Pathway and Network Analysis

Objective

document progress while working on Assignment 3

Duration

Day 1: 2 hrs

Day 2: 5 hrs

Day 3: 4 hrs

Day 4: 4 hrs

Day 5: 10 hrs

Procedure

Preparation

re-ran my Assignment 2 code and wrote the qlf_output_hits_withgn table and the normalized_count_data table each to a TSV file

Non-thresholded gene set enrichment analysis

Retrieving Gene sets

referred to my GSEA journal for procedure on how to perform pre-ranked analysis
ran the following code to get the gmt file
this gets the gene sets from the current release of the Bader Lab gene set collection: GO biological process plus all pathways, excluding IEA annotations

# setwd("C:/Users/angel/BCB420")

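# install RCurl only if it is missing, so re-running the script does not reinstall it
if (!requireNamespace("RCurl", quietly = TRUE)) {
  install.packages("RCurl")
}
library("RCurl")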

gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
# list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has GO biological process and all pathways, and does not
# include terms inferred from electronic annotations (IEA)
rx = gregexpr("(?<=<a href=\")(.*GOBP_AllPathways_no_GO_iea.*\\.gmt)(?=\">)", contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
# stop early if the listing did not yield exactly one matching file
stopifnot(length(gmt_file) == 1)
dest_gmt_file <- file.path(getwd(), gmt_file)
download.file(paste(gmt_url, gmt_file, sep = ""), destfile = dest_gmt_file)

this is saved on my home machine at C:\Users\angel\BCB420\Human_GOBP_AllPathways_no_GO_iea_April_02_2023_symbol.gmt, and at the equivalent location in the projects folder on the docker container

Creating ranked gene list file

read file qlf_output_hits_withgn.tsv into variable
calculated rank for each gene then ordered the genes by rank
wrote ranked gene list to a file ranked_genelist.rnk
- note: the contents are tab-delimited, but the file name must end in .rnk so that GSEA treats it as a ranked list
used htmltools::includeHTML() to display the summary of results from GSEA in the report
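
The ranking step above could look roughly like the sketch below. The journal does not record the rank formula, so the signed -log10(p-value) score used here is an assumption (a common choice for pre-ranked GSEA), and the column names hgnc_symbol, logFC and PValue, as well as the toy values, are hypothetical stand-ins for the contents of qlf_output_hits_withgn.tsv.

```r
# Toy stand-in for qlf_output_hits_withgn.tsv; column names and values are
# assumptions made for illustration only.
hits <- data.frame(
  hgnc_symbol = c("GENE_A", "GENE_B", "GENE_C"),
  logFC       = c(1.5, -2.0, 0.3),
  PValue      = c(0.001, 0.0001, 0.5),
  stringsAsFactors = FALSE
)

# Assumed rank: significance signed by direction of change, so strongly
# upregulated genes get large positive scores and strongly downregulated
# genes get large negative scores.
hits$rank <- -log10(hits$PValue) * sign(hits$logFC)

# Order genes from most positive to most negative rank.
ranked <- hits[order(hits$rank, decreasing = TRUE), c("hgnc_symbol", "rank")]

# Write a tab-delimited file without quotes or row names, ending in .rnk.
rnk_path <- file.path(tempdir(), "ranked_genelist.rnk")
write.table(ranked, file = rnk_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = TRUE)
```

In the real workflow, hits would be read with read.delim() from the TSV written in the Preparation step, and the output path would point at the project folder rather than tempdir().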

Run GSEA

input the gmt file and the rank file into GSEA
set all parameters then ran GSEA
opened the results file in HTML
compared results between upregulated in na_pos and upregulated in na_neg
- double check: what is na_pos and what is na_neg??
embedded summary of results into report
went into the detailed results for both phenotypes and compared them to the results from g:Profiler - pretty different results
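
To make the comparison with g:Profiler less manual, the top terms of each GSEA report could be pulled out in R. This is only a sketch: the data frame below is a toy stand-in for one report table, and the column names NAME, NES and "FDR q-val" are assumed to match the report headers, so they should be checked against the actual files in the results folder.

```r
# Toy stand-in for one GSEA report table (for example the na_pos report).
# In practice it would be loaded with read.delim(<report file>, check.names = FALSE);
# the column names here are assumptions.
report <- data.frame(
  NAME        = c("SET_A", "SET_B", "SET_C"),
  NES         = c(2.1, 1.4, 1.9),
  `FDR q-val` = c(0.001, 0.20, 0.03),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Keep gene sets that pass the same 0.05 q-value cutoff used for the map,
# then order them by absolute normalized enrichment score.
sig <- report[report[["FDR q-val"]] < 0.05, ]
sig <- sig[order(abs(sig$NES), decreasing = TRUE), ]
top_terms <- sig$NAME
```

The resulting top_terms vector can then be compared against the term names from the thresholded analysis, for example with intersect().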

Visualize in cytoscape

followed instructions in the enrichment map pipeline resource
apps > app manager > install enrichment map pipeline. then click apps > enrichment map
loaded the entire folder with GSEA results
this automatically input the required files into required fields
report for na_pos into Enrichments Pos field, same for na_neg
GMT file is the ORIGINAL gmt file, not the filtered one
analysis type: GSEA
input ranks file: ranked_gene_list_na_pos_versus_na_neg...
q value cutoff 0.05, check field filter genes by expressions, then click build
- this is the standard threshold
to check the number of nodes and edges: see the node table and edge table at the bottom of cytoscape, click export, export to csv, then open the csv in excel and check the number of rows (minus 1, because the first row is the heading)
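
As an alternative to counting rows in excel, the exported tables can be counted in R. A minimal sketch, assuming the node and edge tables were exported as CSV with a header row; the file names and column names below are hypothetical and the files are toy stand-ins.

```r
# Toy CSVs standing in for the node and edge tables exported from cytoscape.
node_csv <- file.path(tempdir(), "node_table.csv")
edge_csv <- file.path(tempdir(), "edge_table.csv")
writeLines(c("name,NES", "SET_A,2.1", "SET_B,-1.8", "SET_C,1.9"), node_csv)
writeLines(c("name,similarity", "SET_A (interacts) SET_B,0.4"), edge_csv)

# read.csv() treats the first row as the header by default, so nrow()
# already excludes it and no manual "minus 1" is needed.
n_nodes <- nrow(read.csv(node_csv))
n_edges <- nrow(read.csv(edge_csv))
```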

Auto Annotate

Apps > Auto Annotate > New Annotation Set > Create Annotations
to change the names of the themes: Apps > Word Cloud > Show Word Cloud
increase the normalization factor to get rid of words like pathway or regulation
manually exclude words
manually change the theme names

Publication Ready Figure

could follow protocol in EnrichmentMap Pipeline at the end of the Navigating and interpreting the enrichment map slide
in AutoAnnotate tab just clicked "Publication Ready"
downloaded svg of legend example then edited in powerpoint then added it manually to the publication-ready figure

Collapse into themes

Edit > Preferences > Group preferences and select "Enable attribute aggregation"
in menu of auto annotate: collapse all
view > show tool panel > move the scale slider to the left to pull the nodes closer together
manually moved them around to put similar themes together

Results

GSEA

Some of the top terms for genes that are upregulated in schizophrenia were "TYROBP CAUSAL NETWORK IN MICROGLIA", "RHO GTPASES ACTIVATE WASPS AND WAVES", and "REGULATION OF PHAGOCYTOSIS".
The top terms for genes that are downregulated in disease were "COLLAGEN CHAIN TRIMERIZATION", "ASSEMBLY OF COLLAGEN FIBRILS AND OTHER MULTIMERIC STRUCTURES", and "SYNAPTIC_VESICLE_TRAFFICKING".
not really similar to the g:Profiler results, but also not a straightforward comparison

Visualizing GSEA results

only about 27 nodes, is this enough? heatmap not really showing up, don't have columns of the samples, what is wrong?
only 1 red (upregulated), rest are blue
changing the annotations on wordcloud is not working
followed protocol and my map does not look like the one in the slides
heatmap not showing up
- remember to remove quotes in the txt file because sometimes doesn't recognize things
April 17: fixed all problems above.
Number of nodes in the enrichment map: 67
Number of edges in the enrichment map: 89
approx half blue half red. not too interconnected and not too many nodes

Pathway Analysis

apps > install WikiPathways
import > import from database > choose WikiPathways > type TYROBP causal network in microglia > choose homo sapiens > import as pathway
import > import table from file > qlf output hits file that has log fc and p value
- remember to remove quotes in the txt file because sometimes doesn't recognize things
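
The quote removal noted above can be done in R instead of by hand. A minimal sketch, using a toy quoted file as a stand-in for the qlf output hits file; the file names, column names and values are hypothetical.

```r
# Toy file standing in for a hits table that was written with quotes.
quoted_path <- file.path(tempdir(), "hits_quoted.txt")
writeLines(c('"hgnc_symbol"\t"logFC"\t"PValue"',
             '"TYROBP"\t1.2\t0.001'), quoted_path)

# Strip every double quote and write the cleaned lines to a new file,
# leaving the original untouched.
clean_lines <- gsub('"', "", readLines(quoted_path), fixed = TRUE)
clean_path <- file.path(tempdir(), "hits_clean.txt")
writeLines(clean_lines, clean_path)
```

If the table is still in the R session, writing it again with write.table(..., quote = FALSE) avoids the quotes in the first place.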

Conclusion and Outlook

results are broadly consistent with Assignment 2
results are consistent with the original paper, which also talked about synaptic vesicle trafficking and dopamine synthesis regulation
dysregulation of these two has been shown to lead to psychotic symptoms