4. Assignment 3 - bcb420-2022/Inika_Prasad GitHub Wiki

5. Enrichment Map

  • Cytoscape Version: 3.9.1

  • If you are using GSEAPreranked and would like to see the expression values in the heat map instead of the ranks, change this default setting. How do I change the default setting?

  • P-value cutoff: 0.005

  • FDR Q-value cutoff: 0.05

  • Jaccard Coefficient vs Overlap Coefficient vs Jaccard + Overlap Combined

  • Default: Overlap Coefficient

  • The .tsv files aren't recognized by Cytoscape. Open with Excel and save as .xls files

  • Error: Parsing Generic Result File. Index 1 out of bounds for Length 1
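To make the similarity options above concrete, here is a minimal Python sketch of the three coefficients (the journal itself worked in R and Cytoscape; this is only an illustration). The function names, the example gene symbols, and the weight `k = 0.5` are assumptions for the sketch, not values taken from the EnrichmentMap settings used here.

```python
def jaccard(a, b):
    """Jaccard coefficient: shared genes divided by all genes in either set."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Overlap coefficient: shared genes divided by the size of the smaller set."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def combined(a, b, k=0.5):
    """Weighted mix of the two; k is an assumed weight, not a documented default."""
    return k * overlap(a, b) + (1 - k) * jaccard(a, b)


# Hypothetical gene sets: the second is fully contained in the first.
set_a = {"TP53", "BRCA1", "EGFR", "MYC"}
set_b = {"TP53", "BRCA1"}
print(jaccard(set_a, set_b), overlap(set_a, set_b), combined(set_a, set_b))
```

When one gene set is nested inside a larger one, the overlap coefficient stays high while Jaccard drops, which is why the choice changes how many edges appear between nodes of very different sizes.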

Lost some unsaved journal entries and work due to my computer crashing, but have redone it now. In summary:

  • Use AutoAnnotate to create clusters. WordCloud is used to group together similar terms. A lot of renaming is required to make the clusters readable and representative.

  • Many clusters represent the same biological process, so I deleted some labels and grouped those clusters together.

  • Recurrent and Primary networks are very well separated, no interconnections! Strange but interesting. I have therefore saved the networks as PDFs separately.

  • Some of the literature I had found before my laptop crashed:

  • A weird thing: how are the node cluster calculations done? At one point I made this note:

The only theme encapsulating nodes from both tumor types is the PID IL23 pathway. PID (Pathway Interaction Database) cites IL23 as an inflammatory cytokine. IL-23 signalling has been shown to play a role in the progression of pre-malignant oral lesions to cancer. Perhaps it plays a similar role in the development of both primary and recurrent tumors. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0196034

But this theme did not come up when I re-made the Enrichment Map. I wonder why.

To cite your use of the GSEA software please reference the following: Subramanian, A., Tamayo, P., et al. (2005, PNAS). Mootha, V. K., Lindgren, C. M., et al. (2003, Nature Genetics).

4. Making my own ranked list

  • Does the p-value in the following formula mean adjusted p-value? Pre- or post-Benjamini-Hochberg correction? -log10(pvalue) * sign(logFC)

  • Probably means the adjusted one, since that's what we used to make decisions about significance in the thresholded gene analysis.

  • I did not have a list of the adjusted p-values from Assignment 2, since I used the function decideTestsDGE, which gives a matrix where each gene is assigned 1, 0, or -1 depending on whether it's significantly upregulated, not significant, or significantly downregulated.

  • Found a function to adjust p-values and actually see the adjusted values: p.adjust. The numbers of significantly up- and downregulated genes match the results from decideTestsDGE.

  • Make a new dataframe with gene names, adjusted p-values, log FC, and a column with -log10(pvalue) * sign(logFC)

  • Figured out how to save a .rnk file: use the write.table function and name the output file with .rnk

  • But GSEA gives the error: Parsing trouble java.lang.NumberFormatException for input string "ranks"

    • The error message mentions a lot about floating decimal... maybe something there?
    • Looking at these EM Tutorials from Bader Lab, it seems maybe it's important to "unlist" the columns-to-be in the rank file. Update: it worked.
  • The rank file's numbers are much smaller than the numbers in the Mesenchymal Immuno dataset... I wonder why. Could it be that we don't use adjusted p-values? Looking at the EM Tutorials from Bader Lab, it seems that the p-values from the exact test function are used non-adjusted.

  • Running the GSEA starting 22:39.

  • The genes should be sorted in order of decreasing rank since GSEA just walks up and down the list. But the order function is considering 9.64327466553287e-17 greater than 9.43563325147318... how do I fix this? Answer: use the sort function.

  • GSEA run done :)
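The ranking steps in this entry can be sketched as follows. This is a minimal Python illustration, not the R code actually used: the helper names (`signed_score`, `build_rank_rows`, `rnk_text`) and the gene names are hypothetical, and leaving out a header row is a simplifying assumption rather than a confirmed requirement. Values are converted with `float()` before sorting because a likely cause of the odd ordering noted above is that numbers stored as text are compared character by character.

```python
import math


def signed_score(p_value, log_fc):
    """Rank score from the journal: -log10(p-value) * sign(logFC).

    Raises ValueError if p_value is 0, since log10(0) is undefined.
    """
    sign = (log_fc > 0) - (log_fc < 0)  # 1, 0 or -1
    return -math.log10(p_value) * sign


def build_rank_rows(genes):
    """genes: iterable of (name, p_value, log_fc); values may arrive as text."""
    rows = [(name, signed_score(float(p), float(fc))) for name, p, fc in genes]
    # Sort on the numeric score, highest first, so the list runs from the
    # most upregulated genes to the most downregulated ones.
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


def rnk_text(rows):
    """Two tab-separated columns, one gene per line, no header (an assumption)."""
    return "".join(f"{name}\t{score}\n" for name, score in rows)


# Hypothetical input, with numbers deliberately given as strings.
example = [("GENE_B", "0.5", "-1.0"), ("GENE_A", "1e-5", "2.0"), ("GENE_C", "1e-3", "-3.0")]
print(rnk_text(build_rank_rows(example)))

# Why text sorts wrongly: as strings, "9.6e-17" compares greater than "9.4".
print(max(["9.6e-17", "9.4"]), max(float(x) for x in ["9.6e-17", "9.4"]))
```

The string returned by `rnk_text` could then be written to a file whose name ends in .rnk.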

3. Doing the GSEA assignment to get familiar with the GSEA app

  • .rnk files should be converted to text files
  • you need a file with your genes and ranks (.txt) and a geneset (you can choose one from GSEA itself or get something like the BaderLab geneset)
  • For RNA-seq data, use permutation by geneset, not phenotype
  • Metric for ranking genes? Options are signal-to-noise ratio (S2N), Ratio, T-test, Pearson, etc.
  • What are the phenotype labels? Mesenchymal and Immuno? Error: No templates specified. Please load a CLS file and choose the phenotype labels. You have to make a .cls file specifying samples, labels, etc. But I don't have such a file. Do I make one?

I googled pre-ranked GSEA and...

2. Running the code

  • From Ruth Isserlin's Example of running GSEA from R for Course BCB420H1S
  • Plan:
  1. Assign ranks to genes based on -log10(pvalue) * sign(logFC). Do I already have this ranked data?
  2. Call Java jar in R. New plan: run analysis on GSEA app
  3. Run GSEA and look at index file (time unknown)
  4. Download Bader lab geneset (.gmt) with HGNC symbols, GO Biological Processes and All Pathways, and no Inferred from Electronic Annotation.
  5. Limit geneset size between 15 and 200 preliminarily.
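As a rough illustration of the size limit in the plan above, here is a minimal Python sketch that reads GMT-style lines (set name, description, then genes, all tab-separated) and keeps sets with 15 to 200 genes. The function names and the synthetic gene sets are hypothetical. Note the assumption that raw set size is what gets filtered: GSEA applies its own size limits after restricting each set to genes present in the ranked list, so the counts can differ.

```python
def parse_gmt(lines):
    """Parse GMT-style lines into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


def filter_by_size(gene_sets, min_size=15, max_size=200):
    """Keep gene sets whose size lies within the inclusive bounds."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }


def make_line(name, n_genes):
    """Build one synthetic GMT line with n_genes placeholder genes."""
    genes = [f"G{i}" for i in range(n_genes)]
    return "\t".join([name, "synthetic example"] + genes) + "\n"


# Synthetic gene sets of 3, 20 and 250 genes.
lines = [make_line("TOO_SMALL", 3), make_line("KEPT", 20), make_line("TOO_BIG", 250)]
print(sorted(filter_by_size(parse_gmt(lines))))
```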

Bader Lab dataset from April 01, 2022

1. Setting up and installing necessary software

  • JDK 18 installed for Mac
  • Install GSEA: GSEA v4.2.3 Mac App
  • Cytoscape already installed
  • Cytoscape apps
    • EnrichmentMap, version 3.3.4: Done
    • clusterMaker2, version 2.2: Done
    • WordCloud, version 3.1.4: Done
    • AutoAnnotate, version 1.3.5: Done