Assignment #3: Data set Pathway and Network Analysis - bcb420-2023/Angela_Ng GitHub Wiki

Objective

Conduct non-thresholded gene set enrichment analysis.
Visualize gene set enrichment analysis in Cytoscape.
Perform interpretation and view results in greater detail.
Perform a post-analysis looking at one of:

Specific transcription factors, microRNAs or drugs.
Choose a specific pathway or theme to investigate in more detail.
Look at the dark matter.

Duration

Estimated time: 12 hours
Actual time: 16 hours

Procedure

Get the differential expression data from Assignment #2 and calculate the ranked gene list.
Use GSEA in R to get a set of enriched genesets.
Answer questions in GSEA section of assignment.
In Cytoscape use the EnrichmentMap app to create an enrichment map.
Install WikiPathways in Cytoscape to annotate the network with the AutoAnnotate app.
Collapse the network with AutoAnnotate's "Collapse All" feature.
Answer questions regarding the Cytoscape portion.
Answer questions in the Interpretation and detailed view of results section of assignment.
Download the Bader Lab gene sets for DrugTargets under Human.
Create gmt file with genes of interest reported from literature used to treat heart disease.
Add the selected drugs as signature gene sets to the network.
Answer post-analysis section questions.
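
The ranked gene list in the first step can be sketched as follows. This is a minimal illustration, assuming the common signed-significance rank, sign(logFC) × −log10(p-value); the journal does not state which formula was used in the assignment. The function names and the example rows are hypothetical and are not taken from the assignment's data.

```python
import math


def rank_score(log_fc, p_value, min_p=1e-300):
    """Signed significance: sign(logFC) * -log10(p-value) (assumed formula)."""
    p = max(p_value, min_p)  # guard against log10(0)
    sign = (log_fc > 0) - (log_fc < 0)
    return sign * -math.log10(p)


def ranked_list(rows):
    """rows: iterable of (gene, log_fc, p_value); returns (gene, score) pairs, highest score first."""
    scored = [(gene, rank_score(log_fc, p)) for gene, log_fc, p in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def to_rnk(ranked):
    """Format the ranked list as tab-separated text, one gene per line."""
    return "\n".join(f"{gene}\t{score:.4f}" for gene, score in ranked) + "\n"


# Hypothetical rows for illustration only.
example = [("GENE_A", 2.0, 0.001), ("GENE_B", -1.5, 0.0001), ("GENE_C", 0.3, 0.5)]
ranked = ranked_list(example)
print(to_rnk(ranked), end="")
```

Strongly up-regulated genes end up at the top of the list and strongly down-regulated genes at the bottom, which is the ordering a non-thresholded enrichment analysis expects.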

Results

The non-thresholded gene set enrichment analysis showed dominant themes in gene expression, specifically translation in the mitochondria and metabolism of molecules important to the extracellular matrix. These results are similar to those of the thresholded analysis but are more specific. They are in accordance with those of the original authors and other publications.

The network generated from the non-thresholded GSEA showed these themes and additional themes in epidermal growth factor, arrhythmia, cell signalling, and the immune system. These are reasonable given the diseases being studied here, dilated cardiomyopathy (DCM) and ischemic cardiomyopathy (ICM). Annotating this network with the drugs verapamil, lidocaine, mexiletine, sacubitril, valsartan, digoxin, atenolol, nebivolol, and sotalol showed that lidocaine had the most interactions. Lidocaine interacted with an epidermal growth factor receptor. This is consistent with the expected drug targets, as most of these drugs are aimed at making it easier for the heart to circulate blood through the body or at treating arrhythmias.
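
Signature gene sets such as these drug target sets are supplied in GMT format, where each line holds a set name, a description, and the member genes, separated by tabs. The sketch below writes such a file. The helper names, the drug name, and the gene symbols are hypothetical placeholders, not the targets used in the assignment.

```python
def gmt_line(name, description, genes):
    """Build one GMT line: name<TAB>description<TAB>gene1<TAB>gene2..."""
    unique = []
    for gene in genes:
        gene = gene.strip().upper()  # normalise symbols (an assumption, not a GMT requirement)
        if gene and gene not in unique:  # drop blanks and duplicates, keep order
            unique.append(gene)
    return "\t".join([name, description, *unique])


def write_gmt(path, gene_sets):
    """gene_sets maps set name -> (description, list of gene symbols)."""
    with open(path, "w", encoding="utf-8") as handle:
        for name, (description, genes) in gene_sets.items():
            handle.write(gmt_line(name, description, genes) + "\n")


# Hypothetical placeholder set; real targets would come from the literature or DrugBank.
drug_sets = {"DRUG_X": ("placeholder drug targets", ["TARGET1", "target2", "TARGET1"])}
print(gmt_line("DRUG_X", *drug_sets["DRUG_X"]))
```

One set per drug keeps each drug as its own signature node when the file is loaded in the post-analysis step.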

Overall, the results in this assignment are consistent with those of the original authors and other literature regarding DCM and ICM. The original authors were interested in the subtleties of ICM and DCM, so they did an analysis for each pair of cohorts (DCM vs ICM, DCM vs NF, and ICM vs NF). This was not done here due to time constraints, but it would be interesting to compare the results, and I suspect doing so would uncover subtleties that differentiate the two cardiomyopathies.

Outlook

Throughout the course of these three assignments I learned what steps are required in an RNA-seq analysis and gained experience working with experimental data published by others. I learned the importance of clear and detailed documentation for reproducibility, which I found was lacking in the data set I chose. I was able to replicate the results of the authors, but not without guessing and checking to see what made sense.

Throughout the assignments I also noticed there was a lot of freedom, such as in choosing the data set and which thresholds or statistical tests best fit our data. This was nice, as I got to experiment and see the different effects. I would like to learn more about why these statistical tests are used, the meaning of the specific parameters, and what makes a good data set.

This experience has also highlighted the importance of familiarizing yourself with the tools and methods, and with what specific parameters mean, before using them, since I found there is a lot of nuance in these analyses. For example, when creating the enrichment map, changing the FDR q-value can drastically change the number of nodes and edges. Knowing the statistical tests and the expected results is also useful for interpreting the results, recognizing when they are unexpected, and knowing where to look when they are.

Overall, I had a positive experience and am happy I was able to replicate some of the results of the authors and other studies. In the future it would be interesting to see how the results would change when changing some parameters, such as the number of clusters or the cutoffs for significance, and to compare pairs of cohorts to see the nuances between DCM and ICM.

Issues Encountered

When annotating the enrichment map with AutoAnnotate, choosing the option to lay out clusters to minimize overlap made the computation take a very long time. Leaving that box unchecked made it much faster.
This slowdown occurred while Cytoscape was not reading in the GSEA results properly. Instead of letting it fill the fields automatically, I set the positive results to the edb folder of the GSEA results and left the negative results empty, since the edb folder already contained those results.
Prior to doing this there were ~4000 nodes.

Key Notes

I initially tried a q-value of 0.1, but this gave very few nodes. Trying to make the graph sparser by decreasing the p- and q-values gave even fewer nodes. To get more nodes and edges in the graph, the q-value needs to be made more permissive in the enrichment map creation step. Once the graph is created, the maximum p- and q-values available for filtering are the values given when the network was created.
To create the summary network, go to the AutoAnnotate panel on the left, click the hamburger menu, then choose the option to create summary networks.
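
The effect of the q-value cutoff on network size can be illustrated with a small sketch. It is a simplification of the node filter and not EnrichmentMap's own code. The function name and the (name, p-value, q-value) rows are hypothetical.

```python
def passing_gene_sets(results, q_max, p_max=1.0):
    """Names of gene sets with p-value <= p_max and FDR q-value <= q_max."""
    return [name for name, p, q in results if p <= p_max and q <= q_max]


# Hypothetical enrichment results: (gene set name, p-value, FDR q-value).
results = [
    ("SET_A", 0.001, 0.01),
    ("SET_B", 0.010, 0.08),
    ("SET_C", 0.020, 0.20),
    ("SET_D", 0.040, 0.45),
]

# A more permissive cutoff at creation time admits more nodes.
node_counts = {q_max: len(passing_gene_sets(results, q_max)) for q_max in (0.05, 0.1, 0.25, 0.5)}

# Once a network is built at q <= 0.1, later filtering can only remove nodes.
built = [row for row in results if row[2] <= 0.1]
after_relaxing = passing_gene_sets(built, q_max=0.5)
print(node_counts, after_relaxing)
```

In this toy example, relaxing the cutoff after creation still leaves only the two sets admitted at creation time, which matches the behaviour described in the note above.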

References

BaderLab. 2023. “Enrichment Map Gene Sets.” 2023. https://baderlab.org/GeneSets.
Dang, Haiming, Yicong Ye, Xiliang Zhao, and Yong Zeng. 2020. “Identification of Candidate Genes in Ischemic Cardiomyopathy by Gene Expression Omnibus Database.” BMC Cardiovascular Disorders 20 (1): 1–10.
DrugBank. 2023a. “Lidocaine.” 2023. https://go.drugbank.com/drugs/DB00281.
DrugBank. 2023b. “Sotalol.” 2023. https://go.drugbank.com/drugs/DB00489.
DrugBank. 2023c. “Verapamil.” 2023. https://go.drugbank.com/drugs/DB00661.
Frangogiannis, Nikolaos G. 2019. “The Extracellular Matrix in Ischemic and Nonischemic Heart Failure.” Circulation Research 125 (1): 117–46.
Jüri Reimand, Veronique Voisin, Ruth Isserlin. 2023. “Pathway Enrichment Analysis and Visualization of Omics Data Using g:profiler, GSEA and Enrichment Map in Cytoscape.” 2023. https://cytoscape.org/cytoscape-tutorials/protocols/enrichmentmap-pipeline/#/.
Korotkevich, Gennady, Vladimir Sukhov, Nikolay Budin, Boris Shpak, Maxim N Artyomov, and Alexey Sergushichev. 2016. “Fast Gene Set Enrichment Analysis.” BioRxiv, 060012.
Mantziari, Lilian, Antonis Ziakas, Ioannis Ventoulis, Vasileios Kamperidis, Leonidas Lilis, Niki Katsiki, Savvato Karavasiliadou, et al. 2012. “Differences in Clinical Presentation and Findings Between Idiopathic Dilated and Ischaemic Cardiomyopathy in an Unselected Population of Heart Failure Patients.” The Open Cardiovascular Medicine Journal 6: 98.
MayoClinic. 2023. “Dilated Cardiomyopathy.” 2023. https://cytoscape.org/cytoscape-tutorials/protocols/enrichmentmap-pipeline/#/.
Morgan, Martin. 2022. BiocManager: Access the Bioconductor Project Package Repository. https://cran.r-project.org/package=BiocManager.
Müller, Kirill, and Hadley Wickham. 2022. Tibble: Simple Data Frames.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.r-project.org/.
Reimand, Jüri, Ruth Isserlin, Veronique Voisin, Mike Kucera, Christian Tannus-Lopes, Asha Rostamianfar, Lina Wadi, et al. 2019. “Pathway Enrichment Analysis and Visualization of Omics Data Using g: Profiler, GSEA, Cytoscape and EnrichmentMap.” Nature Protocols 14 (2): 482–517.
Ricard-Blum, Sylvie, and Serge Perez. 2022. “Glycosaminoglycan Interaction Networks and Databases.” Current Opinion in Structural Biology 74: 102355.
Ritchie, Matthew E, Belinda Phipson, DI Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47.
Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.
Rosenbaum, Andrew N, Katherine E Agre, and Naveen L Pereira. 2020. “Genetics of Dilated Cardiomyopathy: Practical Implications for Heart Failure Management.” Nature Reviews Cardiology 17 (5): 286–97.
Shannon, Paul, Andrew Markiel, Owen Ozier, Nitin S Baliga, Jonathan T Wang, Daniel Ramage, Nada Amin, Benno Schwikowski, and Trey Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13 (11): 2498–2504.
Sweet, Mary E, Andrea Cocciolo, Dobromir Slavov, Kenneth L Jones, Joseph R Sweet, Sharon L Graw, T Brett Reece, et al. 2018. “Transcriptome Analysis of Human Heart Failure Reveals Dysregulated Cell Adhesion in Dilated Cardiomyopathy and Activated Immune Pathways in Ischemic Heart Failure.” BMC Genomics 19: 1–14.
Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation. https://cran.r-project.org/package=dplyr.
Wickham, Hadley. 2017. “Package ‘Tidyr’.” Easily Tidy Data with ‘spread’ and ‘gather()’ Functions.