Journal - bcb420-2022/Kilicali_Isildayancan GitHub Wiki

0. Template Journal

  • 0.1 Objective

State the objective of the journal entry.

  • 0.2 Estimated duration

Estimated: XXh

Started: DD-MM-YY, 24:00

Finished: DD-MM-YY, 24:00

  • 0.3 Results and Data

Results, used algorithm, pipeline details etc.

  • 0.4 Conclusions

Conclusions about journal entry

  • 0.5 Extra Material

Media, code, future directions, cross references, etc.

1. Account Set Up

  • 1.1 Objective

Set up the account wiki and repository in GitHub.

  • 1.2 Estimated duration

Estimated: ??h

Started: 12-01-22, 18:00

Finished: 14-01-22, 18:00

  • 1.3 Results and Data

NA for this journal entry

  • 1.4 Conclusions

The GitHub wiki is set up, and the journal and insights! pages are ready for future entries.

  • 1.5 Extra Material

NA

2. Assignment #2 (Differential expression and Thresholded ORA)

  • 2.1 Objective

Conduct differential expression analysis on the normalized GEO dataset, followed by thresholded over-representation analysis (ORA).

  • 2.2 Estimated duration

Estimated: 3-6h for differential, 3-6h for ORA

Started: 14-03-22, 14:00

Finished: 17-03-22, 19:45

  • 2.3 Results and Data

  • 2.3.1 Broken gene names fix

Further cleaned up the normalized data to fix an issue that was noted in the Assignment 1 report but not addressed there. In short, some gene symbols (e.g. MARCHF1) had been auto-converted to dates such as 1-Mar, presumably by Excel. These were fixed semi-manually using a for loop and grep commands in R.
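The kind of fix described above can be sketched in a few lines. This is an illustrative Python version, not the R code that was actually used. The mapping table is an assumption: it only covers the MARCHF and SEPTIN families, and a date such as 1-Mar is ambiguous in principle (older symbols such as MARC1, now MTARC1, turn into the same date), so the output still needs a manual check.

```python
import re

# Assumed mapping from the month abbreviation in the mangled name to the
# current HGNC symbol prefix. Only two gene families are covered here.
MONTH_TO_PREFIX = {"Mar": "MARCHF", "Sep": "SEPTIN"}

# Matches date-like strings such as "1-Mar" or "10-Sep".
DATE_LIKE = re.compile(r"^(\d{1,2})-(Mar|Sep)$")


def restore_symbol(name):
    """Return a gene symbol for a date-mangled name, else the name unchanged."""
    match = DATE_LIKE.match(name)
    if match is None:
        return name
    number, month = match.groups()
    return f"{MONTH_TO_PREFIX[month]}{int(number)}"


# Hypothetical example input: two intact symbols and two mangled ones.
names = ["TP53", "1-Mar", "2-Sep", "GAPDH"]
fixed = [restore_symbol(n) for n in names]
print(fixed)
```

Names that do not look like dates pass through untouched, so the function can be applied to the whole gene column.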

  • 2.3.2 Linear model fit

To evaluate the differential expression of genes, statistical methods such as a linear model fit are used for RNA-seq experiments. The data frame was converted to a matrix and passed through the limma package's lmFit and eBayes functions. Unfortunately, I don't fully understand what's under the hood of these functions, but they provide a p-value for evaluating the significance of the difference in expression of each gene. Since I have 3 different conditions (healthy, resistant, and non-resistant), it might be best to split the data frame into 3 data frames: healthy vs. resistant; healthy vs. non-resistant; resistant vs. non-resistant. This would perhaps make the analysis more explicit (and I could be more sure of it, since I don't know how these functions work). Update: I ran the analysis both with the conditions combined and with the separate comparisons, and it seems that the p-value coverage is better and the heatmap signal looks very strong in all of the cases.
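The three-way split described above can be sketched as follows. This is only an illustration of the splitting step in Python, with made-up sample names and values. It computes a plain difference of group means per gene, which is not what limma does: as far as I understand, lmFit fits a linear model per gene and eBayes moderates the gene-wise variances before computing the test statistics.

```python
from itertools import combinations
from statistics import mean

# Hypothetical toy data: sample name -> condition label (H, R, NR).
groups = {"s1": "H", "s2": "H", "s3": "R", "s4": "R", "s5": "NR", "s6": "NR"}

# Hypothetical normalized log-expression values: gene -> {sample: value}.
expression = {
    "GENE_A": {"s1": 5.0, "s2": 5.2, "s3": 7.1, "s4": 6.9, "s5": 5.1, "s6": 4.9},
    "GENE_B": {"s1": 3.0, "s2": 3.0, "s3": 3.0, "s4": 3.0, "s5": 1.0, "s6": 1.0},
}


def pairwise_comparisons(groups):
    """Map each pair of conditions to the samples belonging to each side."""
    labels = sorted(set(groups.values()))
    comparisons = {}
    for a, b in combinations(labels, 2):
        comparisons[(a, b)] = {
            a: [s for s, g in groups.items() if g == a],
            b: [s for s, g in groups.items() if g == b],
        }
    return comparisons


def mean_difference(values, samples_a, samples_b):
    """Difference of group means (second minus first) on the log scale."""
    return mean([values[s] for s in samples_b]) - mean([values[s] for s in samples_a])


comparisons = pairwise_comparisons(groups)
for (a, b), samples in comparisons.items():
    for gene, values in expression.items():
        diff = mean_difference(values, samples[a], samples[b])
        print(f"{a} vs {b}  {gene}  {diff:+.2f}")
```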

  • 2.3.3 Thresholded Gene enrichment analysis

I tried PANTHER, Enrichr, and DAVID manually with a set of significantly over-expressed genes. All of them produced similar results, and the other functionalities of these servers didn't produce any meaningful information (e.g. I had hoped that DAVID's PubChem and related database searches would turn up at least something to do with 5-HT (serotonin)). I therefore opted for g:Profiler, because the R package was easy to use and it produced very nice interactive Manhattan plots. In the HP (Human Phenotype) part, a lot of neurological disorders came up, but I wasn't sure of the quality of the data, so I decided not to include it. Unfortunately, every under-representation analysis came up with very little information, and there were no significant results for the NRvR condition (both have the disorder, but one is resistant to SSRIs). In the end, ORA didn't produce any new hypotheses but helped support the authors' and mine, although I am still unsure whether this support amounts to cherry-picking from the plethora of results/biological processes it produced.
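For reference, the statistic behind a thresholded over-representation test can be sketched directly. The assumption here is that the test is a one-sided hypergeometric test, which is the usual basis for ORA tools. The function name and the example numbers are made up, and no multiple-testing correction is applied, which the real tools add on top.

```python
from math import comb


def ora_p_value(universe, set_size, query_size, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(universe, set_size, query_size).

    universe   : number of genes that could have been selected
    set_size   : number of those genes annotated to the gene set
    query_size : number of genes in the thresholded (significant) list
    overlap    : number of significant genes that are in the gene set
    """
    upper = min(set_size, query_size)
    total = comb(universe, query_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 2000 genes measured, 50 in the gene set,
# 250 significant genes, 15 of which fall in the gene set.
p = ora_p_value(universe=2000, set_size=50, query_size=250, overlap=15)
print(f"p = {p:.3g}")
```

By chance, about 50 × 250 / 2000 ≈ 6 of the significant genes would be expected in the set, so an overlap of 15 gives a small p-value. With roughly 2500 significant genes against many gene sets, a large number of terms can pass the threshold, which is part of why the result list is so long.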

  • 2.4 Conclusions

In conclusion, differential gene expression analysis and thresholded ORA can be a great way to form new hypotheses, investigate new ideas, and see patterns of difference between samples. Nevertheless, I am skeptical about my gene count (around 2500 significant genes) versus what the authors reported (around 300). Maybe they used a much more stringent approach to find only the most significantly differentially expressed genes. In the end, they only used their expression data to confirm the expression levels of receptors that are known to be directly related to Major Depressive Disorder.

  • 2.5 Extra Material

I think the oligomerization idea with glutamate receptors was neat, and I wish I could follow up on it somehow.

3. Assignment #3 (Gene Set Enrichment Analysis, Cytoscape Visualization)

  • 3.1 Objective

Perform gene set enrichment on a set of differentially expressed genes, and visualize them using the Enrichment Map app in Cytoscape.

  • 3.2 Estimated duration

Estimated: 3-6h for GSEA, ??h for Cytoscape

Started: 30-03-22, 15:00

Finished: 06-04-22, ????

  • 3.3 Results and Data

  • 3.3.1 GSVA & EGSEA

Influenced by the assignment question (what method did you use for non-thresholded GSEA?), I tried something other than the standalone GSEA application. EGSEA failed to load into R, and GSVA turned out to be impossible to understand from its documentation. The main problem was that I didn't know the content and format of the data it expected, and when I got the main algorithm function working, I did not know the meaning and format of the output or how I'd feed it into Cytoscape. Nevertheless, I learned why a GMT file is required, and what the GSEA algorithm mainly does with the differentially expressed genes and the gene set file. Basically, a GMT file is like a dictionary of lists: each key corresponds to a "gene set" (pathway, biological process, some phenotype, etc.), and the list value holds the genes associated with that gene set. I still don't understand the enrichment score (ES) calculation and the random walk behind it, but I imagine that the score reflects how strongly that gene set is represented among the differentially expressed genes. All in all, I opted for plain GSEA, which was incredibly user-friendly (relative to GSVA, at least).
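The "dictionary of lists" view of a GMT file and the random walk behind the ES can be sketched together. This is a simplified Python illustration and not the GSEA implementation. It assumes the standard GMT layout (tab-separated: set name, description, then genes) and the weighted running-sum form of the ES: walking down the ranked gene list, the sum goes up at genes in the set (in proportion to their ranking score) and down at genes outside it, and the ES is the largest deviation from zero. The permutation step used to assess significance is left out, and all names and numbers are made up.

```python
def parse_gmt(lines):
    """Parse GMT lines into {gene_set_name: [genes]}; the description is skipped."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # a valid line has a name, a description, and >= 1 gene
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score.

    ranked   : list of (gene, score) pairs sorted by decreasing score
    gene_set : iterable of gene names
    weight   : exponent applied to |score| for genes in the set
    """
    members = set(gene_set)
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in members)
    miss_count = sum(1 for g, _ in ranked if g not in members)
    if hit_total == 0 or miss_count == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene, score in ranked:
        if gene in members:
            running += abs(score) ** weight / hit_total  # step up on a hit
        else:
            running -= 1.0 / miss_count  # step down on a miss
        if abs(running) > abs(best):
            best = running  # keep the largest deviation from zero
    return best


# Hypothetical two-set GMT content and a four-gene ranked list.
gmt_lines = ["SET_UP\tdemo set\tA\tB\n", "SET_DOWN\tdemo set\tC\tD\n"]
ranked = [("A", 4.0), ("B", 3.0), ("C", 2.0), ("D", 1.0)]

gene_sets = parse_gmt(gmt_lines)
scores = {name: enrichment_score(ranked, genes) for name, genes in gene_sets.items()}
print(scores)
```

A set whose genes sit at the top of the ranking gets a positive ES, and a set whose genes sit at the bottom gets a negative one.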

  • 3.3.2 Cytoscape; Enrichment Map

The output from GSEA was carried on to analysis in Cytoscape. Cytoscape functionality is also accessible from R through RCy3, or through the REST API that is built into the application. Unfortunately, this seems very time-consuming for a case-by-case analysis like the one we did in our projects. Nevertheless, if used frequently, the automation option seems like a good way to process a lot of data beforehand and cut down significantly on the effort that would be spent on manual analysis/annotation. From the Cytoscape interface, the EnrichmentMap app, developed by the Bader Lab (by Professor Isserlin's team), worked smoothly with the data directly from the GSEA output. All of my analysis was carried out through manual annotation, since the original paper that published the RNA-seq data focused on SSRI-induced calcium hyperactivity in treatment-resistant depression (TRD). Of note, the iPSCs that were initially derived from patients probably had a lot of cancer/stemness-related genes differentially regulated, and this is a major source of false positives in the data. Because of this, I did not include any metabolic or replication/cancer-related pathways in the network. This is both because the cells were derived from iPSCs and because there are no published statistics on MDD patients being more or less prone to cancer overall.
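As a starting point for the automation route mentioned above, the following Python sketch checks whether a running Cytoscape instance answers on its REST API. It rests on two assumptions that should be verified against the CyREST documentation for the installed version: that the API listens on localhost port 1234 by default, and that a version endpoint exists under /v1. The helper names are made up for illustration.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen


def cyrest_url(path, host="localhost", port=1234):
    """Build a CyREST URL; the default port 1234 is an assumption."""
    return f"http://{host}:{port}/v1/{path.lstrip('/')}"


def cytoscape_version(timeout=2.0):
    """Return the parsed JSON from the version endpoint, or None if unreachable."""
    try:
        with urlopen(cyrest_url("version"), timeout=timeout) as response:
            return json.load(response)
    except (URLError, OSError):
        return None  # Cytoscape is not running or the API is not reachable
```

Calling cytoscape_version() with Cytoscape open should return a small dictionary of version information, and None otherwise. Building the EnrichmentMap itself through the API is not shown, since the exact command arguments are not something I can state with confidence.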

  • 3.3.3 Signature Gene set Post-analysis

For the post-analysis, I decided to use drugs that are commonly prescribed to treat depression. Out of personal interest, I also included ketamine and esketamine (the S-enantiomer of ketamine) in the analysis, due to their significance and recently discovered benefits. (I will not go into detail here, but briefly: I do not condone the banning of substances that have become taboo because of misuse and abuse. This interest also comes from enlightening talks with a UofT colleague who works in a neurology laboratory on neurotransmitter receptors and mental health disorders.) Interestingly, escitalopram and citalopram did not pass the significance test (along with various other classes of antidepressants), but I included them to have a reference point. From seeing the drug interactions with the gene sets in the network, I hypothesize that depression might be, in essence, similar to cancer: the end phenotype is classifiable as a disorder (although this classification is not sufficiently quantitative, and is controversial even in the literature), but the contributors are so different that "averaging and seeing a signal" is not feasible. Another hypothesis is that the treatments that offer remission in depressive episodes do so by alleviating some other critical factor in the brain, not by interacting with the underlying cause. This opinion stems from the fact that many antidepressants act on different pathways but can produce the same remission. It is important to note that the pharmacological/psychiatric industry puts the threshold for antidepressants at a 50% reduction in symptoms; to me, that low threshold carries an underlying "as best as we can do with our understanding of the symptoms and the disease" statement.

  • 3.4 Conclusions

In conclusion, the network analysis supported the authors' experiments on resistant and remitting phenotypes of neural cells derived from patient iPSCs. The enriched gene set network for SSRI-resistant (NR) cells had gene sets related to higher synaptic response, especially ones connected to a theme network of calcium-related synaptic activity. This aligns well with the phenotype seen in the authors' experiments. Additionally, most of the well-known and common antidepressants (TCAs, SSRIs, MAOIs, benzodiazepines, and augmentors (atypical antidepressants)) had many shared and distinct pathway interactions with various gene sets in all three networks (HvR, HvNR, NRvR).