Assignment 2: Differential Gene Expression and Preliminary ORA - bcb420-2023/Metyu_Melkonyan GitHub Wiki

Part 1: Differential Gene Expression Analysis

Introduction

Having cleaned and validated the dataset, I am ready to analyze and normalize it further. The dataset comes from "Transcriptional control of subtype switching ensures adaptation and growth of pancreatic cancer", published on May 16, 2019, and contains many genes involved in pancreatic cancer (1). Replicated and null values were eliminated from the dataset, which allows a clearer and more concise analysis in the later parts. Working through different analysis methods showed me the importance of data cleaning and of choosing the right data for this assignment. HUGO symbol validation improved the quality of the data and supports the reproducibility of further analyses such as normalization and checking the divergence of gene expression.

The next analyses, in Part 1 and Part 2, consist of a series of analyses and visualizations that show how variably the genes associated with pancreatic cancer are expressed. The measured expression levels of the gene set depend on how well the analysis is done and how well the data is preserved.

Objectives

  • Assessing to what extent the EMT genes are differentially expressed within the dataset

  • Visualizing how the samples cluster by treatment type, EV (empty vector) vs GLI2

  • Applying correction when calculating the expression fold difference between the control cells and the cells containing the GLI2 vector

  • Justifying SPP1 differential expression compared to other genes in the dataset

  • GLI2: A transcription factor closely associated with basal subtype switching in pancreatic ductal adenocarcinoma (PDAC) cells

  • EV: Basal cells containing the empty vector (control)


Figure 1. The edgeR package heatmap: The heatmap allowed a better assessment of gene expression for the individual treatments and the clustered gene sets.
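
As a rough illustration of how a heatmap like this groups samples by treatment, the sketch below row-scales a small simulated expression matrix and clusters the samples with base R. The sample names, gene names, and values are invented for illustration only; the actual figure was built from the normalized dataset, not from this code.

```r
# Simulated log2 expression: 40 hypothetical genes x 6 hypothetical samples
set.seed(42)
samples <- c("EV_1", "EV_2", "EV_3", "GLI2_1", "GLI2_2", "GLI2_3")
n_genes <- 40
expr <- matrix(rnorm(n_genes * length(samples), mean = 8, sd = 0.3),
               nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), samples))

# Shift the first 25 genes upward in the GLI2 samples (assumed effect size)
expr[1:25, 4:6] <- expr[1:25, 4:6] + 3

# Row z-scores: scale() standardizes columns, so transpose before and after
z <- t(scale(t(expr)))

# Hierarchical clustering of the samples on the scaled values
sample_tree <- hclust(dist(t(z)))
sample_cluster <- cutree(sample_tree, k = 2)
print(sample_cluster)

# Uncomment to draw the heatmap with base R graphics
# heatmap(z, scale = "none")
```

If the treatment effect dominates, the EV samples and the GLI2 samples should end up in two separate clusters, which is the pattern the heatmap is meant to reveal.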

Duration

Time estimated: ~5 hours; Time taken: ~8 hours; Date started: 2023-03-10; Completed: 2023-03-12

Conclusion

  • The differential gene expression analysis was conducted with the model created
  • The p-value distribution strengthened the main conclusion of the paper
  • Both upregulated and downregulated genes were found, which supports the hypothesis of the main research RNA-seq analysis


Figure 2. The MA plot as required: The model consists of one class. Unfortunately, the MA plot is not as detailed as it should be. However, our gene of interest, SPP1, lies in the red region that marks the specified p-value threshold.
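
For reference, the two axes of an MA plot can be computed directly from per-group mean expression. The following is a minimal sketch with made-up gene names and values; the `ma_values` helper and the pseudocount of 1 are assumptions for illustration, not part of the original analysis.

```r
# M = log2 fold change (GLI2 vs EV), A = average log2 expression
ma_values <- function(ev, gli2, pseudocount = 1) {
  log_ev <- log2(ev + pseudocount)
  log_gli2 <- log2(gli2 + pseudocount)
  data.frame(A = (log_gli2 + log_ev) / 2,
             M = log_gli2 - log_ev)
}

# Hypothetical mean counts per group for three hypothetical genes
ev_mean   <- c(geneA = 15, geneB = 255, geneC = 63)
gli2_mean <- c(geneA = 63, geneB = 255, geneC = 15)

ma <- ma_values(ev_mean, gli2_mean)
print(ma)

# Uncomment to draw the plot; points above 0 are higher in GLI2
# plot(ma$A, ma$M, xlab = "A (mean log2 expression)", ylab = "M (log2 fold change)")
# abline(h = 0, lty = 2)
```

In this toy input, geneA has M = 2 (four-fold higher in GLI2), geneB has M = 0, and geneC has M = -2.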

Part 2: Preliminary ORA

Objectives

  • Create a model to analyze normalized values from the data
  • Assess fold expression difference among the normalized gene expression values
  • Retrieve the new GSE131222 data from Gene Expression Omnibus (GEO)
  • Observe the downregulation and upregulation associated with the GLI2 vector
  • Analyze both upregulated and downregulated results


Figure 3. The g:Profiler thresholded analysis: The upregulated and downregulated genes are above the threshold. The analysis covers both the downregulated and the upregulated gene sets.

The hard-coded g:Profiler analysis

# GLI2 EMT gene analysis and non-EMT gene analysis
# Upregulated genes as a table. The GO:BP, REAC and WP sources were used.
Upregulated_genes <- read.table(file = file.path(getwd(), "data", "GLI2_emt_Genes.txt"),
                                header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
# Keep the rows from the 4th onward
Upregulated_genes <- Upregulated_genes[4:nrow(Upregulated_genes), ]
Upregulated_Gprofiler_validated <- gprofiler2::gost(query = Upregulated_genes,
                                  organism = "hsapiens",
                                  exclude_iea = TRUE,
                                  sources = c("GO:BP", "REAC", "WP"),
                                  correction_method = "fdr",
                                  ordered_query = FALSE)
# Similarly, the downregulated genes as a table. FDR correction is used to keep the
# analysis stringent and to validate the accuracy of the data.
# The same database sources were used as for the upregulated genes.
Downregulated_genes <- read.table(file = file.path(getwd(), "data", "GLI2_non_emtgenes.txt"),
                                  header = TRUE, stringsAsFactors = FALSE, check.names = FALSE)
# Keep the rows from the 13th onward
Downregulated_genes <- Downregulated_genes[13:nrow(Downregulated_genes), ]
Downregulated_Gprofiler_validated <- gprofiler2::gost(query = Downregulated_genes,
                                  organism = "hsapiens",
                                  exclude_iea = TRUE,
                                  sources = c("GO:BP", "REAC", "WP"),
                                  correction_method = "fdr",
                                  ordered_query = FALSE)
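
To make the statistic behind an over-representation analysis more concrete, here is a small base R sketch of a one-sided hypergeometric test followed by FDR correction. It only illustrates the general idea; g:Profiler's own background set and implementation details may differ, and the `ora_pvalue` helper, term names, term sizes, overlaps, and universe size below are invented numbers.

```r
# P(overlap >= observed) when drawing query_size genes from the universe
ora_pvalue <- function(overlap, term_size, query_size, universe_size) {
  phyper(overlap - 1,
         term_size,                  # genes annotated to the term
         universe_size - term_size,  # genes not annotated to the term
         query_size,                 # genes in the query list
         lower.tail = FALSE)
}

# Hypothetical terms with hypothetical sizes and overlaps
terms <- data.frame(
  term = c("term_A", "term_B", "term_C"),
  term_size = c(200, 150, 300),
  overlap = c(40, 8, 15),
  stringsAsFactors = FALSE
)
query_size <- 1000      # assumed number of genes in the query list
universe_size <- 20000  # assumed number of annotated background genes

terms$p_value <- ora_pvalue(terms$overlap, terms$term_size,
                            query_size, universe_size)
# Benjamini-Hochberg correction across all tested terms
terms$fdr <- p.adjust(terms$p_value, method = "BH")
print(terms[order(terms$fdr), ])
```

Here term_A contains four times more query genes than expected by chance (40 observed vs 10 expected), so it is the only term that stays significant after correction.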

Duration

Time estimated: ~4 hours; Time taken: ~8 hours; Date started: 2023-03-12; Completed: Around the same time as the assignment's due date.

Conclusion

  • The fold change after Dox treatment is almost always higher than in the empty vector (EV)
  • The number of upregulated genes passing the p-value threshold of 0.0005 is 2434. The stringent p-value allowed a better assessment of the authors' conclusion about EMT genes
  • The pathway analysis results allowed me to verify the main hypothesis of the research paper
  • The oncogenes (EMT) were detected
  • Around 1000 upregulated genes and around 700 downregulated genes were found
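
Counts like the ones above come from thresholding a differential expression table. A minimal sketch of that step is shown here, assuming a results table with `logFC` and `PValue` columns (the layout of an edgeR results table); the gene names and values are invented, and only the 0.0005 cutoff is taken from the text.

```r
# Hypothetical differential expression results
de <- data.frame(
  gene   = c("g1", "g2", "g3", "g4", "g5", "g6"),
  logFC  = c(2.5, 1.8, -2.1, 0.2, -1.5, 0.9),
  PValue = c(1e-6, 2e-4, 1e-5, 0.4, 3e-4, 0.03),
  stringsAsFactors = FALSE
)

# Benjamini-Hochberg adjusted p-values, kept alongside the raw ones
de$FDR <- p.adjust(de$PValue, method = "BH")

# Split significant genes by the sign of the fold change
p_cutoff <- 0.0005
upregulated   <- de$gene[de$PValue < p_cutoff & de$logFC > 0]
downregulated <- de$gene[de$PValue < p_cutoff & de$logFC < 0]

cat("Upregulated:", length(upregulated), "\n")
cat("Downregulated:", length(downregulated), "\n")
```

The two resulting gene vectors are the kind of input that can be passed as the `query` to a thresholded enrichment analysis.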

Interpretation

The non-EMT (Epithelial-to-Mesenchymal Transition) genes and EMT genes were detected in the analysis, and further comparison with non-thresholded analysis results will be conducted. P-value assessments were performed, resulting in stringent values that were used for analysis to ensure reliability. The findings support the authors' hypothesis, as most of the upregulated gene pathways are associated with apoptosis and cancerous gene activity, while the downregulated gene pathways are related to neutral and harmless cellular activity. This validates the authors' conclusion and strengthens the main research hypothesis based on the pathway analysis results. The results of the analysis, which showed significant differential gene expression between the treatment groups, align with the role of GLI2, a transcription factor associated with basal subtype switching in pancreatic ductal adenocarcinoma cells, as discussed in the introduction section. Additionally, the analysis of fold expression difference among the normalized gene expression values and the comparison of upregulated and downregulated genes provide further support for the main conclusions of the research paper.

The durations and completion dates of the analysis are provided above to give context to the timeline of the work. Overall, the objectives outlined in the assignment were completed, and the results support the conclusions discussed in the original paper. The analysis also gives insight into the potential implications of the findings for EMT genes and their role in cancer-related cellular processes. Further analysis with non-thresholded results may provide additional insights into gene expression changes and the potential significance of GLI2 in pancreatic ductal adenocarcinoma.

Notes

I had technical difficulties, which I have reported to my TA. They were caused by my low RAM, and my program crashed a couple of times. I suspect the g:Profiler package has high RAM usage. Please check Insights to see the actual problems and the respective solution to each problem I encountered!

References

  1. Adams, Christina R, Htet Htwe Htwe, Timothy Marsh, Aprilgate L Wang, Megan L Montoya, Lakshmipriya Subbaraj, Aaron D Tward, Nabeel Bardeesy, and Rushika M Perera. 2019. “Transcriptional Control of Subtype Switching Ensures Adaptation and Growth of Pancreatic Cancer.” eLife 8: e45313.
  2. Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.
  3. Davis, S., and P. Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (GEO) and BioConductor.” Bioinformatics 23 (14): 1846–47.
  4. Durinck, S., P. Spellman, E. Birney, and W. Huber. 2009. “Mapping Identifiers for the Integration of Genomic Datasets with the R/Bioconductor Package biomaRt.” Nature Protocols 4: 1184–91.
  5. Gu, Zuguang. 2022. “Complex Heatmap Visualization.” iMeta 1 (3): e43.
  6. Gu, Zuguang, Lei Gu, Roland Eils, Matthias Schlesner, and Benedikt Brors. 2014. “Circlize Implements and Enhances Circular Visualization in R.” Bioinformatics 30 (19): 2811–12.
  7. Magliano, Marina Pasca di, Shigeki Sekine, Alexandre Ermilov, Jenny Ferris, Andrzej A Dlugosz, and Matthias Hebrok. 2006. “Hedgehog/Ras Interactions Regulate Early Stages of Pancreatic Cancer.” Genes & Development 20 (22): 3161–73.
  8. R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.r-project.org/.
  9. Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47.
  10. Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26: 139–40.