Assignment 3 - bcb420-2025/Keren_Zhang GitHub Wiki
Table of Contents
Time
Date: March 28 - April 1st, 2025
Estimated Time: 20 hour
Time Taken: 35 hours
Introduction
In their 2024 study, Liu and colleagues focused on a type of lung cancer called lung adenocarcinoma (LUAD), which is one of the most common and deadliest forms of cancer globally. Even after doctors surgically remove the tumor, this cancer often spreads to other parts of the body in about half of the patients. To better understand and find ways to treat this spread, called metastasis, they developed a new experimental model using patient-derived organoids (PDOs). PDOs are three-dimensional cell cultures taken from patients' tumors that act like the actual biological environment of lung cancer. The study also used metastasis-derived organoids (MDOs), taken from different places where the cancer spread, like the brain, diaphragm, liver, and gallbladder, to study and find out more about what causes LUAD to spread and possible ways to treat it.
Previous Work
For this project, the RNA sequencing data used came from three different groups: patient-derived organoids (PDO), metastasis-derived organoids (MDO), and control group human tumor samples. This data was obtained from the Lung Adenocarcinoma (LUAD) study, which is documented and accessible under the GEO accession number GSE276387.
A1
In A1, the orginal 57445
data set was pre-processed and the genes were mapped
to HUGO Gene Nomenclature Committee (HGNC) symbols. Missing values and
duplicated genes were removed to ensure data integrity. Outliers were identified
and removed to enhance the robustness of the analysis. Finally the Trimmed Mean
of M-values (TMM) method was employed for normalization to minimize technical
variance.
The final resulting data set has 17195
genes over 14
samples (5 Patient
Derived Organoid samples, 5 Metastasis Derived Organoid samples, and 4 Human
Tumor Samples).
A2
Differential Gene Expression
In A2, a quasi-likelihood model built into edgeR
which is advantageous in
RNA-seq data analysis due to its strength in handling overdispersed count data,
and a FDR-corrected p value of 0.05
were used. The data revealed 332
significantly differentially expressed genes between Tumor and MDO samples, and
631
between Tumor and PDO samples, indicating more pronounced differences in
gene expression profiles between the tumor and PDO samples than between the
tumor and MDO samples. This suggests that PDOs may retain more of the distinct
cellular characteristics of the original tumors compared to MDOs, possibly due
to differences in the primary and metastatic tumor environments. Despite
numerous significant differences, no genes passed the False Discovery Rate
correction using the Benjamini-Hochberg
method, highlighting the challenges of
managing multiple hypothesis testing in high-throughput data analysis.
Thresholded over-representation analysis
Then g:Profiler
to analyze gene sets from differentially expressed genes in
MDO and PDO. For this analysis, gene sets were categorized into up-regulated,
down-regulated, and all genes to capture the complete spectrum of gene
expression changes.The annotation sources utilized included GO:BP
for
biological processes, KEGG
for metabolic and signaling pathways, REAC
for
comprehensive pathway interactions, and WP
for the latest pathway information.
For MDO, 2855
upregulated and 3058
downregulated gene sets were
identified, with a total of 4507
gene sets across all genes.
For PDO, there were
4533 upregulated,
3966downregulated genes, and a total of
6266` gene sets
across all genes.
Upon filtering for gene sets containing between 5
to 500
genes, the enriched
pathways provided insights into the cellular functions most impacted by gene
expression changes:
In MDO, the top upregulated pathways included T-cell activation and proliferation, indicating an immune response component, whereas downregulated pathways highlighted deficiencies in DNA repair and cell cycle regulation.
In PDO, top pathways for upregulated genes focused on epidermis development and differentiation, which might reflect the tissue-specific origins and adaptations of these organoids. The top pathways for downregulated genes included smooth muscle cell differentiation and various signaling pathways like BMP and MAP kinase.
Load Libraries
if (! requireNamespace("kableExtra", quietly = TRUE)) {
install.packages("kableExtra")
}
if (! requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BiocManager::install("ComplexHeatmap")
}
if (! requireNamespace("circlize", quietly = TRUE)) {
BiocManager::install("circlize")
}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")
#install required R and bioconductor packages for GSEA
tryCatch(expr = { library("RCurl")},
error = function(e) {
install.packages("RCurl")},
finally = library("RCurl"))
library(ggplot2)
library(knitr)
library(kableExtra)
library(dplyr)
library(GEOquery)
# wraps lines that are too long
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=80), tidy=TRUE)
# set default behaviors for all chunks
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Load data
In this assignment, we’ll start with the result from the comparison between
Tumor and PDO (Patient-Derived Organoids),which had 631 genes that are
significantly differentially expressed. It will be loaded it as an .rds
file.
The qlf_PDO_hits.rds
file can be generated by running saveRDS(qlf_PDO_hits,
'qlf_PDO_hits.rds') after running all code chunks in A2
# Render notebook 2
# rmarkdown::render("../A2/A2_KZ.Rmd")
#
# saveRDS(qlf_PDO_hits, 'qlf_PDO_hits.rds')
# load the output of the exact test model from Assignment 2
PDO_hits <- readRDS("qlf_PDO_hits.rds")
PDO_output <- PDO_hits$table
# Calculate the 'rank' for each entry in the PDO_output data frame
# 'rank' is computed as the negative log base 10 of the PValue multiplied by the sign of the logFC (log Fold Change)
# Transforms the P-value into a more interpretable scale and combines it with the direction of change (up or down)
PDO_output[, "rank"] <- -log(PDO_output$PValue, base = 10) * sign(PDO_output$logFC)
# Sort the PDO_output data frame by the 'rank' column in ascending order
# Place the most negatively significant results (significant downregulation) at the top
# and the most positively significant results (significant upregulation) at the bottom
PDO_output <- PDO_output[order(PDO_output$rank), ]
Non-thresholded Gene set Enrichment Analysis
In Assignment 2, we utilized thresholded over-representation analysis (ORA) to explore the LUAD dataset. However, ORA has notable limitations. Firstly, the choice of threshold is subjective, leading to variability in results depending on the threshold set. Different thresholds can significantly alter the analysis outcome, making the selection process somewhat arbitrary. Secondly, ORA has a tendency to overlook subtler signals within the data. This occurs because the method excludes data points that do not meet the predefined threshold, potentially missing important but less pronounced trends or patterns.
To address these drawbacks, for this assignment, the Gene Set Enrichment Analysis (GSEA) will be performed. GSEA offers a more nuanced approach by analyzing data without the need for arbitrary threshold settings. This method allows for the inclusion of all gene sets in the analysis, enhancing our ability to detect both strong and weak signals across the dataset. Through GSEA, we aim to gain a comprehensive understanding of the enrichment patterns within our data, ensuring that no significant patterns are overlooked due to thresholding constraints.
GSEA
There are two important steps for the GSEA analysis. Firstly, a rank file should be created, which serves as the foundation for the analysis:
# Create a rank file (.rnk) for GSEA
rnk <- data.frame(rownames(PDO_output), PDO_output$rank)
colnames(rnk) <- c("GeneName", "rank")
write.table(rnk, file = "./Tumour_vs_PDO.rnk", sep = "\t", col.name = TRUE, row.names = FALSE,
quote = FALSE)
Secondly, a pathway definition file, commonly in the GMT (Gene Matrix Transposed) format, is required. This file compiles curated gene sets representing various biological pathways, processes, or other functional groupings:
# Load GSEA related variables
gsea_jar = "~/projects/A3/GSEA_4.3.3/gsea-cli.sh"
working_dir = "~/projects/A3"
output_dir = "~/projects/A3"
analysis_name = "Tumour_vs_PDO"
rnk_file = "Tumour_vs_PDO.rnk"
dest_gmt_file = ""
# Download the latest pathway definition file from BaderLab released on March 1st, 2025. (As of April 2nd, 2025)
if(dest_gmt_file == ""){
gmt_url = "http://download.baderlab.org/EM_Genesets/current_release/Human/symbol/"
#list all the files on the server
filenames = getURL(gmt_url)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
#get the gmt that has all the pathways and does not include terms
# inferred from electronic annotations(IEA)
#start with gmt file that has pathways only and GO Biological Process only.
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_noPFOCR_no_GO_iea.*.)(.gmt)(?=\">)",
contents, perl = TRUE)
gmt_file = unlist(regmatches(contents, rx))
dest_gmt_file <- file.path(output_dir,gmt_file )
#check if this gmt file already exists
if(!file.exists(dest_gmt_file)){
download.file(
paste(gmt_url,gmt_file,sep=""),
destfile=dest_gmt_file
)
}
}
# Construct the command to run GSEA PreRanked analysis with specified parameters
# cmd <- paste("", gsea_jar,
# "GSEAPreRanked -gmx", dest_gmt_file, # Specify the gene sets file (GMT)
# "-rnk", file.path(working_dir, rnk_file), # Define the ranking file path
# # Analysis settings: no collapsing, 1000 permutations, weighted scoring
# "-collapse false -nperm 1000 -scoring_scheme weighted",
# "-rpt_label", analysis_name, # Label for the report generated by GSEA
# # Plot top 20 sets, seed for reproducibility, max gene set size
# " -plot_top_x 20 -rnd_seed 12345 -set_max 200",
# " -set_min 15 -zip_report false", # Minimum gene set size, no zip compression for the report
# " -out", output_dir, # Output directory for GSEA results
# " > gsea_output.txt", sep=" ") # Redirect output to a text file
#
# # Execute the command using system call
# system(cmd)
Cytoscape
Installation
As an R version of the software is not available currently, Cytoscape Version 3.10.3
for windows is used for this assignment, and enrichment maps is outputted manually.
Inside the software, EnrichmentMap has to be installed form the Cytoscape Appstore.
The following parameters are used for the analysis:
For the Enrichment Map analysis, a total of 321 edges and 1170 nodes are displayed:
AutoAnnotate:
Create Enrichment Map
The enrichment map was created using the following thresholds:
(note that the enrichment, annotation, and rank files are automatically captured
from the GSEA results folder)
-
Enrichment Pos: 'gsea_report_for_na_pos_1743500855381.tsv'
-
Enrichment Neg: 'gsea_report_for_na_neg_1743500855381.tsv'
-
GMT: 'Human_GOBP_AllPathways_noPFOCR_no_GO_iea_March_01_2025_symbol.gmt'
-
Expressions: 'Tumour_vs_PDO.rnk'
-
FDR q-value cutoff 0.1
-
p-value cutoff 0.05
-
gene-set similarity filtering cutoff 0.375 (with default metric: Jaccard+Overlap combined)
The enrichment map shows 321
nodes and 1170
edges.
knitr::include_graphics('Tumour_Vs_PDO.png')
Figure 1. Enrichment map of the significant gene sets and pathways, prior to annotation and manual layout. Each node represent a pathway, with red representing up-regulation and blue representing down-regulation. Edges represent similarity between the pathways with thicker edge corresponding to higher similarity score.
Annotation Parameters
Using the AutoAnnotate App available in Cytoscape, the enrichment map is annotated with the following parameters:
- Max number of annotations: 10
- Cluster Options: Use clusterMaker2 App
- Cluster algorithm: MCL Cluster
- Edge weight column: similarity_coefficient
- Label Column: GS_DESCR
- Label Algorithm: WordCloud: Adjacent Words (default)
- Max words per label: 3
- Minimum word occurrence: 1
- Adjacent word bonus: 8
Publication Ready Figure
knitr::include_graphics("Annotated_EM_pub_ready.png")
Figure 2. Enrichment map of the significant gene sets and pathways, after annotation and layout adjustments. Each node represent a pathway, with red representing up-regulation and blue representing down-regulation. Edges represent similarity between the pathways with thicker edge corresponding to higher similarity score. The Publication-Ready box is checked and figure is manually adjusted to be more compact.
For better readability, here is an image of the Legend:
knitr::include_graphics("Legend.png")
Figure 3. Enlarged Legend for Figure 2.
Theme Network
The theme network is generated by clicking “Collapse All” to group gene sets with a single theme node using the AutoAnnotate function. I manually expanded the top three clusters for better look at the major themes.
knitr::include_graphics("theme.png")
Figure 4. Theme network of the significant gene sets and pathways after collapsing all clusters and expanding the top three clusters. Each node represent a pathway, with red representing up-regulation and blue representing down-regulation. Edges represent similarity between the pathways with thicker edge corresponding to higher similarity score. The Publication-Ready box is checked and figure is manually adjusted to be more compact.
Major Themes Present Looking at the above figure, the top 3 major themes included in the thematic network are:
-
Regulation of Cell Immunity: Immunity plays a crucial role in cancer progression and response to therapy. In lung cancer, immune cells can influence tumor growth and metastasis either by suppressing the tumor through immune surveillance mechanisms or, paradoxically, supporting tumor growth and evasion through various mechanisms. This cluster suggests that gene sets related to immune cell regulation are significantly active, which is highly relevant in the context of immunotherapy strategies for LUAD.
-
Regulation of Chemotaxis Migration: Chemotaxis, the movement of cells in response to chemical stimuli, is a key mechanism in cancer metastasis. This theme might include pathways that facilitate tumor cell migration or the recruitment of immune cells to the tumor microenvironment. Understanding these pathways can help in identifying potential targets for preventing metastasis or enhancing immune cell infiltration into tumors.
-
Anti-favouring Leishmania: This theme seems initially unrelated to lung cancer; however, it might highlight underlying immune responses or signaling pathways that are similarly exploited in different disease contexts, such as infectious diseases and cancer. The immune system's mechanisms to combat pathogens (like Leishmania) might share pathways with those engaged in battling cancer cells, potentially offering insights into nonspecific immune activation or suppression observed in lung cancer. This theme could represent a novel pathway or a niche area of study within the broader research context. It could introduce new perspectives or mechanisms previously unconsidered or underexplored in the model.
The identified themes of "Regulation of Cell Immunity" and "Regulation of Chemotaxis Migration" fit exceptionally well with the model of modeling lung adenocarcinoma metastases using patient-derived organoids. This model explores the tumor microenvironment and the interaction between cancer cells and the immune system, which are crucial for understanding how lung adenocarcinoma spreads and responds to treatments[@liu2024modeling]. The focus on cell immunity is particularly relevant, as it highlights the immune landscape's role in either promoting or inhibiting metastatic spread. Similarly, the theme of chemotaxis migration aligns with the study of how cancer cells migrate and invade new tissues—an essential aspect of metastasis. Both themes are integral to developing more effective therapeutic strategies in organoid models, which aim to replicate the complex biological interactions found in patients more accurately than traditional in vitro models.
Conclusion
The enrichment results from the analysis, which include upregulated themes related to skin development processes like "Keratinization," "Formation of the Cornified Envelope," and "Epidermis Development," as well as downregulated themes involving immune interactions and complement pathways, support several conclusions and mechanisms discussed in the original paper on modeling lung adenocarcinoma metastases using patient-derived organoids. The original paper discusses the role of the in vivo PDO metastasis model in elucidating mechanisms of resistance to targeted therapy and in identifying personalized immune-priming strategies. These topics correlate well with the downregulation of immune-related pathways such as "Immunoregulatory Interactions between a Lymphoid and a Non-Lymphoid Cell" and "Complement Cascade." Such downregulation could reflect the complex interplay between the tumor microenvironment and the immune system, which the paper addresses when discussing the use of PDOs cocultured with PBMCs to determine optimal immune-priming strategies. Moreover, the upregulated pathways related to skin development may not directly correspond to lung adenocarcinoma's typical pathways but could signify epithelial-related processes that are active in the organoid models. This might suggest a degree of organoid adaptation or specific signaling pathways being studied that are also active in skin-related tissues or processes of epithelial differentiation that are common in various carcinomas, including lung adenocarcinoma. Therefore, while the direct connection between skin development pathways and lung adenocarcinoma isn't immediately apparent, the mechanisms related to cellular differentiation and immune modulation highlighted in the enrichment results do align with the discussions in the original paper about organoid behavior, metastatic modeling, and the immune environment's role in cancer progression and response to therapy. This suggests that the organoids may be capturing critical aspects of tumor biology that are relevant both in the direct context of lung cancer and in broader cancer biology.
The comparison between results from Assignment #2's thresholded Over-Representation Analysis (ORA) and Assignment #3's Gene Set Enrichment Analysis (GSEA) highlights distinct thematic focuses due to their methodological differences. ORA results are concentrated primarily on specific cellular activities and signaling pathways such as smooth muscle differentiation and BMP signaling, reflecting the most significantly altered pathways. In contrast, GSEA reveals a broader scope of biological processes, particularly emphasizing immune system interactions and skin development processes. GSEA’s ability to assess enrichment across all ranked genes allows it to capture both dominant and subtler biological themes, thus providing a more comprehensive view of the dataset's underlying biological dynamics compared to the more narrowly focused ORA.
There is evidence from recent research that supports some of the results observed in your analysis, particularly concerning the role of keratinization in lung adenocarcinoma (LUAD). Studies have demonstrated that keratins, which are critical structural components of epithelial cells, play significant roles beyond mere structural functions. They are involved in regulating cell functions and can influence cancer progression by affecting pathways like epithelial-mesenchymal transition (EMT) and the TGF-β signaling pathway, both of which are crucial for cancer metastasis and progression. For instance, a study highlighted the prognostic significance of a keratin gene signature in LUAD, associating high expression with poorer survival outcomes. This signature was linked to enhanced activation of the TGF-β signaling pathway and was correlated with aggressive cancer traits, such as increased EMT and tumor stemness. These findings underscore the importance of keratinization processes in the pathology of LUAD and their potential impact on clinical outcomes, aligning well with the themes of keratinization and EMT processes identified in your GSEA results[@li2023Keratin].
Furthermore, another detailed study integrated transcriptomic, methylomic, and miRNA profiles to identify key regulators of keratinization in lung squamous cell carcinoma, another non-small cell lung cancer subtype closely related to LUAD. This research identified critical genes and pathways that could also be relevant in the context of LUAD due to similar underlying epithelial cell mechanisms[@heryanto2023Identify].
These studies provide a solid foundation of evidence that supports the biological relevance of the pathways identified in your enrichment analysis, reinforcing the potential for these pathways to serve as targets for diagnostics or therapeutic interventions in lung adenocarcinoma.