Assignment 2 - bcb420-2025/Keren_Zhang GitHub Wiki
- Data Import
- Load Normalized Data
- Differential Gene Expression
- Thresholded over-representation analysis
- Interpretation
Date: March 3, 2025 - March 10, 2025
Estimated Time: 25 hour
Time Taken: 30 hours
Before the import begins, these are the R packages that I used for this assesment:
Libraries
library(GEOquery)
library(knitr)
library(ggplot2)
library(edgeR)
library(biomaRt)
library(dplyr)
library(tidyr)
library(limma)
library(circlize)
library(gprofiler2)
library(ComplexHeatmap)
Then, I used the getGEO
function from the GEOquery
package to fetch the dataset associated with the provided GEO Series ID. The function is set with GSEMatrix=FALSE to retrieve the full GEO accession object, not just the expression data matrix.
data_set_geoid <- "GSE276387"
# Obtain Data from GEO
gse <- getGEO(data_set_geoid ,GSEMatrix=FALSE)
#Sumamry of Dataset
gse@header$summary
Platforms Information:
- Determining the number of platforms associated with the dataset using
GPLList(gse)
and prints this information. - Retrieves the IDs of the first and second platforms and prints their titles, last update dates, and the organisms they are based on.
normalized_luad_data <- read.csv("normalized_luad_log2.csv")
data_set_geoid <- "GSE276387"
The normalized dataset includes 5 Patient Derived Organoid (PDO) samples, 5 Metastasis Derived Organoid (MDO) samples, and 4 Human Tumor samples. These sample types will serve as the primary factors in the DGE analysis due to their direct relevance and distinct biological origins.
As Limma
has been updated to include improved tools for RNASeq data, it will be used to generate the Multi-Dimensional Scaling (MDS) plot for the normalized data.
luad_matrix <- normalized_luad_data[, -1]
rownames(luad_matrix) <- normalized_luad_data$X # Since 'X' contains gene names or IDs
colnames(luad_matrix) <- colnames(normalized_luad_data[, 2:ncol(normalized_luad_data)])
limma::plotMDS(luad_matrix, labels=NULL, pch=1,
col = c("blue",
"purple", "green")[factor(sample_type_df$`Cell Type`)])
title("Figure 1: MDS plot of different sample types of lung cancer")
legend("topleft",
legend=levels(factor(sample_type_df$`Cell Type`)),
pch=c(1), col=c("blue", "purple", "green"),
title = "Sample Types", bty='n', cex=0.75)
# Ensure that the factor is correctly ordered if necessary
sample_type_df$`Cell Type` <- factor(sample_type_df$`Cell Type`, levels = c("Tumor", "MDO", "PDO"))
# Create the model design matrix
model_design <- model.matrix(~ sample_type_df$`Cell Type`)
# Display the first few rows of the design matrix
model_design[1:5,]
To evaluate the differential expression across different cell types in the lung adenocarcinoma dataset, I performed the following steps using the edgeR
package in R:
-
Data Preparation:
- Loaded count data into a
DGEList
object, specifying groups according to cell type (Tumor
,MDO
,PDO
).
d = DGEList(counts=luad_matrix, group=sample_type_df$`Cell Type`)
- Loaded count data into a
-
Estimation of Dispersion:
- Dispersion parameters were estimated to account for variability across samples not explained by the experimental design alone.
d <- estimateDisp(d, model_design)
-
Model Fitting:
- Fitted a generalized linear model to the data to assess differences in gene expression.
fit <- glmQLFit(d, model_design)
-
Statistical Tests:
- Conducted likelihood ratio tests comparing MDO and PDO groups to the Tumor group.
qlf.Tumor_vs_MDO <- glmQLFTest(fit, coef='sample_type_df$`Cell Type`MDO') qlf.Tumor_vs_PDO <- glmQLFTest(fit, coef='sample_type_df$`Cell Type`PDO')
# For MDO vs. Tumor Comparison
qlf_MDO_hits <- topTags(qlf.Tumor_vs_MDO, sort.by = "PValue",
n = nrow(luad_matrix))
DGE_MDO <- length(which(qlf_MDO_hits$table$PValue < 0.05))
print(paste("# of significantly differentially expressed genes MDO vs. Tumor:", DGE_MDO))
# For PDO vs. Tumor Comparison
qlf_PDO_hits <- topTags(qlf.Tumor_vs_PDO, sort.by = "PValue",
n = nrow(luad_matrix))
DGE_PDO <- length(which(qlf_PDO_hits$table$PValue < 0.05))
print(paste("# of significantly differentially expressed genes PDO vs. Tumor:", DGE_PDO))
In the differential expression analysis performed using edgeR
, a quasi-likelihood model was employed to identify genes that are differentially expressed between different sample types. This type of model is particularly advantageous in RNA-seq data analysis due to its strength in handling overdispersed count data, which is common in sequencing experiments.
A p-value threshold of 0.05 was applied. A p-value of 0.05 is widely recognized as a standard benchmark for identifying statistically significant results. It implies that there is only a 5% chance that the observed differences (or more extreme) would occur if there were no true differences (null hypothesis is true).
2.1 Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why?
The False Discovery Rate (FDR) built into edgeR
is employed as the method for correcting p-values for multiple hypothesis testing. FDR is designed to control the expected proportion of incorrectly rejected null hypotheses (false discoveries). This is especially important in high-throughput data like RNA-seq, where conducting thousands of tests can lead to many false leads if only raw p-values are considered.
EdgeR
calculates the FDR using the Benjamini-Hochberg
procedure by default when performing tests like glmQLFTest
. This method is well-integrated into the output from edgeR
, where the FDR values are directly provided alongside the p-values for each gene tested.
length(which(qlf_MDO_hits$table$FDR < 0.05))
length(which(qlf_PDO_hits$table$FDR < 0.05))
It appears that no genes passed the correction for false discovery rate (FDR) in both comparisons—Tumor vs. MDO (Metastasis-Derived Organoids) and Tumor vs. PDO (Patient-Derived Organoids).
Since no genes has passed thre correction, the MA Plot is created using the pre-correction analysis.
# For MDO vs. Tumor Comparison
edgeR::maPlot(logAbundance=qlf_MDO_hits$table$logCPM,
logFC=qlf_MDO_hits$table$logFC,
de.tags=which(qlf_MDO_hits$table$PValue < 0.05),
lowess=TRUE,
xlab="log(counts per million)",
ylab="log(fold-change)",
main="Figure 2: MA Plot of significantly differentially
expressed genes (PValue < 0.05) for MDO vs Tumor")
legend("topright", title="Significance",
legend=c("not significant",
"significant (PValue < 0.05)"),
col=c("black", "red"), pch=1, cex=0.8)
I will not repeat the code here, but it is essentially the same thing for PDO:
Figure 2 visualizes the differential expression analysis results for genes comparing Metastasis-Derived Organoids (MDO) to Tumor samples. In this plot, the log fold change (logFC) is plotted against the log counts per million (logCPM), providing a clear visualization of how gene expression varies between these two cell types. The red points indicate genes that are significantly differentially expressed with a p-value less than 0.05, highlighting those genes that stand out based on their expression changes, despite none passing the FDR correction for multiple testing. The inclusion of a LOWESS line (locally weighted scatterplot smoothing) helps to identify overall trends in the data, emphasizing systematic shifts in gene expression or potential biases.
Similarly, the MA plot in Figure 3 for the comparison between Patient-Derived Organoids (PDO) and Tumor samples illustrates the log fold changes versus logCPM. Here again, genes with p-values less than 0.05 are marked in red, indicating significant differential expression before correction for multiple comparisons.
# Basic volcano plot setup
plot(qlf_MDO_hits$table$logFC, -log10(qlf_MDO_hits$table$PValue),
main="Figure 4: Volcano Plot of Significantly Differentially
Expressed Genes MDO vs Tumor",
xlab="Log(fold change)",
ylab="-log10(p-value)",
pch=20, col="grey") # Start with all points in grey
# Highlighting significantly differentially expressed genes by p-value
points(qlf_MDO_hits$table$logFC[which(qlf_MDO_hits$table$PValue < 0.05)],
-log10(qlf_MDO_hits$table$PValue)[which(qlf_MDO_hits$table$PValue < 0.05)],
col="blue", pch=20)
# Highlighting significantly differentially expressed genes by FDR
points(qlf_MDO_hits$table$logFC[which(qlf_MDO_hits$table$FDR < 0.05)],
-log10(qlf_MDO_hits$table$PValue)[which(qlf_MDO_hits$table$FDR < 0.05)],
col="red", pch=20)
# Adding a legend to the plot
legend("topright", title="Significance",
legend=c("Not significant",
"Significant (PValue < 0.05)",
"Significant (FDR < 0.05)"),
col=c("grey", "blue", "red"), pch=20, cex=0.8)
Likewise for PDO:
As it could be observed from the volcano plot, there are no red points, which parallels the fact that there are no genes that passed the correction.
tops <- rownames(qlf.Tumor_vs_MDO$table)[
qlf.Tumor_vs_MDO$table$PValue < 0.05]
top_genes_MDO_matrix = t(scale(
t(luad_matrix[which(rownames(luad_matrix) %in% tops),]) ))
if(min(top_genes_MDO_matrix) == 0){
heatmap_col = colorRamp2(c( 0, max(top_genes_MDO_matrix)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(top_genes_MDO_matrix), 0,
max(top_genes_MDO_matrix)),
c("blue", "white", "red"))
}
# annotations
unique_celltype <- unique(sample_type_df$`Cell Type`)
unique_celltypecolors <- rainbow(n = length(unique_celltype))
names(unique_celltypecolors) <- unique_celltype
unique_tissuetype <- unique(sample_type_df$`Tissue Type`)
unique_tissuecolors <- rainbow(n = length(unique_tissuetype))
names(unique_tissuecolors) <- unique_tissuetype
ha_pat <- HeatmapAnnotation(df = data.frame(cell_type = sample_type_df$`Cell Type`,
tissue_type = sample_type_df$`Tissue Type`),
col = list(cell_type = unique_celltypecolors,
tissue_type = unique_tissuecolors),
show_legend = TRUE)
current_heatmap <- Heatmap(as.matrix(top_genes_MDO_matrix),
top_annotation = ha_pat,
cluster_rows = TRUE,
cluster_columns = TRUE,
show_row_dend = TRUE,
show_column_dend = TRUE,
col=heatmap_col,
show_column_names = FALSE,
show_row_names = FALSE,
show_heatmap_legend = TRUE,
name="colour scale",
column_title = ("Figure 6: Heatmap of top hits across different cell types"))
current_heatmap
The heatmap was colored using a gradient that transitions from blue (representing lower expression levels) to white (median expression levels), and then to red (higher expression levels). This color scheme was chosen to intuitively display the range of expression levels, where drastic color changes correspond to significant differences in gene expression.
Additionally, I annotated the heatmap with metadata on cell type and tissue type, providing further context to the expression data. Each cell type and tissue type was assigned a unique color, and these annotations were aligned with the corresponding columns in the heatmap. This allows for easy observation of patterns or clusters in gene expression relative to specific cell or tissue types, which could indicate shared biological functions or conditions influencing gene expression.
The resulting heatmap, titled 'Figure 6: Heatmap of top hits across different cell types,' provides a comprehensive visual summary of the data, highlighting the gene expression variances across cell types and potentially pointing to biological mechanisms involved in tumor progression and metastasis. By clustering both the rows (genes) and columns (samples), the heatmap not only displays the gene expression but also suggests groups of genes or samples that behave similarly, which could be critical for identifying new targets for therapeutic intervention or for better understanding the molecular underpinnings of cancer metastasis."
Based on the heatmap provided in Figure 6, which displays gene expression patterns across different cell and tissue types, it is observable that conditions based on cell types (Tumor, MDO, and PDO) show specific clustering patterns, particularly with MDO and PDO often clustering together. This observation is further enhanced by including tissue type annotations, which offer deeper insights into the distribution and relationships of gene expression variations.
For Cell Type Clustering, Metastasis-Derived Organoids (MDO) and Patient-Derived Organoids (PDO) often appear to cluster together. This might suggest that the gene expression profiles in organoids, regardless of their derivation from primary or metastatic sites, share similarities likely due to common in vitro culture conditions or inherent properties of organoid biology. Both MDOs and PDOs are grown under similar laboratory conditions, which may standardize certain aspects of their gene expression, aligning them more closely with each other than with their original tumor counterparts. Tumor samples sometimes form distinct clusters apart from MDOs and PDOs, indicating unique expression profiles that might be reflective of the more complex in vivo tumor microenvironment.
Task: With your significantly up-regulated and down-regulated set of genes run a thresholded gene set enrichment analysis
# create non-thresholded gene sets
MDO_nt_geneset <- qlf_MDO_hits$table
MDO_nt_geneset[,"rank"] <- -log10(MDO_nt_geneset$PValue) * sign(MDO_nt_geneset$logFC)
MDO_nt_geneset <- MDO_nt_geneset[order(MDO_nt_geneset$rank),]
PDO_nt_geneset <- qlf_PDO_hits$table
PDO_nt_geneset[,"rank"] <- -log10(PDO_nt_geneset$PValue) * sign(PDO_nt_geneset$logFC)
PDO_nt_geneset <- PDO_nt_geneset[order(PDO_nt_geneset$rank),]
g:Profiler will be used to run over-representation analysis as that is the method taught in class, and I had practiced this tool using instructions from the online tutorial.
For the over-representation analysis (ORA) of gene sets, the annotation data sources chosen were "GO:BP" (Gene Ontology: Biological Process), "KEGG" (Kyoto Encyclopedia of Genes and Genomes), "REAC" (Reactome Pathways), and "WP" (WikiPathways).
WP, GO:BP, and REAC were suggested in the tutorial. Using GO:BP helps identify specific biological processes that are enriched in the gene set, providing insights into the biological mechanisms potentially altered in the study's conditions. Reactome offers an in-depth look at various biological pathways, including more comprehensive coverage of pathway reactions and interactions than some other resources. WikiPathways allows for the integration of the most current and frequently updated pathway information, contributed by a global community of researchers.
KEGG was added due to its excellence for understanding the complex interactions of pathways and how various genes contribute to specific metabolic and signaling pathways.
version_info <- gprofiler2::get_version_info(organism = "hsapiens")
gobp <- version_info$sources$`GO:BP`
# KEGG Pathways
kegg <- version_info$sources$KEGG
# Reactome Pathways
reactome <- version_info$sources$REAC
# WikiPathways
wikipathways <- version_info$sources$WP
The g:Profiler
version used:
-
g:Profiler
version:r version_info$gprofiler_version
-
biomaRt
version:r version_info$biomart_version
The annotation sources used in query:
-
GO:BP
versionr gobp$version
-
REAC
versionr version_info$sources$REAC$version
-
WP
versionr version_info$sources$WP$version
-
KEGG
versionr version_info$sources$KEGG$version
4.1 Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
Question 3 and 4 will be performed together for better presentation and easier interpretation.
For the MDO group:
up_genes_MDO <- rownames(MDO_nt_geneset)[which(MDO_nt_geneset$PValue < 0.05
& MDO_nt_geneset$logFC > 0)]
down_genes_MDO <- rownames(MDO_nt_geneset)[which(MDO_nt_geneset$PValue < 0.05
& MDO_nt_geneset$logFC < 0)]
all_genes_MDO <- rownames(MDO_nt_geneset)[which(MDO_nt_geneset$PValue < 0.05)]
# Run ORA for up-regulated genes
gost_up <- gprofiler2::gost(
query = up_genes_MDO,
organism = "hsapiens",
correction_method = "fdr",
exclude_iea = TRUE,
ordered_query = FALSE,
significant = FALSE,
sources = c("GO:BP", "KEGG", "REAC", "WP")
)
# Run ORA for down-regulated genes
gost_down <- gprofiler2::gost(
query = down_genes_MDO,
organism = "hsapiens",
correction_method = "fdr",
exclude_iea = TRUE,
ordered_query = FALSE,
significant = FALSE,
sources = c("GO:BP", "KEGG", "REAC", "WP")
)
# Run ORA for all genes
gost_all <- gprofiler2::gost(
query = all_genes_MDO,
organism = "hsapiens",
correction_method = "fdr",
exclude_iea = TRUE,
ordered_query = FALSE,
significant = FALSE,
sources = c("GO:BP", "KEGG", "REAC", "WP")
)
# Create a data frame summarizing the ORA results
gost_summary <- data.frame(
Category = c("Upregulated Genes", "Downregulated Genes", "All Genes"),
Gene_Sets = c(nrow(gost_up$result), nrow(gost_down$result), nrow(gost_all$result))
)
knitr::kable(gost_summary,
caption = "Table 1: Summary of Gene Set Enrichment Analysis Results for MDO",
align = 'c',
col.names = c("Category", "Number of Gene Sets"))
# Filter gene sets with a minimum of 5 genes and a maximum of 500 genes
gost_up_filtered <- gost_up$result[gost_up$result$term_size >= 5 & gost_up$result$term_size <= 500, ]
gost_down_filtered<- gost_down$result[gost_down$result$term_size >= 5 & gost_down$result$term_size <= 500, ]
gost_all_filtered <- gost_all$result[gost_all$result$term_size >= 5 & gost_all$result$term_size <= 500, ]
# Create a data frame summarizing the ORA results
gost_summary <- data.frame(
Category = c("Upregulated Genes", "Downregulated Genes", "All Genes"),
Gene_Sets = c(nrow(gost_up_filtered), nrow(gost_down_filtered), nrow(gost_all_filtered))
)
knitr::kable(gost_summary,
caption = "Table 2: Summary of Gene Set Enrichment Analysis Results for MDO After Filtering",
align = 'c',
col.names = c("Category", "Number of Gene Sets"))
To Visualize Top Regulated Genes:
knitr::kable(gost_up_filtered[1:10, c("term_size", "term_name")],
row.names = FALSE,
caption = "**Table 3: Top significant gene sets for upregulated genes for the MDO Group**")
knitr::kable(gost_all_filtered[1:10, c("term_size", "term_name")],
row.names = FALSE,
caption = "**Table 5: Top significant gene sets for all genes for the MDO Group**")
Likewise for the PDO so it will not be repeated here.
1. Do the over-representation results support conclusions or mechanism discussed in the original paper?
The Original Paper mentioned that they founds overexpressed genes (TFPI2, IL1B, SAA1, and SAA2) and underexpressed genes (BTG2 and FGFBP1) in metastases. Thus, the ORA results do provide support to some extent:
Upregulated Genes: The significant enrichment in terms related to "CD8-positive, alpha-beta T cell activation" and various forms of T cell proliferation (e.g., "alpha-beta T cell proliferation") suggests an active immune response or an immune-related mechanism in the MDO context. The mention of genes like TFPI2 and IL1B, which are known for their roles in inflammation and immune responses, correlates well with these enriched immune-related pathways. TFPI2 is known for its anti-inflammatory properties, while IL1B is a critical cytokine in the immune response.
The presence of pathways such as "amyloid-beta clearance" might indicate additional cellular functions that are upregulated in metastases, possibly related to cellular stress responses or other neuroinflammatory processes.
Downregulated Genes: The downregulation of pathways like "regulation of DNA repair" and "regulation of MAP kinase activity" might reflect changes in cell cycle control and DNA damage response, supporting the downregulation of genes like BTG2 which is known for its role in cell cycle regulation and tumor suppression. The "regulation of cellular response to stress" and "regulation of cell cycle G1/S phase transition" being affected aligns with the downregulation of FGFBP1, a gene implicated in cell proliferation and extracellular matrix interaction during cancer progression.
2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.
Studies have shown that CD8+ T cells play a crucial role in the immune surveillance of tumors and can be activated in response to tumor antigens[@GattiMays2021real]. The enrichment of T cell activation pathways suggests a possible immune evasion mechanism or a heightened immune response in the metastatic niche that could be related to the overexpression of inflammatory genes like IL1B.
Research indicates that disruptions in DNA repair mechanisms are a hallmark of many cancers, leading to genomic instability and tumor progression [@hamidi2018every]. The downregulation of these pathways in MDOs might reflect a compromised DNA repair capacity typical of more aggressive or advanced-stage cancers, supporting the observed downregulation of BTG2.
The MAP kinase pathway is integral to cell growth and differentiation signals. Its dysregulation is linked to various cancers (Source: Cancer Discovery, 2015, 5(7), 730–743. DOI: 10.1158/2159-8290.CD-14-0736). The downregulation of this pathway could indicate a disruption in these signaling processes, consistent with the cancer progression and metastatic behavior noted in the study.