Assignment 2 - bcb420-2025/Keren

Data Import
Load Normalized Data
Differential Gene Expression
Thresholded over-representation analysis
Interpretation

Time

Date: March 3, 2025 - March 10, 2025

Estimated Time: 25 hour

Time Taken: 30 hours

Data Import

Before the import begins, these are the R packages that I used for this assesment:

Libraries

library(GEOquery)
library(knitr)
library(ggplot2)
library(edgeR)
library(biomaRt)
library(dplyr)
library(tidyr)
library(limma)
library(circlize)
library(gprofiler2)
library(ComplexHeatmap)

Download

Then, I used the getGEO function from the GEOquery package to fetch the dataset associated with the provided GEO Series ID. The function is set with GSEMatrix=FALSE to retrieve the full GEO accession object, not just the expression data matrix.

data_set_geoid <- "GSE276387" 

# Obtain Data from GEO
gse <- getGEO(data_set_geoid ,GSEMatrix=FALSE)

#Sumamry of Dataset
gse@header$summary

Platforms Information:

Determining the number of platforms associated with the dataset using GPLList(gse) and prints this information.
Retrieves the IDs of the first and second platforms and prints their titles, last update dates, and the organisms they are based on.

Load Normalized Data

normalized_luad_data <- read.csv("normalized_luad_log2.csv")

data_set_geoid <- "GSE276387"

Differential Gene Expression

The normalized dataset includes 5 Patient Derived Organoid (PDO) samples, 5 Metastasis Derived Organoid (MDO) samples, and 4 Human Tumor samples. These sample types will serve as the primary factors in the DGE analysis due to their direct relevance and distinct biological origins.

Limma MDS

As Limma has been updated to include improved tools for RNASeq data, it will be used to generate the Multi-Dimensional Scaling (MDS) plot for the normalized data.

luad_matrix <- normalized_luad_data[, -1]

rownames(luad_matrix) <- normalized_luad_data$X  # Since 'X' contains gene names or IDs
colnames(luad_matrix) <- colnames(normalized_luad_data[, 2:ncol(normalized_luad_data)])


limma::plotMDS(luad_matrix, labels=NULL, pch=1,
               col = c("blue", 
                       "purple", "green")[factor(sample_type_df$`Cell Type`)])
title("Figure 1: MDS plot of different sample types of lung cancer")
legend("topleft",
       legend=levels(factor(sample_type_df$`Cell Type`)),
       pch=c(1), col=c("blue", "purple", "green"),
       title = "Sample Types", bty='n', cex=0.75)

Model Setup

# Ensure that the factor is correctly ordered if necessary
sample_type_df$`Cell Type` <- factor(sample_type_df$`Cell Type`, levels = c("Tumor", "MDO", "PDO"))

# Create the model design matrix
model_design <- model.matrix(~ sample_type_df$`Cell Type`)

# Display the first few rows of the design matrix
model_design[1:5,]

DGE - EdgeR

1.1 Calculate p-values for each of the genes in your expression set.

To evaluate the differential expression across different cell types in the lung adenocarcinoma dataset, I performed the following steps using the edgeR package in R:

Data Preparation:
- Loaded count data into a DGEList object, specifying groups according to cell type (Tumor, MDO, PDO).
```
d = DGEList(counts=luad_matrix, group=sample_type_df$`Cell Type`)
```
Estimation of Dispersion:
- Dispersion parameters were estimated to account for variability across samples not explained by the experimental design alone.
```
d <- estimateDisp(d, model_design)
```
Model Fitting:
- Fitted a generalized linear model to the data to assess differences in gene expression.
```
fit <- glmQLFit(d, model_design)
```

Statistical Tests:

Conducted likelihood ratio tests comparing MDO and PDO groups to the Tumor group.

qlf.Tumor_vs_MDO <- glmQLFTest(fit, coef='sample_type_df$`Cell Type`MDO')
qlf.Tumor_vs_PDO <- glmQLFTest(fit, coef='sample_type_df$`Cell Type`PDO')

1.2 How many genes were significantly differentially expressed?

# For MDO vs. Tumor Comparison
qlf_MDO_hits <- topTags(qlf.Tumor_vs_MDO, sort.by = "PValue",
                  n = nrow(luad_matrix))

DGE_MDO <- length(which(qlf_MDO_hits$table$PValue < 0.05))

print(paste("# of significantly differentially expressed genes MDO vs. Tumor:", DGE_MDO))

# For PDO vs. Tumor Comparison
qlf_PDO_hits <- topTags(qlf.Tumor_vs_PDO, sort.by = "PValue",
                  n = nrow(luad_matrix))

DGE_PDO <- length(which(qlf_PDO_hits$table$PValue < 0.05))

print(paste("# of significantly differentially expressed genes PDO vs. Tumor:", DGE_PDO))

1.3 What thresholds did you use and why

In the differential expression analysis performed using edgeR, a quasi-likelihood model was employed to identify genes that are differentially expressed between different sample types. This type of model is particularly advantageous in RNA-seq data analysis due to its strength in handling overdispersed count data, which is common in sequencing experiments.

A p-value threshold of 0.05 was applied. A p-value of 0.05 is widely recognized as a standard benchmark for identifying statistically significant results. It implies that there is only a 5% chance that the observed differences (or more extreme) would occur if there were no true differences (null hypothesis is true).

2.1 Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why?

The False Discovery Rate (FDR) built into edgeR is employed as the method for correcting p-values for multiple hypothesis testing. FDR is designed to control the expected proportion of incorrectly rejected null hypotheses (false discoveries). This is especially important in high-throughput data like RNA-seq, where conducting thousands of tests can lead to many false leads if only raw p-values are considered.

EdgeR calculates the FDR using the Benjamini-Hochberg procedure by default when performing tests like glmQLFTest. This method is well-integrated into the output from edgeR, where the FDR values are directly provided alongside the p-values for each gene tested.

2.2 How many genes passed correction?

length(which(qlf_MDO_hits$table$FDR < 0.05))

length(which(qlf_PDO_hits$table$FDR < 0.05))

It appears that no genes passed the correction for false discovery rate (FDR) in both comparisons—Tumor vs. MDO (Metastasis-Derived Organoids) and Tumor vs. PDO (Patient-Derived Organoids).

3.1 Show the amount of differentially expressed genes using an MA Plot

Since no genes has passed thre correction, the MA Plot is created using the pre-correction analysis.

# For MDO vs. Tumor Comparison
edgeR::maPlot(logAbundance=qlf_MDO_hits$table$logCPM, 
              logFC=qlf_MDO_hits$table$logFC,
              de.tags=which(qlf_MDO_hits$table$PValue < 0.05),
              lowess=TRUE,
              xlab="log(counts per million)",
              ylab="log(fold-change)",
              main="Figure 2: MA Plot of significantly differentially 
              expressed genes (PValue < 0.05) for MDO vs Tumor")
legend("topright", title="Significance",
       legend=c("not significant",
                "significant (PValue < 0.05)"),
       col=c("black", "red"), pch=1, cex=0.8)

I will not repeat the code here, but it is essentially the same thing for PDO:

Figure 2 visualizes the differential expression analysis results for genes comparing Metastasis-Derived Organoids (MDO) to Tumor samples. In this plot, the log fold change (logFC) is plotted against the log counts per million (logCPM), providing a clear visualization of how gene expression varies between these two cell types. The red points indicate genes that are significantly differentially expressed with a p-value less than 0.05, highlighting those genes that stand out based on their expression changes, despite none passing the FDR correction for multiple testing. The inclusion of a LOWESS line (locally weighted scatterplot smoothing) helps to identify overall trends in the data, emphasizing systematic shifts in gene expression or potential biases.

Similarly, the MA plot in Figure 3 for the comparison between Patient-Derived Organoids (PDO) and Tumor samples illustrates the log fold changes versus logCPM. Here again, genes with p-values less than 0.05 are marked in red, indicating significant differential expression before correction for multiple comparisons.

3.2 Show the amount of differentially expressed genes using an Volcano Plot

# Basic volcano plot setup
plot(qlf_MDO_hits$table$logFC, -log10(qlf_MDO_hits$table$PValue),
     main="Figure 4: Volcano Plot of Significantly Differentially 
     Expressed Genes MDO vs Tumor",
     xlab="Log(fold change)",
     ylab="-log10(p-value)",
     pch=20, col="grey")  # Start with all points in grey

# Highlighting significantly differentially expressed genes by p-value
points(qlf_MDO_hits$table$logFC[which(qlf_MDO_hits$table$PValue < 0.05)],
       -log10(qlf_MDO_hits$table$PValue)[which(qlf_MDO_hits$table$PValue < 0.05)],
       col="blue", pch=20)

# Highlighting significantly differentially expressed genes by FDR
points(qlf_MDO_hits$table$logFC[which(qlf_MDO_hits$table$FDR < 0.05)],
       -log10(qlf_MDO_hits$table$PValue)[which(qlf_MDO_hits$table$FDR < 0.05)],
       col="red", pch=20)

# Adding a legend to the plot
legend("topright", title="Significance",
       legend=c("Not significant",
                "Significant (PValue < 0.05)",
                "Significant (FDR < 0.05)"),
       col=c("grey", "blue", "red"), pch=20, cex=0.8)

Likewise for PDO:

As it could be observed from the volcano plot, there are no red points, which parallels the fact that there are no genes that passed the correction.

4.1 Visualize your top hits using a heatmap.

tops <- rownames(qlf.Tumor_vs_MDO$table)[
  qlf.Tumor_vs_MDO$table$PValue < 0.05]

top_genes_MDO_matrix = t(scale(
  t(luad_matrix[which(rownames(luad_matrix) %in% tops),]) ))


if(min(top_genes_MDO_matrix) == 0){
heatmap_col = colorRamp2(c( 0, max(top_genes_MDO_matrix)),
c( "white", "red"))
} else {
heatmap_col = colorRamp2(c(min(top_genes_MDO_matrix), 0,
max(top_genes_MDO_matrix)),
c("blue", "white", "red"))
}


# annotations


unique_celltype <- unique(sample_type_df$`Cell Type`)
unique_celltypecolors <- rainbow(n = length(unique_celltype))
names(unique_celltypecolors) <- unique_celltype


unique_tissuetype <- unique(sample_type_df$`Tissue Type`)
unique_tissuecolors <- rainbow(n = length(unique_tissuetype))
names(unique_tissuecolors) <- unique_tissuetype


ha_pat <- HeatmapAnnotation(df = data.frame(cell_type = sample_type_df$`Cell Type`, 
                                            tissue_type = sample_type_df$`Tissue Type`),
                                            col = list(cell_type = unique_celltypecolors, 
                                                       tissue_type = unique_tissuecolors),
                                            show_legend = TRUE)

current_heatmap <- Heatmap(as.matrix(top_genes_MDO_matrix),
                          top_annotation = ha_pat,
                          cluster_rows = TRUE,
                          cluster_columns = TRUE,
                          show_row_dend = TRUE,
                          show_column_dend = TRUE, 
                          col=heatmap_col,
                          show_column_names = FALSE,
                          show_row_names = FALSE,
                          show_heatmap_legend = TRUE,
                          name="colour scale",
                          column_title = ("Figure 6: Heatmap of top hits across different cell types"))
current_heatmap

The heatmap was colored using a gradient that transitions from blue (representing lower expression levels) to white (median expression levels), and then to red (higher expression levels). This color scheme was chosen to intuitively display the range of expression levels, where drastic color changes correspond to significant differences in gene expression.

Additionally, I annotated the heatmap with metadata on cell type and tissue type, providing further context to the expression data. Each cell type and tissue type was assigned a unique color, and these annotations were aligned with the corresponding columns in the heatmap. This allows for easy observation of patterns or clusters in gene expression relative to specific cell or tissue types, which could indicate shared biological functions or conditions influencing gene expression.

The resulting heatmap, titled 'Figure 6: Heatmap of top hits across different cell types,' provides a comprehensive visual summary of the data, highlighting the gene expression variances across cell types and potentially pointing to biological mechanisms involved in tumor progression and metastasis. By clustering both the rows (genes) and columns (samples), the heatmap not only displays the gene expression but also suggests groups of genes or samples that behave similarly, which could be critical for identifying new targets for therapeutic intervention or for better understanding the molecular underpinnings of cancer metastasis."

4.2 Do your conditions cluster together? Explain why or why not.

Based on the heatmap provided in Figure 6, which displays gene expression patterns across different cell and tissue types, it is observable that conditions based on cell types (Tumor, MDO, and PDO) show specific clustering patterns, particularly with MDO and PDO often clustering together. This observation is further enhanced by including tissue type annotations, which offer deeper insights into the distribution and relationships of gene expression variations.

For Cell Type Clustering, Metastasis-Derived Organoids (MDO) and Patient-Derived Organoids (PDO) often appear to cluster together. This might suggest that the gene expression profiles in organoids, regardless of their derivation from primary or metastatic sites, share similarities likely due to common in vitro culture conditions or inherent properties of organoid biology. Both MDOs and PDOs are grown under similar laboratory conditions, which may standardize certain aspects of their gene expression, aligning them more closely with each other than with their original tumor counterparts. Tumor samples sometimes form distinct clusters apart from MDOs and PDOs, indicating unique expression profiles that might be reflective of the more complex in vivo tumor microenvironment.

Thresholded over-representation analysis

Task: With your significantly up-regulated and down-regulated set of genes run a thresholded gene set enrichment analysis

# create non-thresholded gene sets
MDO_nt_geneset <- qlf_MDO_hits$table
MDO_nt_geneset[,"rank"] <- -log10(MDO_nt_geneset$PValue) * sign(MDO_nt_geneset$logFC)
MDO_nt_geneset <- MDO_nt_geneset[order(MDO_nt_geneset$rank),]

PDO_nt_geneset <- qlf_PDO_hits$table
PDO_nt_geneset[,"rank"] <- -log10(PDO_nt_geneset$PValue) * sign(PDO_nt_geneset$logFC)
PDO_nt_geneset <- PDO_nt_geneset[order(PDO_nt_geneset$rank),]

1.1 Which method did you choose and why?

g:Profiler will be used to run over-representation analysis as that is the method taught in class, and I had practiced this tool using instructions from the online tutorial.

2.1 What annotation data did you use and why?

For the over-representation analysis (ORA) of gene sets, the annotation data sources chosen were "GO:BP" (Gene Ontology: Biological Process), "KEGG" (Kyoto Encyclopedia of Genes and Genomes), "REAC" (Reactome Pathways), and "WP" (WikiPathways).

WP, GO:BP, and REAC were suggested in the tutorial. Using GO:BP helps identify specific biological processes that are enriched in the gene set, providing insights into the biological mechanisms potentially altered in the study's conditions. Reactome offers an in-depth look at various biological pathways, including more comprehensive coverage of pathway reactions and interactions than some other resources. WikiPathways allows for the integration of the most current and frequently updated pathway information, contributed by a global community of researchers.

KEGG was added due to its excellence for understanding the complex interactions of pathways and how various genes contribute to specific metabolic and signaling pathways.

2.2 What version of the annotation are you using?

version_info <- gprofiler2::get_version_info(organism = "hsapiens")
gobp <- version_info$sources$`GO:BP`
# KEGG Pathways
kegg <- version_info$sources$KEGG

# Reactome Pathways
reactome <- version_info$sources$REAC

# WikiPathways
wikipathways <- version_info$sources$WP

The g:Profiler version used:

g:Profiler version: r version_info$gprofiler_version
biomaRt version: r version_info$biomart_version

The annotation sources used in query:

GO:BP version r gobp$version
REAC version r version_info$sources$REAC$version
WP version r version_info$sources$WP$version
KEGG version r version_info$sources$KEGG$version

3.1 How many genesets were returned with what thresholds?

4.1 Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

Question 3 and 4 will be performed together for better presentation and easier interpretation.

For the MDO group:

up_genes_MDO <- rownames(MDO_nt_geneset)[which(MDO_nt_geneset$PValue < 0.05
                                    & MDO_nt_geneset$logFC > 0)]
down_genes_MDO <- rownames(MDO_nt_geneset)[which(MDO_nt_geneset$PValue < 0.05
                                      & MDO_nt_geneset$logFC < 0)]

all_genes_MDO <- rownames(MDO_nt_geneset)[which(MDO_nt_geneset$PValue < 0.05)]

# Run ORA for up-regulated genes
gost_up <- gprofiler2::gost(
  query = up_genes_MDO,
  organism = "hsapiens",
  correction_method = "fdr",
  exclude_iea = TRUE,
  ordered_query = FALSE,
  significant = FALSE,
  sources = c("GO:BP", "KEGG", "REAC", "WP") 
)

# Run ORA for down-regulated genes
gost_down <- gprofiler2::gost(
  query = down_genes_MDO,
  organism = "hsapiens",
  correction_method = "fdr",
  exclude_iea = TRUE,
  ordered_query = FALSE,
  significant = FALSE,
  sources = c("GO:BP", "KEGG", "REAC", "WP")
)

# Run ORA for all genes
gost_all <- gprofiler2::gost(
  query = all_genes_MDO,
  organism = "hsapiens",
  correction_method = "fdr",
  exclude_iea = TRUE,
  ordered_query = FALSE,
  significant = FALSE,
  sources = c("GO:BP", "KEGG", "REAC", "WP")
)

# Create a data frame summarizing the ORA results
gost_summary <- data.frame(
  Category = c("Upregulated Genes", "Downregulated Genes", "All Genes"),
  Gene_Sets = c(nrow(gost_up$result), nrow(gost_down$result), nrow(gost_all$result))
)

knitr::kable(gost_summary, 
             caption = "Table 1: Summary of Gene Set Enrichment Analysis Results for MDO",
             align = 'c', 
             col.names = c("Category", "Number of Gene Sets"))

# Filter gene sets with a minimum of 5 genes and a maximum of 500 genes
gost_up_filtered <- gost_up$result[gost_up$result$term_size >= 5 & gost_up$result$term_size <= 500, ]
gost_down_filtered<- gost_down$result[gost_down$result$term_size >= 5 & gost_down$result$term_size <= 500, ]
gost_all_filtered <- gost_all$result[gost_all$result$term_size >= 5 & gost_all$result$term_size <= 500, ]

# Create a data frame summarizing the ORA results
gost_summary <- data.frame(
  Category = c("Upregulated Genes", "Downregulated Genes", "All Genes"),
  Gene_Sets = c(nrow(gost_up_filtered), nrow(gost_down_filtered), nrow(gost_all_filtered))
)

knitr::kable(gost_summary, 
             caption = "Table 2: Summary of Gene Set Enrichment Analysis Results for MDO After Filtering",
             align = 'c', 
             col.names = c("Category", "Number of Gene Sets"))

To Visualize Top Regulated Genes:

knitr::kable(gost_up_filtered[1:10, c("term_size", "term_name")], 
      row.names = FALSE, 
      caption = "**Table 3: Top significant gene sets for upregulated genes for the MDO Group**")

knitr::kable(gost_all_filtered[1:10, c("term_size", "term_name")], 
      row.names = FALSE, 
      caption = "**Table 5: Top significant gene sets for all genes for the MDO Group**")

Likewise for the PDO so it will not be repeated here.

Interpretation

1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

The Original Paper mentioned that they founds overexpressed genes (TFPI2, IL1B, SAA1, and SAA2) and underexpressed genes (BTG2 and FGFBP1) in metastases. Thus, the ORA results do provide support to some extent:

Upregulated Genes: The significant enrichment in terms related to "CD8-positive, alpha-beta T cell activation" and various forms of T cell proliferation (e.g., "alpha-beta T cell proliferation") suggests an active immune response or an immune-related mechanism in the MDO context. The mention of genes like TFPI2 and IL1B, which are known for their roles in inflammation and immune responses, correlates well with these enriched immune-related pathways. TFPI2 is known for its anti-inflammatory properties, while IL1B is a critical cytokine in the immune response.

The presence of pathways such as "amyloid-beta clearance" might indicate additional cellular functions that are upregulated in metastases, possibly related to cellular stress responses or other neuroinflammatory processes.

Downregulated Genes: The downregulation of pathways like "regulation of DNA repair" and "regulation of MAP kinase activity" might reflect changes in cell cycle control and DNA damage response, supporting the downregulation of genes like BTG2 which is known for its role in cell cycle regulation and tumor suppression. The "regulation of cellular response to stress" and "regulation of cell cycle G1/S phase transition" being affected aligns with the downregulation of FGFBP1, a gene implicated in cell proliferation and extracellular matrix interaction during cancer progression.

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

Studies have shown that CD8+ T cells play a crucial role in the immune surveillance of tumors and can be activated in response to tumor antigens[@GattiMays2021real]. The enrichment of T cell activation pathways suggests a possible immune evasion mechanism or a heightened immune response in the metastatic niche that could be related to the overexpression of inflammatory genes like IL1B.

Research indicates that disruptions in DNA repair mechanisms are a hallmark of many cancers, leading to genomic instability and tumor progression [@hamidi2018every]. The downregulation of these pathways in MDOs might reflect a compromised DNA repair capacity typical of more aggressive or advanced-stage cancers, supporting the observed downregulation of BTG2.

The MAP kinase pathway is integral to cell growth and differentiation signals. Its dysregulation is linked to various cancers (Source: Cancer Discovery, 2015, 5(7), 730–743. DOI: 10.1158/2159-8290.CD-14-0736). The downregulation of this pathway could indicate a disruption in these signaling processes, consistent with the cancer progression and metastatic behavior noted in the study.

Assignment 2 - bcb420-2025/Keren_Zhang GitHub Wiki

Table of Contents

Time

Data Import

Download

Load Normalized Data

Differential Gene Expression

Limma MDS

Model Setup

DGE - EdgeR

1.1 Calculate p-values for each of the genes in your expression set.

1.2 How many genes were significantly differentially expressed?

1.3 What thresholds did you use and why

2.1 Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why?

2.2 How many genes passed correction?

3.1 Show the amount of differentially expressed genes using an MA Plot

3.2 Show the amount of differentially expressed genes using an Volcano Plot

4.1 Visualize your top hits using a heatmap.

4.2 Do your conditions cluster together? Explain why or why not.

Thresholded over-representation analysis

1.1 Which method did you choose and why?

2.1 What annotation data did you use and why?

2.2 What version of the annotation are you using?

3.1 How many genesets were returned with what thresholds?

4.1 Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

Interpretation

1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

⚠️ GitHub.com Fallback ⚠️

Assignment 2 - bcb420-2025/Keren_Zhang GitHub Wiki

Table of Contents

Time

Data Import

Download

Load Normalized Data

Differential Gene Expression

Limma MDS

Model Setup

DGE - EdgeR

1.1 Calculate p-values for each of the genes in your expression set.

1.2 How many genes were significantly differentially expressed?

1.3 What thresholds did you use and why

2.1 Multiple hypothesis testing - correct your p-values using a multiple hypothesis correction method. Which method did you use? And Why?

2.2 How many genes passed correction?

3.1 Show the amount of differentially expressed genes using an MA Plot

3.2 Show the amount of differentially expressed genes using an Volcano Plot

4.1 Visualize your top hits using a heatmap.

4.2 Do your conditions cluster together? Explain why or why not.

Thresholded over-representation analysis

1.1 Which method did you choose and why?

2.1 What annotation data did you use and why?

2.2 What version of the annotation are you using?

3.1 How many genesets were returned with what thresholds?

4.1 Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?

Interpretation

1. Do the over-representation results support conclusions or mechanism discussed in the original paper?

2. Can you find evidence, i.e. publications, to support some of the results that you see. How does this evidence support your results.

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️