Entry 12: Assignment 2 Workflow - bcb420-2025/Izumi_Ando GitHub Wiki

Overview

Estimate of total amount of time this assignment is going to take⏰: 12 hours - actual time 14 hours 40 mins
This assignment had less code-related difficulties as ORA was done on the g:profiler web app, and I got used to using R Markdown! 🎉

Day 1 - March 9th, 2025

Objectives

Create the outline of the final submission document ✅
Load in the data from Assignment 1 ✅ - 30 mins
Finish first run of DE section (all the computation minus the interpretation & writing & plot organizing) ✅ - 1.75 hours

Estimate time to completion: 3 hours - took 2.25 hours

Notes on the environment

Everything was created using the BCB420 docker environment

Loading in data from A1

took necessary code blocks from A1 submission file
noticed a typo in the code block looking for replicate data (using the length function instead of nrow) which was displaying wrong data so fixed that. (below)

# checking for rows with the same read values by viewing data from both ends
duplicates <- duplicated(filtered_panc1_data) | duplicated(filtered_panc1_data, 
                                                           fromLast = TRUE)
num_dups <- nrow(filtered_panc1_data[duplicates, ]) # this was previously using length()

# checking for rows with the same gene names
# geneCol is a list of HUGO symbols from the previous section
num_dup_genes <- length(unique(geneCol)) - length(geneCol)

cat("There are", num_dups, "rows with the same read values and", num_dup_genes, 
    "rows with the same gene name.")

I will still need to add text to explain each step, in less detail compared to A1 obviously but with some plots and explanations as to what is going on

Differential Expression Analysis

my model only includes 1 factor because there is only one variable in my dataset (siCtrl and siSF3B1)
I referred to Lecture 7 Differential Analysis to use edgeR's tools
It took me some time to figure out the difference between the group, coef, and design arguments used in edgeR's function but I realized that my variables from A1 were referring to slightly different things than the lecture example as I had modified the data to better fit my use case. I realized that group referred to the actual groupings, design referred to the model, and coef should refer to the column name of the factor in the model.

samples_type_mod <- samples_type[-c(1, 2),]
model_design <- model.matrix(~samples_type_mod$`siRNA Treatment`)
data <- edgeR::DGEList(counts=normalized_counts, 
                       group=samples_type_mod$`siRNA Treatment`)
normalized_counts <- edgeR::estimateDisp(data, design=model_design)
fit <- edgeR::glmQLFit(data, design=model_design)
qlf.siCtrl_vs_siSF3B1 <- edgeR::glmQLFTest(fit, coef="samples_type_mod$`siRNA Treatment`siSF3B1")

Interestingly, my hit rate for the p-value threshold and correction threshold were very high. Will need to do some analysis to figure out why my hit rate is so much higher than the lecture example.

nrow(normalized_counts) # returned 12909
length(which(qlf_output_hits$table$PValue < 0.05)) # returned 8710
length(which(qlf_output_hits$table$FDR < 0.05)) # returned 8293

Inital Heatmap First try of the heatmap

Day 2 - March 11th, 2025

Objectives

Do the ORA journal assignment ✅ - expected 1 hour - actual 50 mins
Do the ORA for A2 - expected 2 hours - moved to next day
Examine the DE data with volcano plot ✅ - expected 30 mins - actual 40 mins
Figure out what thresholds to use ✅ - expected 20 mins - actual 1 hour
Write out Data overview text ✅ - expected 1 hour - actual 2 hours

Notes

decided to put all package installation and loading in the first section to make it easier to track

if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("GEOquery", quietly = TRUE)) {
  BiocManager::install("GEOquery")
}
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
} # readxl included in tidyverse install
if (!require("knitr", quietly = TRUE)) {
  install.packages("knitr")
}
if (!requireNamespace("edgeR", quietly = TRUE)) {
  install.packages("edgeR")
}
if (!requireNamespace("limma", quietly = TRUE)) {
  BioManager::install("limma")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
  BioManager::install("ComplexHeatmap")
}
if (!requireNamespace("circlize", quietly = TRUE)) {
  install.packages("circlize")
}

# 1 - DATASET OVERVIEW
library("GEOquery") # to download dataset
library("tidyverse") # to use readxl
library("readxl")  # to read the datafile
library("knitr") # to produce visually cleaner tables
library("edgeR") # for TMM normalization & differential expression analysis
library("limma") # to generate the MDS plot

# 2 - DIFFERENTIAL GENE EXPRESSION ANALYSIS
library("ggplot2") # to create volcano plot
library("ComplexHeatmap") # to create heatmaps
library("circlize") # to adjust heatmaps

ORA journal assignment: Entry-14%3A-G%3AProfiler-Assignment - realized that I should use webtool instead of the R version
no issues with creating volcano plot. learned that by coloring the significant genes, it is a good way to visually examine the affect of changing the corrected p-value thresholds & FDR thresholds when creating the thresholded lists.
Volcano plot from report
determined thresholds by examining volcano plots, creating a table with the number of genes that remain with different thresholds AND by checking the results given by g:profiler. turns out thousands of genes have a corrected p-value under 0.005 ➡ decided on 0.001 FDR and 1.25 log2FC
although there were no technical issues moving data from A1 to A2, writing out a condensed version of A1 took a lot more time than expected.
🚨 emergency came up mid-day so decided to take the late penalty to finish assignment

Day 3 - March 12th, 2025

Objectives

Do the ORA for A2 ✅ - expected 2 hours - 1.75 hours

Notes

selected go:BP for its coverage, Reactome because it is well maintained, and KEGG because it has an emphasis for disease pathways which is relevant as the original study focused on cancer
realized that unlike the results in te g:profiler journal assignment, changing the geneset threshold did not do much in terms of returning different results as the top terms were not as broad

Day 4 - March 13th, 2025

Objectives

Add in plots from A1 ✅ - expected 20 mins - actual 40 mins
Add in the figure legends ✅ - expected 2 hours - actual 2.5 hours
Add in the text for the final analysis & conclusions ✅ - expected 2 hours - actual 2 hours

Notes

changed some of the labeling for A1 plots
interesting to find that there were no significant biological pathways that were upregulated but very significantly downregulated pathways (DNA repair and replication)
given how the upregulated genes did not seem to be significantly correlated with a biological pathway, the combined geneset returned near identical results as the down regulated gene query
also interesting to note that the although the original publication mentions that SF3B1 has the important role of enhancing HIF1-signaling which controls genes in pathways related to metabolism (glycolysis) and apoptosis, neither of those were observed in the ORA results. not sure what this means or why just yet.