Entry 12: Assignment 2 Workflow - bcb420-2025/Izumi_Ando GitHub Wiki
Overview
- Estimate of total amount of time this assignment is going to takeβ°: 12 hours - actual time 14 hours 40 mins
- This assignment had less code-related difficulties as ORA was done on the
g:profiler
web app, and I got used to using R Markdown! π
Day 1 - March 9th, 2025
Objectives
- Create the outline of the final submission document β
- Load in the data from Assignment 1 β - 30 mins
- Finish first run of DE section (all the computation minus the interpretation & writing & plot organizing) β - 1.75 hours
Estimate time to completion: 3 hours - took 2.25 hours
Notes on the environment
- Everything was created using the BCB420 docker environment
Loading in data from A1
- took necessary code blocks from A1 submission file
- noticed a typo in the code block looking for replicate data (using the
length
function instead ofnrow
) which was displaying wrong data so fixed that. (below)
# checking for rows with the same read values by viewing data from both ends
duplicates <- duplicated(filtered_panc1_data) | duplicated(filtered_panc1_data,
fromLast = TRUE)
num_dups <- nrow(filtered_panc1_data[duplicates, ]) # this was previously using length()
# checking for rows with the same gene names
# geneCol is a list of HUGO symbols from the previous section
num_dup_genes <- length(unique(geneCol)) - length(geneCol)
cat("There are", num_dups, "rows with the same read values and", num_dup_genes,
"rows with the same gene name.")
- I will still need to add text to explain each step, in less detail compared to A1 obviously but with some plots and explanations as to what is going on
Differential Expression Analysis
- my model only includes 1 factor because there is only one variable in my dataset (siCtrl and siSF3B1)
- I referred to Lecture 7 Differential Analysis to use
edgeR
's tools - It took me some time to figure out the difference between the
group
,coef
, anddesign
arguments used inedgeR
's function but I realized that my variables from A1 were referring to slightly different things than the lecture example as I had modified the data to better fit my use case. I realized thatgroup
referred to the actual groupings,design
referred to the model, andcoef
should refer to the column name of the factor in the model.
samples_type_mod <- samples_type[-c(1, 2),]
model_design <- model.matrix(~samples_type_mod$`siRNA Treatment`)
data <- edgeR::DGEList(counts=normalized_counts,
group=samples_type_mod$`siRNA Treatment`)
normalized_counts <- edgeR::estimateDisp(data, design=model_design)
fit <- edgeR::glmQLFit(data, design=model_design)
qlf.siCtrl_vs_siSF3B1 <- edgeR::glmQLFTest(fit, coef="samples_type_mod$`siRNA Treatment`siSF3B1")
- Interestingly, my hit rate for the p-value threshold and correction threshold were very high. Will need to do some analysis to figure out why my hit rate is so much higher than the lecture example.
nrow(normalized_counts) # returned 12909
length(which(qlf_output_hits$table$PValue < 0.05)) # returned 8710
length(which(qlf_output_hits$table$FDR < 0.05)) # returned 8293
First try of the heatmap
Day 2 - March 11th, 2025
Objectives
- Do the ORA journal assignment β - expected 1 hour - actual 50 mins
- Do the ORA for A2 - expected 2 hours - moved to next day
- Examine the DE data with volcano plot β - expected 30 mins - actual 40 mins
- Figure out what thresholds to use β - expected 20 mins - actual 1 hour
- Write out Data overview text β - expected 1 hour - actual 2 hours
Notes
- decided to put all package installation and loading in the first section to make it easier to track
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
if (!requireNamespace("GEOquery", quietly = TRUE)) {
BiocManager::install("GEOquery")
}
if (!requireNamespace("tidyverse", quietly = TRUE)) {
install.packages("tidyverse")
} # readxl included in tidyverse install
if (!require("knitr", quietly = TRUE)) {
install.packages("knitr")
}
if (!requireNamespace("edgeR", quietly = TRUE)) {
install.packages("edgeR")
}
if (!requireNamespace("limma", quietly = TRUE)) {
BioManager::install("limma")
}
if (!requireNamespace("ggplot2", quietly = TRUE)) {
install.packages("ggplot2")
}
if (!requireNamespace("ComplexHeatmap", quietly = TRUE)) {
BioManager::install("ComplexHeatmap")
}
if (!requireNamespace("circlize", quietly = TRUE)) {
install.packages("circlize")
}
# 1 - DATASET OVERVIEW
library("GEOquery") # to download dataset
library("tidyverse") # to use readxl
library("readxl") # to read the datafile
library("knitr") # to produce visually cleaner tables
library("edgeR") # for TMM normalization & differential expression analysis
library("limma") # to generate the MDS plot
# 2 - DIFFERENTIAL GENE EXPRESSION ANALYSIS
library("ggplot2") # to create volcano plot
library("ComplexHeatmap") # to create heatmaps
library("circlize") # to adjust heatmaps
- ORA journal assignment: Entry-14%3A-G%3AProfiler-Assignment - realized that I should use webtool instead of the R version
- no issues with creating volcano plot. learned that by coloring the significant genes, it is a good way to visually examine the affect of changing the corrected p-value thresholds & FDR thresholds when creating the thresholded lists.
Volcano plot from report - determined thresholds by examining volcano plots, creating a table with the number of genes that remain with different thresholds AND by checking the results given by
g:profiler
. turns out thousands of genes have a corrected p-value under 0.005 β‘ decided on 0.001 FDR and 1.25 log2FC - although there were no technical issues moving data from A1 to A2, writing out a condensed version of A1 took a lot more time than expected.
- π¨ emergency came up mid-day so decided to take the late penalty to finish assignment
Day 3 - March 12th, 2025
Objectives
- Do the ORA for A2 β - expected 2 hours - 1.75 hours
Notes
- selected go:BP for its coverage, Reactome because it is well maintained, and KEGG because it has an emphasis for disease pathways which is relevant as the original study focused on cancer
- realized that unlike the results in te
g:profiler
journal assignment, changing the geneset threshold did not do much in terms of returning different results as the top terms were not as broad
Day 4 - March 13th, 2025
Objectives
- Add in plots from A1 β - expected 20 mins - actual 40 mins
- Add in the figure legends β - expected 2 hours - actual 2.5 hours
- Add in the text for the final analysis & conclusions β - expected 2 hours - actual 2 hours
Notes
- changed some of the labeling for A1 plots
- interesting to find that there were no significant biological pathways that were upregulated but very significantly downregulated pathways (DNA repair and replication)
- given how the upregulated genes did not seem to be significantly correlated with a biological pathway, the combined geneset returned near identical results as the down regulated gene query
- also interesting to note that the although the original publication mentions that SF3B1 has the important role of enhancing HIF1-signaling which controls genes in pathways related to metabolism (glycolysis) and apoptosis, neither of those were observed in the ORA results. not sure what this means or why just yet.