Entry 9: Assignment 1 Workflow - bcb420-2025/Izumi_Ando GitHub Wiki
⏰ - expected-time:actual-time 10:17.75 hours
- This assignment took much longer than expected, but much of that time went into understanding what I had to do
- I did not encounter too many errors, hence this journal entry is quite short. I think this was because a lot of the plotting and processing used the code provided in class, and the errors that came from that code were easy to fix.
- I did the whole building / analysis process in the BCB420 Docker container
Approach: just start the assignment and see how far you get, to gauge how much time things take and which parts you get stuck on.
⏰ : 23:15~24:16
Read through the assignment instructions, reviewed week 4 "Get the Data" lecture and got to the point of downloading the data.
# within the BCB420 docker container
installed.packages()
# installing GEOquery because it was not installed
if (!require("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("GEOquery")
library("GEOquery")
# getting the GEO description
geoID <- "GSE201427"
# figure out a way to access the cache data if available
gse <- GEOquery::getGEO(geoID, GSEMatrix=FALSE)
# gse@header$summary
suppFiles <- GEOquery::getGEOSuppFiles(geoID, fetch_files=FALSE)
# we can see that there is only one file available (from the env tab)
dataFile <- suppFiles$fname[1]
# only downloading dataset if it is not available
dir <- file.path(getwd())
dataFilePath <- file.path(dir, geoID, dataFile)
if (!file.exists(dataFilePath)) {
  dataFileDownload <- GEOquery::getGEOSuppFiles(geoID,
                                                filter_regex = dataFile,
                                                baseDir = dir,
                                                fetch_files = TRUE)
}
# reading in the data
panc1Data <- read.table(dataFilePath, header = TRUE, check.names = TRUE)
dim(panc1Data)
- identify the information you need from the publication (probably want a figure if possible)
- read the associated publication in detail -> write a rationale for the experiment you want to focus on
- make a summary of the data using grave accents
- clean or remove outliers, make sure to clarify your rationale for this step as well
- review lecture on normalization, decide which method to use
- normalize data
- put together rough draft of the report
- make a list of next to do's
⏰ : 11:00-12:15
- fixed data read in error
- clarified next steps
- Realized that the data had not loaded properly and there were weird characters
- Learned that xlsx files are not plain text (they are zipped XML), so we have to use a dedicated package to load the data instead of `read.table`
# install.packages("tidyverse")
library("tidyverse")
# this package is included in tidyverse but needs to be loaded explicitly bc it is not core
library("readxl")
panc1Data2 <- readxl::read_xlsx(dataFilePath)
- got a little confused about what I needed to do so I made a checklist
⏰ : 15:00~17:00
- read through the associated publication
- plotted out the rainbow density plot
- made a list form outline of GSE201427 (not included in journal but included in the final submission)
Selected Experiment:
PANC1 cells treated with control siRNA (control group) and siRNA targeting SF3B1 (to knock down SF3B1).
Other experimental conditions: exposed to hypoxic conditions (1% O2) for 8 hours.
- The purpose of this experiment was to study the differences in gene expression related to HIF-1 with and without the presence of SF3B1
Notes from Associated Publication
PMID: 36001976
- SF3B1 is a splicing factor
- Mutations in SF3B1 frequently occur in multiple types of cancer
- These mutations can drive tumor progression by activating cryptic splice sites in multiple genes
- Main Argument: "SF3B1 is a HIF-1 (hypoxia inducible factor) target gene that positively regulates HIF-1 complex pathway activity"
- Why this is important: this could be a "potential explanation for the link between high SF3B1 expression and aggressiveness of solid tumors"
- hypoxia : low levels of oxygen in tissue
- Hypothesis for SF3B1's contribution to tumor progression: Upregulation of wild-type SF3B1 in tumors allows for adaptation to hypoxia. (Because, in the heart, overexpression of SF3B1 is induced by hypoxia and solid cancers are poorly oxygenated.)
- Quote about my dataset: "Differential RNA-seq in PANC1 cells further enabled us to define a subset of 86 of 192 direct HIF1a target genes (with HIF1a ChIP peaks within a 3 kb distance) that are dependent on SF3B1 for transcriptional upregulation upon hypoxia (Figure 3F)."

Figure 3F from the publication
Questions
- what is the HIF-1 complex pathway
- how is the effect of SF3B1 on the HIF-1 pathway related to tumor progression
Trying to plot the rainbow density plot
- initially copied the code from the lecture notes but I kept on getting the error below
> numeric_data <- panc1Data2_sub[, 3:ncol(panc1Data2_sub)]
> log2(numeric_data)
# A tibble: 15,176 × 6
`A-2_Panc1_siCtrl_Hx` `A-5_Panc1_siCtrl_Hx`
<dbl> <dbl>
1 NaN -2.91
2 2.15 2.15
3 0.716 0.947
4 NaN NaN
5 2.22 2.22
6 NaN NaN
7 1.50 1.53
8 2.42 2.59
9 1.71 1.64
10 1.72 1.63
- after tracking down the rows that kept producing NaN values and looking at the original data, I realized that my data was already in log-CPM format and `log2` was returning NaN because you cannot take the log of a negative number
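The NaN behaviour described above can be reproduced with a small sketch. The values and the helper name `looks_log_transformed` are made up for illustration, and the heuristic (any negative value suggests the data are already on a log scale) is an assumption, not a check taken from the lecture code.

```r
# Toy values standing in for a few log2-CPM entries (hypothetical numbers)
log_cpm <- c(-2.5, 0.8, 4.4, -0.3)

# Taking log2 again turns every negative entry into NaN (R also emits a warning)
relogged <- suppressWarnings(log2(log_cpm))
print(relogged)

# Hypothetical helper: raw counts and CPM are never negative, so any
# negative value suggests the data are already log-transformed
looks_log_transformed <- function(x) {
  any(x < 0, na.rm = TRUE)
}

looks_log_transformed(log_cpm)        # TRUE for the toy values above
looks_log_transformed(c(0, 12, 340))  # FALSE for count-like values
```

Running a check like this before calling `log2` on a new dataset would have flagged the issue before the density plot step.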
Literally everything else. This is taking more time than I had hoped.
⏰ : 8:30~20:00 (with maybe 2 hours total of break in between)
Overview of the things that I did today
✅ Created the box plot to assess the original data
✅ Mapped the EntrezIDs to HUGO genes (only to realize that the dataset already had better mappings than the tool could provide)
✅ Filtered out small values using the same threshold as the lecture
✅ De-normalized the data (that was already normalized to log2CPM) by manually finding read depth values for each sample from SRA
✅ Put together bibliography
💬 Tried to straighten everything out...
Some Issues that I Faced
- Not being able to use `biomaRt` for the conversions. I couldn't fully decipher the error message, but it seemed like the library for mapping EntrezIDs was too small? -> This was resolved using the `org.Hs.eg.db` package.
- The dataset was already normalized, but it was in log2CPM, which is NOT the best for differential expression analysis (which we will be doing later in the course) -> I reversed the calculation that goes from raw counts to log2-CPM by manually searching for the read depth of each sample on the SRA (only to find that the TMM-normalized data looks nearly identical to the log2-CPM data). Details on what I did are in the [final submission](https://github.com/bcb420-2025/Izumi_Ando/blob/main/A1_Izumi_Ando/Izumi_Ando.html).
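The reversal from log2-CPM back to approximate counts can be sketched roughly as below. This is only an illustration under stated assumptions: the matrix values, sample names, read depths, and the function name `delog_cpm` are all hypothetical, and it assumes the values were computed as log2(count / read depth × 1e6) with no pseudocount. If the original pipeline added a pseudocount (as `edgeR::cpm(log = TRUE)` does through its `prior.count` argument), the recovered counts are only approximate. The exact steps used for the assignment are in the final submission.

```r
# Hypothetical log2-CPM matrix: 3 genes x 2 samples (made-up values)
log_cpm <- matrix(
  c(2.15, 0.72, -2.91,
    2.15, 0.95, -1.80),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

# Hypothetical read depths (total reads per sample), e.g. looked up on SRA
read_depth <- c(sample1 = 20e6, sample2 = 25e6)

# Assumes log_cpm = log2(count / read_depth * 1e6) with no pseudocount
delog_cpm <- function(log_cpm, read_depth) {
  stopifnot(identical(colnames(log_cpm), names(read_depth)))
  cpm <- 2^log_cpm
  # sweep() multiplies each column by its own per-sample scaling factor
  sweep(cpm, 2, read_depth / 1e6, `*`)
}

# Round to whole numbers so the result resembles a count matrix
est_counts <- round(delog_cpm(log_cpm, read_depth))
print(est_counts)
```

The rounded matrix can then be passed to count-based tools, keeping in mind that genes with very negative log2-CPM values round to zero.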
⏰ : 16:00~19:50
Overview of the things that I did today
- Went back in and fixed up some formatting
- Added comments and fixed some spelling errors
- Made sure the Rmd built properly by opening a new docker container - I actually caught one small error where a package could not be installed from CRAN (resolved when I switched to Bioconductor)
- Added an internal navigation / link system with the following code
<a id="some-id"></a>
[whatever text](#some-id)