Entry 9: Assignment 1 Workflow - bcb420-2025/Izumi

Notes Overall

⏰ - expected time : actual time = 10 : 17.75 hours

This assignment took a lot more time than expected, but much of that time went into understanding what I had to do.
I did not encounter many errors, so this journal entry is quite short. I think this is because most of the plotting and processing reused the code provided in class, and the errors that code produced were easy to fix.
I did the whole build and analysis process in the BCB420 docker container.

Day 1 - Feb 7th, 2025

Approach: just start the assignment and see how far you get, to gauge how much time things take and which parts you get stuck on.

⏰ : 23:15~24:16

Workflow

Read through the assignment instructions, reviewed week 4 "Get the Data" lecture and got to the point of downloading the data.

# within the BCB420 docker container

# list installed packages to check whether GEOquery is already there
installed.packages()
# installing GEOquery because it was not installed
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GEOquery")

library("GEOquery")

# getting the GEO description
geoID <- "GSE201427"
# figure out a way to access the cache data if available 
gse <- GEOquery::getGEO(geoID, GSEMatrix=FALSE)
# gse@header$summary
suppFiles <- GEOquery::getGEOSuppFiles(geoID, fetch_files=FALSE)
# we can see that there is only one file available (from the env tab)
dataFile <- suppFiles$fname[1]

# only downloading dataset if it is not available
dir <- file.path(getwd())
dataFilePath <- file.path(dir, geoID, dataFile)
if(!file.exists(dataFilePath)){
  dataFileDownload <- GEOquery::getGEOSuppFiles(geoID, 
                                                filter_regex = dataFile, 
                                                baseDir = dir, 
                                                fetch_files = TRUE)
}

# reading in the data
panc1Data <- read.table(dataFilePath, header = TRUE, check.names = TRUE)
dim(panc1Data)

Next Steps

identify the information you need from the publication (probably want a figure if possible)
read the associated publication in detail -> write a rationale for the experiment you want to focus on
make a summary of the data using grave accents
clean or remove outliers, make sure to clarify your rationale for this step as well
review lecture on normalization, decide which method to use
normalize data
put together rough draft of the report
make a list of next to-dos

Day 2 - Feb 8th, 2025

⏰ : 11:00-12:15

Summary of Activities

fixed data read in error
clarified next steps

Notes

Realized that the data had not loaded properly and contained weird characters
Learned that xlsx files are not plain text (they are zipped XML), so read.table cannot parse them and we have to use a dedicated package to load the data instead

# install.packages("tidyverse") 
library("tidyverse")
# readxl is installed with tidyverse but must be loaded explicitly because it is not a core package
library("readxl")

panc1Data2 <- readxl::read_xlsx(dataFilePath)

Got a little confused about what I needed to do, so I made a checklist

Day 3 - Feb 10th, 2025

⏰ : 15:00~17:00

Summary of Activities

read through the associated publication
plotted the rainbow density plot
made a list-form outline of GSE201427 (not included in the journal but included in the final submission)

Notes

Selected Experiment: PANC1 cells treated with control siRNA (control group) and siRNA targeting SF3B1 (to knock down SF3B1).
Other experimental conditions: exposed to hypoxic conditions (1% O2) for 8 hours.

The purpose of this experiment was to study the differences in HIF-1-related gene expression with and without SF3B1.

Notes from Associated Publication
PMID: 36001976

SF3B1 is a splicing factor
Mutations in SF3B1 frequently occur in multiple types of cancer
These mutations can drive tumor progression by activating cryptic splice sites in multiple genes
Main Argument: "SF3B1 is a HIF-1 (hypoxia inducible factor) target gene that positively regulates HIF-1 complex pathway activity"
Why this is important: this could be a "potential explanation for the link between high SF3B1 expression and aggressiveness of solid tumors"
hypoxia : low levels of oxygen in tissue
Hypothesis for SF3B1's contribution to tumor progression: Upregulation of wild-type SF3B1 in tumors allows for adaptation to hypoxia. (Because, in the heart, overexpression of SF3B1 is induced by hypoxia and solid cancers are poorly oxygenated.)
Quote about my dataset: "Differential RNA-seq in PANC1 cells further enabled us to define a subset of 86 of 192 direct HIF1a target genes (with HIF1a ChIP peaks within a 3 kb distance) that are dependent on SF3B1 for transcriptional upregulation upon hypoxia (Figure 3F)."

Figure 3F from the publication

Questions

what is the HIF-1 complex pathway
how is the effect of SF3B1 on the HIF-1 pathway related to tumor progression

Trying to plot the rainbow density plot

Initially copied the code from the lecture notes, but I kept getting NaN values in the output, as shown below

> numeric_data <- panc1Data2_sub[, 3:ncol(panc1Data2_sub)]
> log2(numeric_data)
# A tibble: 15,176 × 6
   `A-2_Panc1_siCtrl_Hx` `A-5_Panc1_siCtrl_Hx`
                   <dbl>                 <dbl>
 1               NaN                    -2.91 
 2                 2.15                  2.15 
 3                 0.716                 0.947
 4               NaN                   NaN    
 5                 2.22                  2.22 
 6               NaN                   NaN    
 7                 1.50                  1.53 
 8                 2.42                  2.59 
 9                 1.71                  1.64 
10                 1.72                  1.63

After tracking down the rows that kept producing NaN values and comparing them against the original data, I realized that my data was already in log-CPM format. log2 was returning NaN because the log of a negative number is undefined.
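A minimal sketch of the check that would have caught this earlier, using made-up values rather than the actual GSE201427 data. The matrix and variable names are hypothetical. Raw counts and CPM values can never be negative, so any negative entry suggests the data is already on a log scale.

```r
# Toy matrix of values that are already on a log2 scale (made-up numbers)
log_cpm <- matrix(c(-2.91, 2.15, 0.716, 4.40, -0.35, 1.50),
                  nrow = 3,
                  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# Negative values are the telltale sign: counts and CPM are never negative
has_negative <- any(log_cpm < 0, na.rm = TRUE)

# log2() of a negative number returns NaN (with a warning), not a hard error
relogged <- suppressWarnings(log2(log_cpm))
n_nan <- sum(is.nan(relogged))
```

Here `has_negative` flags the problem before any transformation, and `n_nan` counts how many entries a second log2 would destroy.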

Day 4 - Feb 11th, 2025

Literally everything else. This is taking more time than I had hoped.

⏰ : 8:30~20:00 (with maybe 2 hours total of break in between)

Overview of the things that I did today
✅ Created the box plot to assess the original data
✅ Mapped the EntrezIDs to HUGO gene symbols (only to realize that the dataset already had better mappings than the tool could provide)
✅ Filtered out small values using the same threshold as the lecture
✅ De-normalized the data (that was already normalized to log2CPM) by manually finding read depth values for each sample from SRA
✅ Put together bibliography
💬 Tried to straighten everything out...
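The filtering step above can be sketched roughly as follows. The lecture's actual threshold is not recorded in this journal, so `cpm_cutoff` and `min_samples` are placeholder assumptions, and the matrix holds made-up values. Because the data is on a log2-CPM scale, the cutoff is converted to log2 before comparing.

```r
# Toy log2-CPM matrix: 4 genes x 4 samples (made-up values)
log_cpm <- matrix(c( 2.1,  2.3,  1.9,  2.0,
                    -3.0, -2.5, -4.1, -3.3,
                     0.5, -0.2,  0.8,  0.4,
                    -1.0,  0.1, -0.5, -2.0),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("gene", 1:4), paste0("s", 1:4)))

cpm_cutoff  <- 1  # assumed cutoff in CPM (placeholder, not the lecture value)
min_samples <- 2  # assumed minimum number of samples passing the cutoff

# Keep genes whose log2-CPM exceeds log2(cutoff) in at least min_samples samples
keep <- rowSums(log_cpm > log2(cpm_cutoff)) >= min_samples
filtered <- log_cpm[keep, , drop = FALSE]
```

With these placeholder settings, a gene survives only if it is expressed above 1 CPM in two or more samples.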

Some Issues that I Faced

Not being able to use biomaRt for the conversions. Couldn't fully decipher the error message, but it seemed like the library for mapping EntrezIDs was too small? -> This was resolved using the org.Hs.eg.db package
The dataset was already normalized, but as log2-CPM, which is NOT the best input for differential expression analysis (which we will be doing later in the course) -> literally reversed the log2-CPM calculation to get back to raw counts, manually looking up the read depth for each sample on the SRA (only to find that the TMM-normalized data looks nearly identical to the log2-CPM data). Details on what I did are in the [final submission](https://github.com/bcb420-2025/Izumi_Ando/blob/main/A1_Izumi_Ando/Izumi_Ando.html).
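A rough sketch of the de-normalization idea, not the exact code from the final submission. It assumes log2-CPM was computed as log2(count / read depth * 1e6) with no pseudocount added. The counts and read depths below are made-up; in the real analysis the read depths were looked up on the SRA.

```r
# Toy raw counts (made-up), genes x samples
counts <- matrix(c(10, 200, 3000, 25, 180, 4100),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# Hypothetical per-sample read depths (stand-ins for the SRA values)
lib_sizes <- c(s1 = 2e6, s2 = 3e6)

# Forward direction: counts -> log2-CPM (assumes no pseudocount)
cpm <- sweep(counts, 2, lib_sizes, "/") * 1e6
log_cpm <- log2(cpm)

# Reverse direction: log2-CPM -> approximate raw counts
recovered <- round(sweep(2^log_cpm, 2, lib_sizes, "*") / 1e6)
```

If the original pipeline added a pseudocount or used effective library sizes, the reversal would need to account for that, so the recovered counts should be treated as an approximation.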

Day 5 - Feb 12th, 2025 (extension day)

⏰ : 16:00~19:50

Overview of the things that I did today

Went back in and fixed up some formatting
Added comments and fixed some spelling errors
Made sure the Rmd built properly by opening a new docker container - I actually caught one small error where a package could not be installed from CRAN (resolved by switching to Bioconductor)
Added an internal navigation / link system with the following code

<a id="some-id"></a> 
[whatever text](#some-id)
