Entry 9: Assignment 1 Workflow - bcb420-2025/Izumi_Ando GitHub Wiki

Notes Overall

⏰ - expected-time:actual-time 10:17.75 hours

  • This assignment took much longer than expected, but much of that time went into understanding what I had to do.
  • I did not run into many errors, so this journal entry is fairly short. I think this is because most of the plotting and processing reused the code provided in class, and the errors that code produced were easy to fix.
  • I did the whole build / analysis process in the BCB420 Docker container.

Day 1 - Feb 7th, 2025

Approach: just start the assignment and see how far you get, to gauge how much time things take and which parts you get stuck on.

⏰ : 23:15~24:16

Workflow

Read through the assignment instructions, reviewed week 4 "Get the Data" lecture and got to the point of downloading the data.

# within the BCB420 docker container

installed.packages()
# installing GEOquery bc it was not installed
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GEOquery")

library("GEOquery")

# getting the GEO description
geoID <- "GSE201427"
# figure out a way to access the cache data if available 
gse <- GEOquery::getGEO(geoID, GSEMatrix=FALSE)
# gse@header$summary
suppFiles <- GEOquery::getGEOSuppFiles(geoID, fetch_files=FALSE)
# we can see that there is only one file available (from the env tab)
dataFile <- suppFiles$fname[1]

# only downloading dataset if it is not available
dir <- file.path(getwd())
dataFilePath <- file.path(dir, geoID, dataFile)
if(!file.exists(dataFilePath)){
  dataFileDownload <- GEOquery::getGEOSuppFiles(geoID, 
                                                filter_regex = dataFile, 
                                                baseDir = dir, 
                                                fetch_files = TRUE)
}

# reading in the data
panc1Data <- read.table(dataFilePath, header = TRUE, check.names = TRUE)
dim(panc1Data)

Next Steps

  • identify the information you need from the publication (probably want a figure if possible)
  • read the associated publication in detail -> write a rationale for the experiment you want to focus on
  • make a summary of the data using grave accents (backticks)
  • clean or remove outliers, make sure to clarify your rationale for this step as well
  • review lecture on normalization, decide which method to use
  • normalize data
  • put together rough draft of the report
  • make a list of next to do's

Day 2 - Feb 8th, 2025

⏰ : 11:00-12:15

Summary of Activities

  • fixed data read in error
  • clarified next steps

Notes

  • Realized that the data had not loaded properly: the table was full of weird characters
  • Learned that xlsx files are not plain text (they are zipped XML), so we have to use a dedicated package to load the data instead of read.table
# install.packages("tidyverse") 
library("tidyverse")
# this package is included in tidyverse but needs to be loaded explicitly bc it is not core
library("readxl")

panc1Data2 <- readxl::read_xlsx(dataFilePath) 
  • got a little confused about what I needed to do, so I made a checklist [screenshot of the checklist, 2025-02-11]
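A minimal sketch of how the file type could be checked before choosing a reader, assuming base R only. xlsx files are zip archives, so they start with the two bytes "PK"; a tab-delimited text file does not. The helper name `looks_like_zip` and the two temporary demo files are hypothetical and are not part of the assignment code.

```r
# Hypothetical helper: TRUE if the file starts with the ZIP signature "PK",
# which is how .xlsx (and other zip-based) files begin.
looks_like_zip <- function(path) {
  magic <- readBin(path, what = "raw", n = 2)
  length(magic) == 2 && identical(magic, charToRaw("PK"))
}

# Demo with two throwaway files (NOT the real GEO download).
zip_like <- tempfile(fileext = ".xlsx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04)), zip_like)  # zip header bytes only
text_like <- tempfile(fileext = ".txt")
writeLines(c("gene\tsample1", "A1BG\t2.15"), text_like)

looks_like_zip(zip_like)   # TRUE
looks_like_zip(text_like)  # FALSE
```

In this workflow the same check could be run on dataFilePath: if it returns TRUE, read the file with readxl::read_xlsx, otherwise fall back to read.table.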

Day 3 - Feb 10th, 2025

⏰ : 15:00~17:00

Summary of Activities

  • read through the associated publication
  • plotted out the rainbow density plot
  • made a list-form outline of GSE201427 (not included in journal but included in the final submission)

Notes

Selected Experiment: PANC1 cells treated with control siRNA (control group) and siRNA targeting SF3B1 (to knock down SF3B1).
Other experimental conditions: exposed to hypoxic conditions (1% O2) for 8 hours.

  • The purpose of this experiment was to study the differences in gene expression related to HIF-1 with and without the presence of SF3B1

Notes from Associated Publication
PMID: 36001976

  • SF3B1 is a splicing factor
  • Mutations in SF3B1 frequently occur in multiple types of cancer
  • These mutations can drive tumor progression by activating cryptic splice sites in multiple genes
  • Main Argument: "SF3B1 is a HIF-1 (hypoxia inducible factor) target gene that positively regulates HIF-1 complex pathway activity"
  • Why this is important: this could be a "potential explanation for the link between high SF3B1 expression and aggressiveness of solid tumors"
  • hypoxia : low levels of oxygen in tissue
  • Hypothesis for SF3B1's contribution to tumor progression: Upregulation of wild-type SF3B1 in tumors allows for adaptation to hypoxia. (Because, in the heart, overexpression of SF3B1 is induced by hypoxia and solid cancers are poorly oxygenated.)
  • Quote about my dataset: "Differential RNA-seq in PANC1 cells further enabled us to define a subset of 86 of 192 direct HIF1a target genes (with HIF1a ChIP peaks within a 3 kb distance) that are dependent on SF3B1 for transcriptional upregulation upon hypoxia (Figure 3F)."

Figure 3F from the publication

Questions

  • what is the HIF-1 complex pathway
  • how is the effect of SF3B1 on the HIF-1 pathway related to tumor progression

Trying to plot the rainbow density plot

  • initially copied the code from the lecture notes, but I kept getting NaN values in the output, as shown below
> numeric_data <- panc1Data2_sub[, 3:ncol(panc1Data2_sub)]
> log2(numeric_data)
# A tibble: 15,176 × 6
   `A-2_Panc1_siCtrl_Hx` `A-5_Panc1_siCtrl_Hx`
                   <dbl>                 <dbl>
 1               NaN                    -2.91 
 2                 2.15                  2.15 
 3                 0.716                 0.947
 4               NaN                   NaN    
 5                 2.22                  2.22 
 6               NaN                   NaN    
 7                 1.50                  1.53 
 8                 2.42                  2.59 
 9                 1.71                  1.64 
10                 1.72                  1.63 
  • after tracking down the rows that kept producing NaN values and looking at the original data, I realized that my data was already in log2-CPM format; log2 returned NaN because you cannot take the log of a negative number
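The check described above can be sketched as follows, assuming base R only. Negative values are a hint that a table is already on a log scale, because raw counts and CPMs are never negative; in that case the densities can be computed directly, without a second log2. The helper name `has_negative_values` and the toy values are hypothetical and are not taken from GSE201427.

```r
# Hypothetical helper: TRUE if a numeric table contains any negative value,
# which suggests it is already log-transformed.
has_negative_values <- function(x) {
  any(as.matrix(x) < 0, na.rm = TRUE)
}

# Toy log2-CPM-like values (made up for illustration).
toy <- data.frame(s1 = c(-2.9, 2.15, 0.72, 1.5),
                  s2 = c(-1.3, 2.15, 0.95, 1.53))

has_negative_values(toy)       # TRUE
suppressWarnings(log2(-2.9))   # NaN, which is what showed up in the tibble

# Per-sample densities computed on the values as they are (no extra log2).
densities <- lapply(toy, density)
x_range <- range(sapply(densities, function(d) range(d$x)))
y_range <- range(sapply(densities, function(d) range(d$y)))
cols <- rainbow(length(densities))

# One line per sample, one colour per sample.
plot(densities[[1]], xlim = x_range, ylim = y_range, col = cols[1],
     main = "Density of log2-CPM per sample (toy data)", xlab = "log2-CPM")
for (i in seq_along(densities)[-1]) {
  lines(densities[[i]], col = cols[i])
}
legend("topright", legend = names(densities), col = cols, lty = 1)
```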

Day 4 - Feb 11th, 2025

Literally everything else. This is taking more time than I had hoped.

⏰ : 8:30~20:00 (with maybe 2 hours total of break in between)

Overview of the things that I did today
✅ Created the box plot to assess the original data
✅ Mapped the EntrezIDs to HUGO genes (only to realize that the dataset already had better mappings than the tool could provide)
✅ Filtered out small values using the same threshold as the lecture
✅ De-normalized the data (that was already normalized to log2CPM) by manually finding read depth values for each sample from SRA
✅ Put together bibliography
💬 Tried to straighten everything out...
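A rough sketch of the Entrez ID to HUGO symbol mapping step, assuming the Bioconductor packages AnnotationDbi and org.Hs.eg.db are installed. The wrapper name `map_entrez_to_symbol` is hypothetical, and this is not necessarily the exact call used in the final submission.

```r
# Hypothetical wrapper around AnnotationDbi::mapIds.
# Assumes org.Hs.eg.db and AnnotationDbi are installed from Bioconductor.
map_entrez_to_symbol <- function(entrez_ids) {
  if (!requireNamespace("AnnotationDbi", quietly = TRUE) ||
      !requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
    stop("Install the annotation package first: BiocManager::install('org.Hs.eg.db')")
  }
  AnnotationDbi::mapIds(org.Hs.eg.db::org.Hs.eg.db,
                        keys = as.character(entrez_ids),  # keys must be character
                        column = "SYMBOL",                # HUGO gene symbol
                        keytype = "ENTREZID",
                        multiVals = "first")              # keep one symbol per ID
}

# Only run the example when the annotation package is available.
if (requireNamespace("AnnotationDbi", quietly = TRUE) &&
    requireNamespace("org.Hs.eg.db", quietly = TRUE)) {
  print(map_entrez_to_symbol(c("7157", "23451")))
}
```

The result is a named character vector, with the Entrez IDs as names, so it can be compared against the symbols that already ship with the dataset.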

Some Issues that I Faced

  • Not being able to use biomaRt for the conversions. Couldn't fully decipher the error message, but it seemed like the library for mapping EntrezIDs was too small? -> This was resolved using the org.Hs.eg.db package
  • The dataset was already normalized, but it was in log2-CPM, which is NOT the best input for differential expression analysis (which we will be doing later in the course) -> reversed the log2-CPM calculation to get back to raw counts by manually searching for the read depth of each sample on the SRA (only to find that TMM normalized data looks nearly identical to log2-CPM data). Details on what I did are in the [final submission](https://github.com/bcb420-2025/Izumi_Ando/blob/main/A1_Izumi_Ando/Izumi_Ando.html).
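A minimal sketch of the de-normalization idea, assuming the values were computed as log2(CPM + pseudo) and that the read depth of each sample is known. The helper name `logcpm_to_counts`, the `pseudo` offset, and the toy numbers are all assumptions for illustration; the actual steps and the SRA read depths are in the final submission.

```r
# Hypothetical helper: inverts log2(CPM + pseudo) back to approximate raw counts.
# `pseudo` is an assumption; it has to match whatever offset the dataset used.
logcpm_to_counts <- function(log_cpm, lib_sizes, pseudo = 0) {
  log_cpm <- as.matrix(log_cpm)
  stopifnot(ncol(log_cpm) == length(lib_sizes))
  cpm <- pmax(2^log_cpm - pseudo, 0)             # undo the log2 and the offset
  counts <- sweep(cpm, 2, lib_sizes / 1e6, `*`)  # CPM * (library size / 1e6)
  round(counts)
}

# Round-trip check on toy data (made-up counts and read depths).
raw <- matrix(c(10, 200, 0, 55, 1000, 3), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
lib_sizes <- c(s1 = 2e6, s2 = 5e6)

log_cpm <- log2(sweep(raw, 2, lib_sizes / 1e6, `/`) + 1)  # forward: counts -> log2(CPM + 1)
recovered <- logcpm_to_counts(log_cpm, lib_sizes, pseudo = 1)

all.equal(recovered, raw)  # TRUE
```

The recovered counts are only as good as the read depths and the assumed offset, so rounding will not exactly reproduce the original counts if either one is off.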

Day 5 - Feb 12th, 2025 (extension day)

⏰ : 16:00~19:50

Overview of the things that I did today

  • Went back in and fixed up some formatting
  • Added comments and fixed some spelling errors
  • Made sure the Rmd built properly by opening a new docker container - I actually caught one small error where a package could not be installed from CRAN (resolved by switching to Bioconductor)
  • Added an internal navigation / link system with the following code
<a id="some-id"></a> 
[whatever text](#some-id)