Assignment 1 - bcb420-2025/Keren_Zhang GitHub Wiki

Table of Contents

Goal

Setting up account and Docker for future tasks.

Time

Date: Feb 7th -Feb 11th, 2025

Estimated Time: 20 hour

Time Taken: 35 hours

Data Selection

Selection Criteria

  • Must be RNASeq data
  • Must be associated with a publication
  • Must be able to be retrieved computationally
  • Must have good coverage
  • Must not be older than ten years (quality!)
  • Must have sufficient numbers of replicates
  • Must be collected under interesting conditions
  • Contains at least 2 classes, or conditions to enable differential expression calculation
  • Must be mapped to unique human gene identifiers

Search Parameters

The topic that I am interested in is cancer, specifically, Non-small cell lung cancer (NSCLC). There are multiple different types of NSCLC, but I decided to focus on Lung Adenocarcinoma. Using suggestion from the lecture, I searched the Gene Expression Omnibus (GEO) for related datasets with the following search details:

(("adenocarcinoma of lung"[MeSH Terms] OR Lung Adenocarcinoma[All Fields]) AND ("rna"[MeSH Terms] OR RNA[All Fields])) AND ("Homo sapiens"[Organism] AND "Expression profiling by high throughput sequencing"[Filter] AND "published last year"[Filter] AND "TXT"[Supplementary Files])

Chosen Dataset

There were 3 dataset that I was interested in:


Accession GSE287504: Although it is a very interesting study that involves bulk RNA sequencing of sorted B cells from primary lung cancer, it unfortunately does not have a related publication. Thus, it has to be passed.

Accession GSE270310: This study was listed in the search results, but it is not really related to lung cancer.

Accession GSE276387: This study is selected as the final dataset as it successfully meets the above criteria, with the following characteristics:

  • Is RNASeq data and contains raw counts
  • Is associated with a [https://pmc.ncbi.nlm.nih.gov/articles/PMC11513837/#ad93 publication]
  • Raw Counts could be manually downloaded from GEO or retrieved using R package
  • Has good coverage, as describe in "RNA-seq and data analysis" in the Methods section
  • Less than 10 years old: published Oct, 2024
  • Contains 17 Patient Derived Organoid samples, 5 Metastasis Derived Organoid samples, and 4 Human Tumor Samples.
  • Maps to unique human gene identifiers
  • Is homo sapiens
  • Expression profiling by high throughput sequencing (Illumina HiSeq 2500, Illumina NovaSeq 6000)
  • Models Lung Adenocarcinoma Metastases, as per my interest

Docker Modification

To keep the environment consistent and reproducible, I added the following line the to Dockerfile to install the GEOQuery package.

RUN R -e 'BiocManager::install("GEOquery")'

I then rebuilt the docker image using the following command:

docker build -t bcb420:version-2 . 
Docker

When trying to run the docker image with

docker run -e PASSWORD=changeit   -v "$(pwd)":/home/rstudio/projects   -p 8787:8787   bcb420:version-2
, I encountered the following issue related to the port already being used: Error

I had to check docker processes with docker ps and manually stop the previous container with docker stop

Then the problem is fixed.

Data Import

Before the import begins, these are the R packages that I used for this assesment:

Libraries

library(GEOquery)
library(knitr)
library(ggplot2)
library(edgeR)
library(biomaRt)
library(dplyr)
library(tidyr)
library(reshape2)

Download

Then, I used the getGEO function from the GEOquery package to fetch the dataset associated with the provided GEO Series ID. The function is set with GSEMatrix=FALSE to retrieve the full GEO accession object, not just the expression data matrix.

data_set_geoid <- "GSE276387" 

# Obtain Data from GEO
gse <- getGEO(data_set_geoid ,GSEMatrix=FALSE)

#Sumamry of Dataset
gse@header$summary

Platforms Information:

  • Determining the number of platforms associated with the dataset using GPLList(gse) and prints this information.
  • Retrieves the IDs of the first and second platforms and prints their titles, last update dates, and the organisms they are based on.
# Check the number of platforms found
number_of_platforms <- length(names(GPLList(gse)))
print(paste("Number of platforms available:", number_of_platforms))

# Retrieve first platform IDs (GPL) associated with the GSE
gpl_1 <- names(GPLList(gse))[[1]]

print("First Platform:")
#Retrieve detailed information about the platform used in the GEO series
gpl_1_info <- Meta(getGEO(gpl_1))

gpl_1_info$title

gpl_1_info$last_update_date

gpl_1_info$organism


# Retrieve second platform IDs (GPL) associated with the GSE
gpl_2 <- names(GPLList(gse))[[2]]

print("Second Platform:")
#Retrieve detailed information about the platform used in the GEO series
gpl_2_info <- Meta(getGEO(gpl_2))

gpl_2_info$title

gpl_2_info$last_update_date

gpl_2_info$organism

Supplementary File:

  • Lists the names of supplementary files associated with the dataset. It highlights that there is one key supplementary file containing raw counts from an RNASeq experiment.
#get the names of the supplementary files
sfilenames = getGEOSuppFiles(data_set_geoid, fetch_files = FALSE)
sfilenames$fname

Data Assessment

To download the expression dataset:

#Download to current directory
download_dir <- file.path(getwd())

# Check to see if the file exists already before you download them
# Only download files that we don't have from the set of supplementary files

# Identify missing files based on their presence in the download directory
missing_files <- sfilenames$fname[!unlist(
  lapply(sfilenames$fname, function(x) {
    file.exists(file.path(download_dir, data_set_geoid, x))
  })
)]

# Download missing files
if (length(missing_files) > 0) {
  for (i in 1:length(missing_files)) {
    # Get the supplementary files from GEO
    sfiles <- getGEOSuppFiles(data_set_geoid,
                              filter_regex = missing_files[i],
                              baseDir = download_dir,
                              fetch_files = TRUE)
  }
}

Read RNA-seq Data

This script segment is to load RNA sequencing data from a file within a specified directory. The data is expected to be in a table format, typically outputted from RNA sequencing analysis pipelines:

luad_rnaseq_data <- read.table(
  file.path(download_dir, data_set_geoid, data_filename),  # Construct the full path to the file
  header = TRUE,   # Specify that the first row of the file contains column headers
  check.names = TRUE  # Ensure that column names are valid R names and adjust if necessary
)

Then, I showed the dimensions of the loaded data frame, providing a quick overview of the dataset's size. I also utilized the kable function from the knitr package to generate an HTML-formatted table of the first 7 rows and columns of the dataset for a concise preview:

# Display the dimensions of the luad_data data frame to understand its structure
dim(luad_rnaseq_data)

# Generate an HTML-formatted table using 'kable' from the 'knitr' package
# Only look at the first 7 rows and columns
kable(luad_rnaseq_data[1:7,1:7], format = "html")

image

Additional Information:

The data contains 57445 rows genes which sounds like a good length for RNASeq data. There are a total number of 26 samples in the dataset:

  • LUAD refers to the 4 Human Tumor samples
  • PDO without MDO refers to the 17 Patient Derived Organoid samples
  • PDO23_MDO refers to the 5 Metastasis Derived Organoid samples
nrow(luad_rnaseq_data)

# Extract all sample information from the GSE object
list_of_samples <- gse@gsms

# Create a data frame containing the title and characteristics of each sample
samples_type <- do.call(rbind,
    lapply(list_of_samples, function(x) {
        c(x@header$title, x@header$characteristics_ch1)
    })
)

head(samples_type, 3)

# convert to a data frame
sample_type_df <- as.data.frame(samples_type, stringsAsFactors = FALSE)

colnames(sample_type_df) <- c("Sample Type", "Tissue Type", "Cell Line", "Cell Type", "Batch")

sample_type_df[,'Tissue Type'] <- gsub(sample_type_df[,'Tissue Type'],
                                             pattern = "tissue: ",
                                             replacement = "")

sample_type_df[,'Cell Line'] <- gsub(sample_type_df[,'Cell Line'],
                                             pattern = "cell line: ",
                                             replacement = "")

sample_type_df[,'Cell Type'] <- gsub(sample_type_df[,'Cell Type'],
                                             pattern = "cell type: ",
                                             replacement = "")

sample_type_df[,'Batch'] <- gsub(sample_type_df[,'Batch'],
                                             pattern = "batch: ",
                                             replacement = "")

knitr::kable(sample_type_df, format = "html", align = "c")

colnames(luad_rnaseq_data)[1:28]

Data Mapping

Fortunately, for the luad_rnaseq_data, it appears that the SYMBOL column is already mapped to gene symbols, which are often referred to as HUGO Gene Nomenclature Committee (HGNC) symbols.

To verify that this is the case, I double checked the first five rows in the luad_rnaseq_data manually:

# Extract the first three Ensembl IDs from row names
ensembl_ids <- (luad_rnaseq_data$ENSEMBL)[1:5]
# Remove any version numbers from the Ensembl IDs
ensembl_ids <- sub("\\..*$", "", ensembl_ids)

# Connect to the Ensembl database
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Get gene symbols using the cleaned Ensembl IDs
gene_info <- getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol'), 
                   filters = 'ensembl_gene_id', 
                   values = ensembl_ids, 
                   mart = ensembl)

# Print the retrieved gene information
print(gene_info)

image

Comparing this to the 'kable' above, we can see that the genes are the same for first 5 rows, suggesting that they are already mapped.

Data Cleaning

Check if there are expressions not mapped to a SYMBOL:

# Check for empty strings or NA values in the SYMBOL column
missing_symbols <- luad_rnaseq_data[luad_rnaseq_data$SYMBOL == "" | is.na(luad_rnaseq_data$SYMBOL), ]

# Print the rows with missing SYMBOLs
print(missing_symbols)

# Count the number of genes with missing SYMBOLs
num_missing_symbols <- nrow(missing_symbols)
print(paste("Number of genes with missing SYMBOLs:", num_missing_symbols))

Before performing further evolutions, I removed duplicate genes from the dataset, and shorten the dataset to 5 PDOs only.

# Remove duplicates and keep the first occurrence
luad_rnaseq_data_unique <- luad_rnaseq_data %>%
  distinct(SYMBOL, .keep_all = TRUE)

# Display the dimensions to understand the updated structure
dim(luad_rnaseq_data_unique)
columns_to_keep <- c("ENSEMBL", "SYMBOL", "LUAD16T", "LUAD18T", "LUAD1T", "LUAD23T",
                     "PDO1", "PDO2", "PDO4", "PDO5", "PDO6", "PDO23_MDO1_DIAPH", "PDO23_MDO1_Liver", "PDO23_MDO2_Brain", "PDO6_MDO_Brain", "PDO23_MDO3_GB")

# Subset the data frame
luad_rnaseq_subset <- luad_rnaseq_data_unique[, columns_to_keep]

# Check the structure of the new data frame
dim(luad_rnaseq_subset)

Distribution and Density before Cleaning

image image

Outliers Analysis

I used the interquartile range (IQR) to identify and potentially remove outliers. It works by calculating the difference between the 75th (upper quartile) and 25th (lower quartile) percentiles of a data set, measuring the spread of the middle 50% of the data. While not always the best choice for complex statistical analyses, it can be highly beneficial for preliminary analysis to quickly identify outliers and understand the distribution of data.

# Calculate the first and third quantiles (25th and 75th percentiles), and the IQR for each sample (column)
quantile_1 <- apply(luad_data_log2, 2, quantile, 0.25)
quantile_3 <- apply(luad_data_log2, 2, quantile, 0.75)
iqr <- quantile_3 - quantile_1

# Calculate the lower and upper bounds for identifying outliers in each sample
lower_bound <- quantile_1 - (1.5 * iqr)
upper_bound <- quantile_3 + (1.5 * iqr)

# Check if data is within the outlier bounds for each sample
within_range <- luad_data_log2 >= lower_bound & luad_data_log2 <= upper_bound

# Calculate the percentage of data points in each sample that are not outliers
outlier_percent <- (dim(luad_data_log2)[1] - colSums(within_range)) / dim(luad_data_log2)[1]

# Create a data frame with the lower bound, upper bound, and outlier percentage for easier interpretation
outlier_analysis <- data.frame(
  Lower_Bound = lower_bound,
  Upper_Bound = upper_bound,
  Outlier_Percent = outlier_percent
)

# Print the results for review
print(outlier_analysis)


# Convert the row names to a column for ggplot
outlier_analysis$Sample <- rownames(outlier_analysis)

# Create a bar plot for outlier percentages
ggplot(outlier_analysis, aes(x = Sample, y = Outlier_Percent, fill = Outlier_Percent)) +
  geom_bar(stat = "identity") +
  labs(title = "Outlier Percentages by Sample",
       x = "Sample",
       y = "Percentage of Outliers") +
  scale_fill_gradient(low = "blue", high = "red") + # Color gradient from low to high percentage
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

image

Data Normalization

Visualizing Distribution

After Outlier Removal:

image

image

After TMM method Using edgeR

# Extracting the group information for cell type from luad_data_numeric
groups <- as.factor(ifelse(grepl("LUAD", colnames(luad_data_numeric)), "LUAD",
                  ifelse(grepl("^PDO[0-9]+$", colnames(luad_data_numeric)), "PDO",
                         ifelse(grepl("PDO", colnames(luad_data_numeric)), "PDO_MDO", NA))))


# Create DGEList from your data matrix
dge <- DGEList(counts = luad_data_numeric, group=groups)

# Filter out lowly expressed genes
keep <- filterByExpr(dge)
dge <- dge[keep,]

# Calculate normalization factors
dge <- calcNormFactors(dge, method = "TMM")

# Calculate raw CPM
cpm_data <- cpm(dge)


# Viewing the first few rows of the normalized data
head(cpm_data)

# Adding 1 to the counts to avoid log of zero if not already done
normalized_luad_log2 <- log2(cpm_data + 1)

image

image

image

Discussion

Why is the dataset of interest to you?

I chose this dataset because I was always interested in cancer research. Also, I had been working with LUAD relate projects in the past and I want to continue the exploration.

What are the control and test conditions of the dataset?

PDOs and Metastasis-derived Organoids (MDOs)are primarily the test groups. They are derived from patient tumors. The Human Tumor Samples, or, original tumor samples from which PDOs are derived serve as the control.

How many samples in each of the conditions of your dataset?

There are 17 Patient Derived Organoid samples, 5 Metastasis Derived Organoid samples, and 4 Human Tumor Samples. I subsetted the data for preliminary analysis so in my assessment I included 5 Patient Derived Organoid samples, 5 Metastasis Derived Organoid samples, and 4 Human Tumor Samples.

Were there expression values that were not unique for specific genes? How did you handle these?

Upon checking for duplicates based on the SYMBOL column, I identified 2,322 duplicated entries, indicating that some genes had multiple occurrences in the dataset. Using the distinct() function from the dplyr package, the first occurrence of each unique gene symbol is kept. After filtering, the dataset was reduced to 55,401 unique genes.

Were there expression values that could not be mapped to current HUGO symbols?

No, there were no expression values that could not be mapped to current HUGO symbols. The dataset provided by the study already mapped all expression to their symbols and there does not seem to be any empty ones as per my analysis.

Were there any outliers in your dataset? How were they handled in the originating paper? How many outliers were removed?

In the original paper, the authors do not mention whether the outliers were removed, not the way they were handled. I identified outliers sing the IQR method. I applied a masking approach, replacing values outside the acceptable range with NA. After filtering, the mean and median expression levels were recalculated. A total of 36810 outlier feilds are removed.

How did you handle replicates?

There are no technical replicates are all samples are the same run. However, there are multiple biological replicates in each conditions. Biological replicates were grouped together for analysis using edgeR.

What is the final coverage of your dataset?

The final coverage is 17195 genes over 14 samples.

⚠️ **GitHub.com Fallback** ⚠️