Assignment 1 Journal - bcb420-2023/Angela_Uzelac GitHub Wiki

Beginning workflow

  • open PowerShell
  • run docker
    • bonus: make an easier script to run docker
  • go to localhost:8787
  • make an R script called exploreGEOdatasets
  • write the commands to install GEOmetadb (code below)
  • get the metadata

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
if (!requireNamespace("GEOmetadb", quietly = TRUE))
  BiocManager::install("GEOmetadb")

if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()

trying URL 'http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz'
Warning in download.file(url_geo, destfile = localfile, mode = "wb") :
  cannot open URL 'http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz': HTTP status was '404 Not Found'
Error in download.file(url_geo, destfile = localfile, mode = "wb") :
  cannot open URL 'http://starbuck1.s3.amazonaws.com/sradb/GEOmetadb.sqlite.gz'

I need R version 4.2.2 but have 4.2.1; I don't think this is the problem. I also need GEOmetadb version 1.6 and have version 1.56. To bypass this faulty URL, I went to https://gbnci.cancer.gov/geo/GEOmetadb.sqlite.gz, which downloaded the zipped file. Now put it in the working directory and unzip it.

Set the working directory to projects; then "./" in a path refers to that current working directory (projects itself).
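A quick sanity check of that path behaviour (the file checked here is just the one downloaded above):

getwd()                                 # should show the projects directory
list.files("./")                        # "./" points at the current working directory
file.exists("./GEOmetadb.sqlite.gz")    # the manually downloaded, still-zipped metadata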

Pick dataset

GEOquery::gunzip("./GEOmetadb.sqlite.gz", destname = "./GEOmetadb.sqlite")

gpl = GEO platform

goal: let's look for datasets that are:

  • RNASeq data
  • recent - within the last few years
  • human
  • minimum number of samples

try: an SQL statement to get datasets that are RNASeq, human, within 5 years, related to ovarian (change this), and whose supplementary file is counts:

sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             "  gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             "  JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             "  gse.submission_date > '2017-01-01' AND",   # change date
             "  gse.title LIKE '%ovarian%' AND",
             "  gpl.organism LIKE '%Homo sapiens%' AND",
             "  gpl.technology LIKE '%high-throughput sequencing%' ",
             "  ORDER BY gse.submission_date DESC", sep = " ")

get the data by passing the query:

rs <- dbGetQuery(con, sql)
dim(rs)

Final code for searching for dataset

if (!requireNamespace("BiocManager", quietly = TRUE)){
  install.packages("BiocManager")
}
if (!requireNamespace("GEOmetadb", quietly = TRUE)){
  BiocManager::install("GEOmetadb")
}

GEOquery::gunzip("./GEOmetadb.sqlite.gz", destname = "./GEOmetadb.sqlite")
if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
file.info('GEOmetadb.sqlite')

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "GEOmetadb.sqlite")

(geo_tables <- dbListTables(con))

dbListFields(con,'gse')

# change what the dataset is related to (gse.title)
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             "  gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             "  JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             "  gse.submission_date > '2018-01-01' AND",
             "  gse.title LIKE '%schizophrenia%' AND",
             "  gpl.organism LIKE '%Homo sapiens%' AND",
             "  gpl.technology LIKE '%high-throughput sequencing%' ",
             "  ORDER BY gse.submission_date DESC", sep = " ")

rs <- dbGetQuery(con,sql)

counts_files <- rs$supplementary_file[grep(rs$supplementary_file, pattern = "count|cnt",ignore.case = TRUE)]

series_of_interest <- rs$gse[grep(rs$supplementary_file, pattern = "count|cnt",ignore.case = TRUE)]

rs <- rs[rs$gse %in% series_of_interest, ]

shortened_filenames <- unlist(lapply(counts_files, FUN = function(x){
  x <- unlist(strsplit(x, ";"))
  x <- x[grep(x, pattern = "count|cnt", ignore.case = TRUE)]
  tail(unlist(strsplit(x, "/")), n = 1)
}))

get the data (change the GSE number):

sfiles = getGEOSuppFiles('GSE153970')
fnames = rownames(sfiles)
b2 = read.delim(fnames[1], header = TRUE)
head(b2)

Note: I manually downloaded the supplementary file and unzipped it in the code instead of using the code from class, because the class code didn't work.
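A minimal sketch of that manual approach, assuming the supplementary counts file was downloaded by hand into the working directory (the file name here is a placeholder, not the real one):

GEOquery::gunzip("./GSE112523_counts.txt.gz", destname = "./GSE112523_counts.txt")  # placeholder file name
expr <- read.delim("./GSE112523_counts.txt", header = TRUE, check.names = FALSE)
head(expr)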

General Information

my dataset: GSE112523
associated paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6499808/

How is the data from the dataset related to the article? "Raw and processed data for data generated in this work have been deposited at the Gene Expression Omnibus under the SuperSeries accession number GSE112525. These include subseries for human DNA methylation arrays (GSE112179), RNA-sequencing (GSE112523), bisulfite targeted sequencing (GSE112524), and genotyping arrays (GSE113093), and transcriptome profiling of mouse brains (GSE120423)." The RNA-sequencing data (GSE112523) is what I use.

requirement: the experiments should be performed with biological replicates (the more the better). How do I know that my experiment has a good number of replicates? It has 34 samples: 17 control, 10 bipolar disorder, 7 schizophrenia. The smallest group has 7 replicates, and in lecture we were told at least 6 replicates in total, so I'm good.

I grouped my samples by patient and disease, kind of like in class, but before doing this, first check whether there is any metadata.
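A minimal sketch of that grouping, assuming a counts data frame whose first column is gene IDs and whose sample column names look like patient_disease (the counts object name and the "_" separator are assumptions):

samples <- data.frame(lapply(colnames(counts)[-1], FUN = function(x){
  unlist(strsplit(x, split = "_"))     # split each sample name into its parts
}))
colnames(samples) <- colnames(counts)[-1]
rownames(samples) <- c("patient", "disease")
samples <- data.frame(t(samples))

table(samples$disease)   # replicates per group; should match the 17/10/7 above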

next steps:

  • normalize
  • plot MDS (include in report)
  • BCV plot?
  • map IDs: hgnc_symbol and ensembl_gene_id in biomaRt (as attributes to return)
  • be careful: I might have ensembl_transcript_id instead of ensembl_gene_id, maybe even version numbers (as filter) - NO, they are ENSG

Converting IDs

What are we trying to convert: human Ensembl gene IDs to HGNC symbols. A few things we need:

  1. attributes - the type of identifiers that you want to retrieve. So in our use case this is Ensembl gene Ids and the HGNC symbols - What would happen if we just specified the HGNC symbols here?
  2. filters - how you are filtering your results. Not intuitive. For our use case this is Ensembl gene ids
  3. values - the values associated with the filters. So in our use case this is the set of Ensembl gene ids that we have

conversion_stash: stores the conversion the first time through; if we already have the conversion saved, we don't need to go through Ensembl to get it again (a sketch is below).
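A minimal sketch of the conversion with biomaRt, assuming normalized_counts has Ensembl gene IDs as row names; the .rds stash file name and the caching style are assumptions:

library(biomaRt)

conversion_stash <- "ca125_id_conversion.rds"
if (file.exists(conversion_stash)) {
  ca125_id_conversion <- readRDS(conversion_stash)       # reuse the saved conversion
} else {
  ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
  ca125_id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                               filters = "ensembl_gene_id",
                               values = rownames(normalized_counts),
                               mart = ensembl)
  saveRDS(ca125_id_conversion, conversion_stash)          # stash for next time
}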

check how we did - we don't just want to check the difference in the number of IDs, because that doesn't mean we are missing only that many. Merge the conversion (2 columns) with the normalized counts: merge the first column of the ca125_id_conversion data with the row names of normalized_counts, and include everything in y (we're adding the hgnc_symbol column).

normalized_counts_annot <- merge(ca125_id_conversion,normalized_counts, by.x = 1, by.y = 0, all.y = TRUE)

want to see how many rows are missing symbols:

ensembl_id_missing_gene <- normalized_counts_annot$ensembl_gene_id[
  which(is.na(normalized_counts_annot$hgnc_symbol))]
length(ensembl_id_missing_gene)

If only a small percentage is missing then it doesn't matter for the analysis; if a lot is missing then I have to investigate.

Associated paper

this data: looked at differences in gene counts (due to DNA methylation) in control vs bipolar disorder vs schizophrenia(?). They normalized with TMM, and determined the consequences of altered DNA methylation in major psychosis by profiling transcriptomes, by RNA sequencing, in a randomly selected subset of the same samples.

"In this work, we perform a genome-wide comparison of DNA methylation in isolated neurons from the frontal cortex of individuals with schizophrenia and bipolar disorder, to those in undiagnosed individuals." So comparing them together against normal really looking at increased dopamine causing major psychosis

Normalization

The total RNA production, Sk, cannot be estimated directly, since we do not know the expression levels and true lengths of every gene. However, the relative RNA production of two samples, fk = Sk/Sk', essentially a global fold change, can more easily be determined. We propose an empirical strategy that equates the overall expression levels of genes between samples under the assumption that the majority of them are not DE. One simple yet robust way to estimate the ratio of RNA production uses a weighted trimmed mean of the log expression ratios (trimmed mean of M values (TMM))
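A toy illustration of the M (log-ratio) and A (log-average) values that TMM trims on, for one gene in sample k versus a reference sample r, following the definitions in the TMM paper; the counts and library sizes here are made up:

Y_gk <- 50;  N_k <- 1e6   # count for gene g and library size in sample k
Y_gr <- 80;  N_r <- 2e6   # count for gene g and library size in reference r

M_g <- log2((Y_gk / N_k) / (Y_gr / N_r))        # log fold change (M value)
A_g <- 0.5 * log2((Y_gk / N_k) * (Y_gr / N_r))  # average log expression (A value)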

The Trimmed Mean of the M-values (TMM) [7] approach is to choose a sample as a reference sample, then calculate fold changes and absolute expression levels relative to that sample. The genes are trimmed twice by these two values, to remove DE genes, then the trimmed mean of the fold changes is found for each sample. Read counts are scaled by this trimmed mean and the total count of their sample.
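A minimal sketch of TMM normalization with edgeR (plus the MDS and BCV plots from my next-steps list), assuming filtered_counts is a numeric count matrix and samples$disease holds the group labels (both object names are assumptions):

library(edgeR)

d <- DGEList(counts = filtered_counts, group = samples$disease)
d <- calcNormFactors(d)          # TMM is the default normalization method
normalized_counts <- cpm(d)      # counts per million after TMM scaling

plotMDS(d, col = c("darkgreen", "blue", "red")[factor(samples$disease)])  # MDS plot

design <- model.matrix(~ samples$disease)
d <- estimateDisp(d, design)
plotBCV(d)                       # biological coefficient of variation plot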

The technical variation affected the control and disease samples in the same way. This generates an MA plot: most genes are clustered on the mean line, and then a few genes that show differential expression sit off the mean line.
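A quick way to get that kind of MA-style view is the mean-difference plot from edgeR/limma, assuming d is the DGEList from the normalization sketch above:

plotMD(d, column = 1)      # sample 1 vs the average of the other samples
abline(h = 0, col = "red") # the mean line most genes should cluster around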

R Markdown

echo = FALSE so that the code doesn't show up, only the result. eval = FALSE so that the result doesn't show up, only the code (set in the chunk header, e.g. {r eval = FALSE}). Only use `r x` for inline code.
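A minimal R Markdown sketch of those options (the chunk contents are made up):

```{r, echo = FALSE}
x <- 2 + 2        # code hidden in the report, result shown
```

```{r, eval = FALSE}
plot(x)           # code shown in the report, never run
```

The answer is `r x`.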

How to submit

  1. go to the home page of my GitHub repository
  2. click the "Add file" drop-down
  3. select "Upload files"
  4. navigate to the location where the assignment R Notebook is found on my local machine
  5. choose the assignment .html file
  6. go to where that file now lives in the GitHub repository and copy its path (URL)
  7. paste that URL into the student wiki in the same row as my name