Assignment #1 ‐ Data set selection and initial Processing - bcb420-2024/Anna_Lai GitHub Wiki

Data set selection and initial Processing

Date: February 13, 2024

These are working notes that are not included in the final submission. For a coherent account of the data, please refer to the accompanying notebook.

Data Selection

I tried to retrieve and query the GEO metadata directly in R. However, the Bioconductor service for the GEO packages (GEOquery, GEOmetadb) appears to be obsolete, or at least unavailable at the URL I used.

The following code did not work:

url <- "https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html"
destination <- "GEOmetadb.html"

download.file(url, destfile = destination, method = "auto", quiet = FALSE)

If the GEOmetadb database can be downloaded, the following code can be used to connect to it, followed by SQL statements.

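library(GEOmetadb)  # provides getSQLiteFile()
library(RSQLite)    # provides SQLite(); also loads DBI for dbConnect()

# Download the SQLite file only if it is not already in the working directory.
if( !file.exists("GEOmetadb.sqlite") ) {
    demo_sqlfile <- getSQLiteFile(destdir = getwd(), destfile = "GEOmetadb.sqlite.gz", type = "normal")
} else {
    demo_sqlfile <- "GEOmetadb.sqlite"
}

# Open a connection to the local database file.
con <- dbConnect(SQLite(), demo_sqlfile)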
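Since the download never succeeded, here is a minimal sketch of what the "SQL statement" step could look like once a connection exists. It is not run against the real GEOmetadb file. Instead, it builds a small in-memory table named `gse` with three columns (`gse`, `title`, `submission_date`), assumed here to mirror part of the real schema. The rows are placeholders invented for illustration, not actual GEO records.

```r
library(DBI)
library(RSQLite)

# Toy stand-in for the GEOmetadb database: an in-memory SQLite database with
# a cut-down `gse` table. The rows are placeholders, not real GEO records.
con <- dbConnect(SQLite(), ":memory:")
toy_gse <- data.frame(
  gse = c("GSE_A", "GSE_B", "GSE_C"),
  title = c("Kidney organoid RNA-seq", "Mouse liver microarray", "Podocyte RNA-seq"),
  submission_date = c("2022-03-01", "2015-06-10", "2021-11-20"),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "gse", toy_gse)

# Select series submitted since 2020 whose title mentions RNA-seq, newest first.
sql <- paste(
  "SELECT gse, title, submission_date",
  "FROM gse",
  "WHERE submission_date >= '2020-01-01'",
  "AND title LIKE '%RNA-seq%'",
  "ORDER BY submission_date DESC"
)
hits <- dbGetQuery(con, sql)
print(hits)

dbDisconnect(con)
```

With a real GEOmetadb connection, the same `dbGetQuery(con, sql)` call would apply, and `dbListFields(con, "gse")` could be used first to check which columns are actually available.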

Error message produced by the download.file call above:

trying URL 'https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html'
Warning: downloaded length 0 != reported length 0
Warning: cannot open URL 'https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html': HTTP status was '404 Not Found'
Error in download.file(url, destfile = destination, method = "auto", quiet = FALSE) : 
  cannot open URL 'https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html'
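A failed download like this stops the whole script. The sketch below shows one way to guard the call so that the script can report the failure and fall back to something else. `safe_download` is a hypothetical helper written for this note, not part of any package, and the demonstration uses local `file://` URLs so that it runs without a network connection.

```r
# Hypothetical helper: attempt a download and return TRUE/FALSE instead of
# stopping the script when the URL cannot be opened.
safe_download <- function(url, destfile) {
  tryCatch(
    {
      # download.file() returns 0 (invisibly) on success.
      status <- suppressWarnings(
        download.file(url, destfile = destfile, method = "auto", quiet = TRUE)
      )
      status == 0
    },
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      FALSE
    }
  )
}

# Offline demonstration with local file URLs.
src <- tempfile(fileext = ".txt")
writeLines("placeholder content", src)
good_url <- paste0("file://", normalizePath(src))
bad_url <- paste0(good_url, ".does_not_exist")

ok_good <- safe_download(good_url, tempfile())
ok_bad <- safe_download(bad_url, tempfile())
print(c(good = ok_good, bad = ok_bad))
```

Called with the Bioconductor URL from the failing code, the helper should return `FALSE` on a 404 rather than raising an error, at which point the script can switch to the online query.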

Hence, an online query was used instead to select the data. Please refer to the journal entry on Data Selection.

Data cleaning and HUGO mapping

The following error was encountered:

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [useast.ensembl.org:443] Operation timed out after 60000 milliseconds with 0 bytes received

The broken connection to the Ensembl site could not be resolved; see the code and error message below.

library(biomaRt)

# counts_data is the counts table loaded in the notebook.
ids2convert <- counts_data$Gene.ID

# Connect to the Ensembl BioMart and select the human gene dataset.
ensembl <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", host = "https://www.ensembl.org")

ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)

# Save the result to a file because the conversion is computationally intensive.
conversion_stash <- "id_conversion.rds"

if (file.exists(conversion_stash)) {
    id_conversion <- readRDS(conversion_stash)
} else {
    id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                           filters = c("ensembl_gene_id"),
                           values = ids2convert,
                           mart = ensembl)
    saveRDS(id_conversion, conversion_stash)
}

Error message:

Ensembl site unresponsive, trying asia mirror
Ensembl site unresponsive, trying useast mirror
Error in .chooseEnsemblMirror(mirror = mirror, httr_config = httr_config) : 
  Unable to query any Ensembl site
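The message shows that several mirrors were tried before giving up. The sketch below reproduces that fallback pattern explicitly, which can be useful for logging each failure or retrying later. `connect_with_fallback` and `fake_connect` are hypothetical names introduced for this note. The stand-in connector only simulates an outage so that the control flow can be run offline; it does not contact Ensembl.

```r
# Hypothetical helper: call `connect(mirror)` for each mirror in turn and
# return the first result that does not raise an error.
connect_with_fallback <- function(connect, mirrors = c("www", "useast", "asia")) {
  for (m in mirrors) {
    result <- tryCatch(
      connect(m),
      error = function(e) {
        message("Mirror '", m, "' failed: ", conditionMessage(e))
        NULL
      }
    )
    if (!is.null(result)) {
      return(result)
    }
  }
  stop("Unable to connect through any of: ", paste(mirrors, collapse = ", "))
}

# Stand-in connector used only to demonstrate the control flow offline:
# it fails for every mirror except "asia".
fake_connect <- function(mirror) {
  if (mirror != "asia") stop("site unresponsive")
  paste("connected via", mirror)
}

outcome <- connect_with_fallback(fake_connect)
print(outcome)
```

With biomaRt, the connector could be something like `function(m) useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", mirror = m)`. When every mirror is down, as happened here, no amount of retrying helps, and a local annotation source is the more practical route.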

Hence, a different R library was used.

The data is already presented with HUGO gene symbols. Please refer to the details in the R notebook.

Data Normalization

Please refer to the details in the R notebook.

Link to Assignment 1

The R file: https://github.com/bcb420-2024/Anna_Lai/blob/main/A1_AnnaLai.Rmd

The HTML file: https://github.com/bcb420-2024/Anna_Lai/blob/main/A1_AnnaLai.html

To view rendered HTML: https://html-preview.github.io/?url=https://github.com/bcb420-2024/Anna_Lai/blob/main/A1_AnnaLai.html

Citations

Dorison A, Ghobrial I, Graham A, Peiris T et al. Kidney Organoids Generated Using an Allelic Series of NPHS2 Point Variants Reveal Distinct Intracellular Podocin Mistrafficking. J Am Soc Nephrol 2023 Jan 1;34(1):88-109. PMID: 36167728