Assignment #1 - Data set selection and initial processing
Date February 13, 2024
These are working notes that are not included in the final submission. For a coherent account of the data, please refer to the notebook linked below.
Data Selection
- I tried to fetch and query the GEO metadata directly in R, but the Bioconductor service for the GEOmetadb package appeared to be unavailable (the request returned a 404).
The following code did not work:
url <- "https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html"
destination <- "GEOmetadb.html"
download.file(url, destfile = destination, method = "auto", quiet = FALSE)
If the connection to GEOmetadb succeeds, the following code can be used, followed by an SQL statement:
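library(GEOmetadb)  # provides getSQLiteFile()
library(DBI)        # provides dbConnect()
library(RSQLite)    # provides SQLite()
# Download the database only if it is not already present locally.
if (!file.exists("GEOmetadb.sqlite")) {
  demo_sqlfile <- getSQLiteFile(destdir = getwd(), destfile = "GEOmetadb.sqlite.gz", type = "normal")
} else {
  demo_sqlfile <- "GEOmetadb.sqlite"
}
con <- dbConnect(SQLite(), demo_sqlfile)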
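The SQL step mentioned above was never reached in these notes, so the following is only a rough sketch of what such a query could look like. It runs against a toy in-memory table with made-up rows rather than the real GEOmetadb.sqlite file. The column names (`gse`, `title`, `submission_date`) are assumed to mirror the `gse` table of GEOmetadb, and the filter values are purely illustrative.

```r
library(DBI)
library(RSQLite)

# Toy in-memory stand-in for GEOmetadb.sqlite (made-up rows, illustration only).
toy_con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(toy_con, "gse", data.frame(
  gse = c("GSE_TOY_1", "GSE_TOY_2"),
  title = c("toy kidney organoid RNA-seq", "toy unrelated series"),
  submission_date = c("2022-03-01", "2015-06-10"),
  stringsAsFactors = FALSE
))

# Select recent series whose title mentions a keyword of interest.
sql <- paste(
  "SELECT gse, title, submission_date",
  "FROM gse",
  "WHERE title LIKE '%kidney%'",
  "AND submission_date > '2019-01-01'"
)
hits <- dbGetQuery(toy_con, sql)
dbDisconnect(toy_con)
```

With a working GEOmetadb connection, the same `dbGetQuery()` call would presumably be issued against `con` instead of `toy_con`.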
Error produced by the download.file() call:
trying URL 'https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html'
Warning: downloaded length 0 != reported length 0
Warning: cannot open URL 'https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html': HTTP status was '404 Not Found'
Error in download.file(url, destfile = destination, method = "auto", quiet = FALSE) :
  cannot open URL 'https://bioconductor.org/packages/release/data/annotation/html/GEOmetadb.html'
Hence an online query was made and used to select the data instead. Please refer to the Journal Data Selection.
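A failed download like the one recorded above stops the script with an error. One possible way to keep a notebook running and fall back to another route is to wrap the call in `tryCatch()`. The helper name `safe_download` is hypothetical and not part of the original notes; this is a minimal sketch using base R only.

```r
# Hypothetical helper: returns TRUE if the download succeeded, FALSE otherwise.
safe_download <- function(url, destfile) {
  tryCatch(
    {
      status <- download.file(url, destfile = destfile, method = "auto", quiet = TRUE)
      status == 0  # download.file() returns 0 on success
    },
    warning = function(w) {
      message("Download warning: ", conditionMessage(w))
      FALSE
    },
    error = function(e) {
      message("Download failed: ", conditionMessage(e))
      FALSE
    }
  )
}

# Example usage (needs network access, so it is left commented out):
# ok <- safe_download(url, destination)
# if (!ok) message("Falling back to the online GEO query.")
```

Treating warnings as failures is a deliberate, conservative choice here, since the log above shows the 404 surfacing as a warning before the final error.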
Data cleaning and HUGO mapping
The following error was encountered:
Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [useast.ensembl.org:443] Operation timed out after 60000 milliseconds with 0 bytes received
The broken connection to the Ensembl site could not be resolved. The code that was attempted is shown below.
ids2convert <- counts_data$Gene.ID
mart = useEnsembl("ENSEMBL_MART_ENSEMBL")
biomart = useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
ensembl <- useEnsembl(biomart, dataset, host = "www.ensembl.org", version = NULL, GRCh = NULL, mirror = NULL, verbose = FALSE)
ensembl <- useDataset("hsapiens_gene_ensembl", mart = ensembl)
# Save the result to a file because the query is computationally intensive.
conversion_stash <- "id_conversion.rds"
if (file.exists(conversion_stash)) {
  id_conversion <- readRDS(conversion_stash)
} else {
  id_conversion <- getBM(attributes = c("ensembl_gene_id", "hgnc_symbol"),
                         filters = c("ensembl_gene_id"),
                         values = ids2convert,
                         mart = ensembl)
  saveRDS(id_conversion, conversion_stash)
}
Error message:
Ensembl site unresponsive, trying asia mirror
Ensembl site unresponsive, trying useast mirror
Error in .chooseEnsemblMirror(mirror = mirror, httr_config = httr_config) :
  Unable to query any Ensembl site
Hence a different R library was used.
The data set already comes with HUGO gene symbols. Please refer to the R Notebook for details.
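For future attempts, the mirror fallback that biomaRt performs internally could also be made explicit, so that each mirror is tried in a chosen order and the first working connection is kept. The helper below is a hypothetical sketch in base R, not the method used in the notebook. The commented usage assumes the `mirror` argument of `biomaRt::useEnsembl()` and still requires at least one Ensembl site to be reachable.

```r
# Hypothetical helper: call attempt() on each candidate in turn and
# return the first result that does not raise an error (NULL if all fail).
first_success <- function(candidates, attempt) {
  for (candidate in candidates) {
    result <- tryCatch(attempt(candidate), error = function(e) NULL)
    if (!is.null(result)) {
      return(result)
    }
  }
  NULL
}

# Possible usage with biomaRt (needs network access, left commented out):
# ensembl <- first_success(
#   c("www", "useast", "uswest", "asia"),
#   function(m) biomaRt::useEnsembl(biomart = "genes",
#                                   dataset = "hsapiens_gene_ensembl",
#                                   mirror = m)
# )
# if (is.null(ensembl)) message("No Ensembl mirror reachable; keep the existing gene labels.")
```

Returning `NULL` instead of stopping lets the calling code decide how to proceed, for example by using the gene labels already present in the data.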
Data Normalization
Please refer to the R Notebook for details.
Link to Assignment 1
The R file: https://github.com/bcb420-2024/Anna_Lai/blob/main/A1_AnnaLai.Rmd
The HTML file: https://github.com/bcb420-2024/Anna_Lai/blob/main/A1_AnnaLai.html
To view rendered HTML: https://html-preview.github.io/?url=https://github.com/bcb420-2024/Anna_Lai/blob/main/A1_AnnaLai.html
Citations
- Dorison A, Ghobrial I, Graham A, Peiris T et al. Kidney Organoids Generated Using an Allelic Series of NPHS2 Point Variants Reveal Distinct Intracellular Podocin Mistrafficking. J Am Soc Nephrol 2023 Jan 1;34(1):88-109. PMID: 36167728