Assignment 1: Data set selection and initial Processing - bcb420-2023/Metyu_Melkonyan GitHub Wiki
Part 1: Finding the Expression Dataset from GEO
Objective
- To find an RNA expression dataset for Part 1 of Assignment 1
- To become familiar with the RNA expression dataset
- To explore the dataset for non-redundant genes and for genes that may be associated with other cancer types
- To do further research on pancreatic cancer
- To do further research on RNA-seq and other methods for quantifying the expression of gene sets
Duration
Time estimated: 2 hours
Time taken: 3 hours
Date started: 2023-02-02
Completed: 2023-02-02
Microarray explanation: Probes on a chip hybridize to the sample. The chip is then illuminated to measure the fluorescence, and image analysis of that signal gives the expression measurement.
RNA-seq:
- RNA-seq sampling
- RNA extraction and target enrichment
- Fragmentation of the RNA molecules and cDNA library assembly
- Sequencing and FASTQ file generation
- Transcriptome mapping using the sequencing data
- Bioinformatics: differential expression analysis, variant calling analysis, annotation, novel transcript discovery, and RNA editing analysis, using different computational methods
Bulk RNA-seq:
- We are using bulk RNA-seq for this assignment because the data are small in size and come as HTSeq raw counts, which facilitates the normalization process
- The GEO database is used to retrieve the gene expression data, together with GEOmetadb
- SQLite is used to retrieve information from the GEOmetadb database
Conclusion
- The template code was structured based on the query search
- A potential dataset, GSE164730, was found and used
- I became more familiar with the RNA-seq procedure and its methods
- Other research on pancreatic cancer and other cancers is promising
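# Download the GEOmetadb SQLite file only if it is not already present
if (!file.exists('GEOmetadb.sqlite')) {
  GEOmetadb::getSQLiteFile()
}
# Connect to the database and list its tables
con <- DBI::dbConnect(RSQLite::SQLite(), 'GEOmetadb.sqlite')
Geo_tables <- DBI::dbListTables(con)
Geo_tables
# Preview the first rows and columns of the platform (gpl) table
results <- DBI::dbGetQuery(con, 'select * from gpl limit 5')
knitr::kable(head(results[, 1:10]), format = "html")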
# Find human high-throughput sequencing series with "Cancer" in the title,
# submitted after 2014-01-01, newest first
sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             " gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
             " JOIN gpl ON gse_gpl.gpl=gpl.gpl",
             "WHERE",
             " gse.submission_date > '2014-01-01' AND",
             " gse.title LIKE '%Cancer%' AND",
             " gpl.organism LIKE '%Homo sapiens%' AND",
             " gpl.technology LIKE '%High-throughput sequencing%' ",
             " ORDER BY gse.submission_date DESC", sep = " ")
result_query <- DBI::dbGetQuery(con, sql)
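A natural next step after the query is to narrow the results to series that ship a counts file. The sketch below shows one way to do that in base R. It runs on a made-up stand-in for `result_query` (the accessions and URLs are placeholders), and it assumes that multiple supplementary files are separated by `;` and that a counts file has "count" in its name. `count_files` is a hypothetical helper, not part of GEOmetadb.

```r
# Toy stand-in for result_query; these rows are made up for illustration.
toy_query <- data.frame(
  gse = c("GSE000001", "GSE000002", "GSE000003"),
  supplementary_file = c(
    "ftp://example.org/GSE000001/GSE000001_RAW.tar",
    paste0("ftp://example.org/GSE000002/GSE000002_raw_counts.txt.gz;",
           "ftp://example.org/GSE000002/GSE000002_RAW.tar"),
    NA
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the file names (without paths) of the
# supplementary files whose name contains "count".
# Assumption: multiple files are separated by ";".
count_files <- function(supp) {
  if (is.na(supp)) return(character(0))
  files <- trimws(unlist(strsplit(supp, ";", fixed = TRUE)))
  files <- files[grepl("count", files, ignore.case = TRUE)]
  basename(files)
}

hits <- lapply(toy_query$supplementary_file, count_files)
names(hits) <- toy_query$gse

# Keep only the series that have at least one matching file
candidates <- hits[lengths(hits) > 0]
candidates
```

With the real query result, `toy_query` would be replaced by `result_query`, and the candidate series would still need to be checked by hand on GEO.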
Part 2: Normalization & Data Cleaning
Objective
- To clean the gene expression dataset of GSE131222
- To normalize the expression values
- To analyze the normalized values
- To validate that the normalized values make sense
- To match the normalized values against the results of the final concluding analysis (do they make sense?)
Duration
Time estimated: 3 hours
Time taken: 5 hours
Date started: 2023-02-08
Completed: 2023-02-12
Conclusion
- Different data sets were cleaned
- Replicated gene expression rows were eliminated
- Normalization was used to see the difference between unreplicated and replicated values
- The normalized values make sense because they agree with the expected analysis result (the next part validates this)
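The cleaning and normalization steps above can be sketched in base R. This is a simplified illustration on made-up counts, not the code used for the assignment: duplicated gene rows are dropped by keeping the first occurrence (one possible policy), and counts per million (CPM) stands in for whatever normalization method is applied to the real data. Gene names and values are placeholders.

```r
# Toy raw counts; gene names and values are made up for illustration.
counts <- data.frame(
  gene = c("GENE_A", "GENE_B", "GENE_A", "GENE_C"),
  sample1 = c(10, 0, 30, 60),
  sample2 = c(20, 5, 20, 155),
  stringsAsFactors = FALSE
)

# Eliminate replicated gene rows, keeping the first occurrence of each gene.
# (Assumption: summing or averaging the duplicates would be alternatives.)
cleaned <- counts[!duplicated(counts$gene), ]
count_matrix <- as.matrix(cleaned[, c("sample1", "sample2")])
rownames(count_matrix) <- cleaned$gene

# Counts per million: divide each column by its library size, times 1e6.
lib_sizes <- colSums(count_matrix)
cpm <- sweep(count_matrix, 2, lib_sizes, "/") * 1e6

# Sanity checks: library sizes differ before scaling,
# but every sample sums to one million afterwards.
lib_sizes
colSums(cpm)
```

A quick check like `colSums(cpm)` is one way to confirm that the normalized values make sense before moving on to plots.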
Part 3: Interpretation of the Expression Data
Objectives
- To interpret the data and understand what the case study actually investigates
- To use the HUGO symbols provided by the dataset to sort and map the identifiers
- To prevent any inconsistencies in the data that HUGO symbol conversion can introduce
- To find the difference between the normalized and the converted data
- To check whether the HUGO symbols make sense and whether they correlate with the previous symbols
Duration
Time estimated: 4 hours
Time taken: 8 hours
Date started: Unknown (Please time yourself next time!)
Completed: around the same time as the submission
Conclusion
- The data have been mapped to the correct identifiers with normalized values
- Normalized and cleaned data have been visualized using different plots
- The divergence and variance have been calculated and shown
- Important: the HUGO symbols were validated by matching them against the HUGO symbol data, which confirmed the identity of the HUGO symbols in the final dataset
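The symbol validation described above can be sketched as a join between the dataset's symbols and a reference mapping, followed by checks for the usual problems. The example uses base R and placeholder identifiers and symbols. The `reference_map` table is hypothetical and would in practice come from an annotation source such as Ensembl.

```r
# Toy data; identifiers and symbols are placeholders, not real annotations.
expr <- data.frame(
  id = c("ID_1", "ID_2", "ID_3", "ID_4"),
  dataset_symbol = c("SYM_A", "SYM_B", "SYM_C", "SYM_D"),
  stringsAsFactors = FALSE
)
# Hypothetical reference mapping from identifier to HUGO symbol
reference_map <- data.frame(
  id = c("ID_1", "ID_2", "ID_3", "ID_3"),
  hugo_symbol = c("SYM_A", "SYM_X", "SYM_C", "SYM_C2"),
  stringsAsFactors = FALSE
)

# Left join: keep every dataset row, even without a reference symbol
mapped <- merge(expr, reference_map, by = "id", all.x = TRUE)

# Identifiers with no reference symbol at all
unmapped <- unique(mapped$id[is.na(mapped$hugo_symbol)])

# Identifiers that map to more than one reference symbol
multi <- unique(mapped$id[duplicated(mapped$id)])

# Identifiers with a single reference symbol that disagrees with the dataset
mismatch <- with(
  mapped,
  id[!is.na(hugo_symbol) & !(id %in% multi) & dataset_symbol != hugo_symbol]
)

list(unmapped = unmapped, multi = multi, mismatch = mismatch)
```

Each of the three groups needs its own decision (drop, keep the dataset symbol, or resolve by hand), which is why they are reported separately.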
References
Adams, C. R., Htwe, H. H., Marsh, T., Wang, A. L., Montoya, M. L., Subbaraj, L., Tward, A. D., Bardeesy, N., & Perera, R. M. (2019). Transcriptional control of subtype switching ensures adaptation and growth of pancreatic cancer. eLife, 8. https://doi.org/10.7554/elife.45313
Bioconductor - home. (n.d.). Bioconductor.org. Retrieved February 13, 2023, from https://www.bioconductor.org/
edgeR. (n.d.). Bioconductor. Retrieved February 13, 2023, from https://bioconductor.org/packages/release/bioc/html/edgeR.html
Ensembl genome browser 109. (n.d.). Ensembl.org. Retrieved February 12, 2023, from http://useast.ensembl.org/index.html
GEO overview. (n.d.). Nih.gov. Retrieved February 13, 2023, from https://www.ncbi.nlm.nih.gov/geo/info/overview.html
National center for biotechnology information. (n.d.). Nih.gov. Retrieved February 12, 2023, from https://www.ncbi.nlm.nih.gov/
Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. CRC Press.