Assignment 1: Data Set Selection and Initial Processing - bcb420-2023/Metyu_Melkonyan GitHub Wiki

Part 1: Finding the Expression Dataset from GEO

Objective

  • To find an RNA expression dataset for Part 1 of Assignment 1
  • To become familiar with the RNA expression dataset
  • To explore the dataset for non-redundant genes and for genes that may be associated with other cancer types
  • To do further research on pancreatic cancer
  • To do further research on RNA-seq and other methods for quantifying the expression of gene sets

Duration

Time estimated: 2 hours. Time taken: 3 hours. Date started: 2023-02-02. Date completed: 2023-02-02.

Microarray explanation: Probes fixed on a chip hybridize to the labelled sample. Image analysis of the chip then allows expression to be measured; illumination is used to measure the fluorescence.

RNA-seq:

  • RNA-seq sampling
  • RNA extraction and target enrichment
  • Fragmentation of the RNA molecules and cDNA library assembly
  • Sequencing and FASTQ file generation
  • Transcriptome mapping using the sequencing data
  • Bioinformatics: differential expression analysis, variant calling, annotation, novel transcript discovery and RNA editing analysis, using different computational methods

Bulk RNA-seq:

  • We are using bulk RNA-seq for this assignment because the dataset is small and provides HTSeq raw counts, which facilitates the normalization process
  • The GEO database is used to retrieve the gene expression data, together with GEOmetadb
  • SQLite is used to retrieve information from the GEOmetadb database

Conclusion

  • The template code was structured around the query search
  • A potential dataset, GSE164730, was found and used
  • I became more familiar with the RNA-seq procedure and its methods
  • Other research on pancreatic cancer and on other cancers is promising
if(!file.exists('GEOmetadb.sqlite')) 
  GEOmetadb::getSQLiteFile()
con <- DBI::dbConnect(RSQLite::SQLite(),'GEOmetadb.sqlite')
Geo_tables <- DBI::dbListTables(con)
Geo_tables
results <- DBI::dbGetQuery(con,'select * from gpl limit 5')
knitr::kable(head(results[,1:10]), format = "html")

sql <- paste("SELECT DISTINCT gse.title,gse.gse, gpl.title,",
" gse.submission_date,",
" gse.supplementary_file",
"FROM",
" gse JOIN gse_gpl ON gse_gpl.gse=gse.gse",
" JOIN gpl ON gse_gpl.gpl=gpl.gpl",
"WHERE",
" gse.submission_date > '2014-01-01' AND",
" gse.title LIKE '%Cancer%' AND", 
" gpl.organism LIKE '%Homo sapiens%' AND",
" gpl.technology LIKE '%High-throughput sequencing%' ",
" ORDER BY gse.submission_date DESC",sep=" ")
result_query <- DBI::dbGetQuery(con,sql)
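A possible next step is to narrow the query results to series that ship a counts file, since raw counts are what the normalization step needs. The sketch below is only an illustration in base R: `toy_query` is a hypothetical stand-in for `result_query`, and its accessions and file names are invented, not taken from GEO.

```r
# Hypothetical stand-in for result_query; accessions and file names are invented
toy_query <- data.frame(
  gse = c("GSE_A", "GSE_B", "GSE_C"),
  supplementary_file = c(
    "ftp://example.org/GSE_A_raw_counts.txt.gz",
    "ftp://example.org/GSE_B_RAW.tar",
    NA
  ),
  stringsAsFactors = FALSE
)

# Flag series whose supplementary file name mentions "count";
# grepl() returns FALSE for NA, so series without a file are dropped
has_counts <- grepl("count", toy_query$supplementary_file, ignore.case = TRUE)

# Keep only the flagged series
counts_only <- toy_query[has_counts, ]
counts_only$gse
```

The same `grepl()` filter could be applied to `result_query$supplementary_file`, assuming the file names in that column follow a similar naming pattern.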

Part 2: Normalization & Data Cleaning

Objective

  • To clean the gene expression dataset of GSE131222
  • To normalize the expression values
  • To analyze the normalized values
  • To validate that the normalized values make sense
  • To match the normalized values against the results obtained from the final concluding analysis (do they make sense?)

Duration

Time estimated: 3 hours. Time taken: 5 hours. Date started: 2023-02-08. Date completed: 2023-02-12.

Conclusion

  • Different data sets were cleaned
  • Replicated gene expression rows were eliminated
  • Normalization was used to see the difference between unreplicated and replicated values
  • The normalized values make sense because they agree with the expected analysis result (the next part validates this)
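The cleaning and normalization steps summarized above can be sketched as follows. This is a minimal base-R illustration, not the code actually used for the assignment: the gene identifiers and counts are invented, and plain counts-per-million (CPM) scaling stands in for whatever normalization was applied. The references list edgeR, whose `cpm()` and `calcNormFactors()` functions would be the more usual route.

```r
# Toy raw counts; gene IDs and values are invented for illustration
counts <- data.frame(
  gene = c("ENSG_1", "ENSG_2", "ENSG_2", "ENSG_3"),
  s1 = c(10, 0, 5, 85),
  s2 = c(40, 20, 20, 120),
  stringsAsFactors = FALSE
)

# Drop replicated gene rows, keeping the first occurrence of each identifier
dedup <- counts[!duplicated(counts$gene), ]

# Build a genes x samples matrix of the remaining counts
mat <- as.matrix(dedup[, c("s1", "s2")])
rownames(mat) <- dedup$gene

# Counts per million: divide each sample column by its library size
lib_size <- colSums(mat)
cpm <- t(t(mat) / lib_size) * 1e6

# After CPM scaling every sample sums to one million
colSums(cpm)
```

Comparing `colSums(mat)` with `colSums(cpm)` is a quick sanity check that the scaling removed the library-size difference between samples.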

Part 3: Interpretation of the Expression Data

Objectives

  • To interpret the data and understand what the actual case study investigates
  • To use the HUGO symbols provided by the dataset to sort and map the identifiers
  • To prevent any inconsistencies in the data that HUGO symbol conversion could introduce
  • To find the difference between the normalized and the converted data
  • To analyze whether the HUGO symbols make sense and whether they correlate with the previous symbols

Duration

Time estimated: 4 hours. Time taken: 8 hours. Date started: unknown (please time yourself next time!). Date completed: around the time of submission.

Conclusion

  • The data have been mapped to the correct identifiers with normalized values
  • Normalized and cleaned data have been visualized using different plots
  • The divergence and variance have been calculated and shown
  • Important: the HUGO symbols were validated by matching them against the HUGO symbol data, which allowed me to confirm the identity of the HUGO symbols that I have at the end
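The symbol validation described above can be sketched as a lookup followed by three checks: unmapped identifiers, symbols that disagree with the reference, and symbols that occur more than once. This is a hedged base-R illustration only. The `expr` and `reference` tables, their column names and all values are hypothetical, and the reference mapping is assumed to come from a source such as Ensembl.

```r
# Hypothetical expression rows with the symbols supplied by the dataset
expr <- data.frame(
  ensembl_id = c("ENSG_1", "ENSG_2", "ENSG_3", "ENSG_4"),
  dataset_symbol = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  stringsAsFactors = FALSE
)

# Hypothetical reference mapping from identifier to HUGO symbol
reference <- data.frame(
  ensembl_id = c("ENSG_1", "ENSG_2", "ENSG_3"),
  hgnc_symbol = c("GENE_A", "GENE_B2", "GENE_C"),
  stringsAsFactors = FALSE
)

# Look up the reference symbol for every row; unmatched IDs become NA
idx <- match(expr$ensembl_id, reference$ensembl_id)
expr$hgnc_symbol <- reference$hgnc_symbol[idx]

# Identifiers with no reference symbol at all
unmapped <- expr$ensembl_id[is.na(expr$hgnc_symbol)]

# Identifiers whose dataset symbol disagrees with the reference symbol
mapped <- !is.na(expr$hgnc_symbol)
mismatched <- expr$ensembl_id[mapped & expr$dataset_symbol != expr$hgnc_symbol]

# Reference symbols assigned to more than one row
dup_rows <- duplicated(expr$hgnc_symbol) & mapped
duplicated_symbols <- unique(expr$hgnc_symbol[dup_rows])
```

Reporting the lengths of `unmapped`, `mismatched` and `duplicated_symbols` gives a compact summary of how consistent the conversion is before the mapped data are used further.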

References

Adams, C. R., Htwe, H. H., Marsh, T., Wang, A. L., Montoya, M. L., Subbaraj, L., Tward, A. D., Bardeesy, N., & Perera, R. M. (2019). Transcriptional control of subtype switching ensures adaptation and growth of pancreatic cancer. eLife, 8. https://doi.org/10.7554/elife.45313

Bioconductor - home. (n.d.). Bioconductor.org. Retrieved February 13, 2023, from https://www.bioconductor.org/

edgeR. (n.d.). Bioconductor. Retrieved February 13, 2023, from https://bioconductor.org/packages/release/bioc/html/edgeR.html

Ensembl genome browser 109. (n.d.). Ensembl.org. Retrieved February 12, 2023, from http://useast.ensembl.org/index.html

GEO overview. (n.d.). Nih.gov. Retrieved February 13, 2023, from https://www.ncbi.nlm.nih.gov/geo/info/overview.html

National center for biotechnology information. (n.d.). Nih.gov. Retrieved February 12, 2023, from https://www.ncbi.nlm.nih.gov/

Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. CRC Press.