Assignment 1: Data set selection and initial Processing - bcb420-2023/Metyu_Melkonyan GitHub Wiki

Part 1: Finding the Expression Dataset from GEO

Objective

To find an RNA expression dataset for Part 1 of Assignment 1
To become familiar with the RNA expression dataset
To explore the dataset for non-redundant genes and for genes that may be associated with other cancer types
To do further research on pancreatic cancer
To do further research on RNA-seq and other methods for quantifying the expression of gene sets

Duration

Time estimated: 2 hours
Time taken: 3 hours
Date started: 2023-02-02
Completed: 2023-02-02

Microarray explanation: Probes fixed on a chip hybridize to the labelled sample. Illumination is used to measure the fluorescence, and later image analysis allows for expression measurement.

RNA-seq:

RNA-seq sampling
RNA extraction and target enrichment
Fragmentation of the RNA molecules and cDNA library construction
Sequencing and FASTQ file generation
Transcriptome mapping using the sequencing data
Bioinformatics: differential expression analysis, variant calling, annotation, novel transcript discovery and RNA editing analysis, using different computational methods

Bulk RNA-seq:

We are using bulk RNA-seq for this assignment because the dataset is small and provides HTSeq raw counts, which facilitates the normalization process
The GEO database is used to retrieve the gene expression data, together with the GEOmetadb package
SQLite is used to retrieve information from the GEOmetadb database
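As a minimal sketch of why HTSeq raw counts are convenient to work with: htseq-count writes one tab-separated line per gene (identifier, integer count), followed by summary rows whose names start with `__`. The sample text and counts below are made up for illustration, and `read_htseq` is a hypothetical helper, not a function from any package.

```r
# Toy htseq-count style output (made-up counts, for illustration only).
htseq_text <- paste(
  "ENSG00000000003\t120",
  "ENSG00000000005\t0",
  "ENSG00000000419\t87",
  "__no_feature\t1500",
  "__ambiguous\t42",
  sep = "\n"
)

# Hypothetical helper: parse the two-column table and drop the "__" summary rows.
read_htseq <- function(text) {
  tab <- read.delim(text = text, header = FALSE,
                    col.names = c("gene_id", "count"),
                    stringsAsFactors = FALSE)
  tab[!startsWith(tab$gene_id, "__"), ]
}

gene_counts <- read_htseq(htseq_text)
gene_counts
```

Because the values are plain integer counts per gene, they can be passed to a normalization step without any back-transformation.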

Conclusion

The template code was structured around the search query
A candidate dataset, GSE164730, was found and used
I became more familiar with the RNA-seq procedure as well as its methods
Other research in pancreatic cancer and other cancers is promising

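# Download the GEOmetadb SQLite file only if it is not already present
if (!file.exists("GEOmetadb.sqlite")) {
  GEOmetadb::getSQLiteFile()
}

# Open a connection to the local GEOmetadb database
con <- DBI::dbConnect(RSQLite::SQLite(), "GEOmetadb.sqlite")

# List the available tables
Geo_tables <- DBI::dbListTables(con)
Geo_tables

# Preview the first rows and columns of the platform (gpl) table
results <- DBI::dbGetQuery(con, "SELECT * FROM gpl LIMIT 5")
knitr::kable(head(results[, 1:10]), format = "html")

# Human high-throughput sequencing series submitted after 2014-01-01
# with "Cancer" in the title, newest first
sql <- paste("SELECT DISTINCT gse.title, gse.gse, gpl.title,",
             " gse.submission_date,",
             " gse.supplementary_file",
             "FROM",
             " gse JOIN gse_gpl ON gse_gpl.gse = gse.gse",
             " JOIN gpl ON gse_gpl.gpl = gpl.gpl",
             "WHERE",
             " gse.submission_date > '2014-01-01' AND",
             " gse.title LIKE '%Cancer%' AND",
             " gpl.organism LIKE '%Homo sapiens%' AND",
             " gpl.technology LIKE '%High-throughput sequencing%'",
             "ORDER BY gse.submission_date DESC",
             sep = " ")

result_query <- DBI::dbGetQuery(con, sql)

# Close the connection once the queries are done
DBI::dbDisconnect(con)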
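One way to narrow down `result_query` is to keep only the series whose supplementary files look like count tables. The sketch below uses a small made-up data frame in place of the real query result, so the accessions and file names are placeholders; only the column names `gse` and `supplementary_file` come from the query.

```r
# Stand-in for result_query (placeholder accessions and file names).
toy_result <- data.frame(
  gse = c("GSE_A", "GSE_B", "GSE_C"),
  supplementary_file = c(
    "ftp://example.org/GSE_A_raw_counts.txt.gz",
    "ftp://example.org/GSE_B_RAW.tar",
    "ftp://example.org/GSE_C_gene_count_matrix.csv.gz"
  ),
  stringsAsFactors = FALSE
)

# Keep rows whose supplementary file name mentions "count" (case-insensitive).
has_counts <- grepl("count", toy_result$supplementary_file, ignore.case = TRUE)
count_series <- toy_result[has_counts, ]
count_series$gse
```

A file name match is only a heuristic, so each remaining series still needs to be checked by hand on its GEO page.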

Part 2: Normalization & Data Cleaning

Objective

To clean the gene expression dataset GSE131222
To normalize the expression values
To analyze the normalized values
To validate that the normalized values make sense
To check the normalized values against the results of the final analysis (do they make sense?)
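To make the normalization objective concrete, here is a minimal base-R sketch of counts-per-million (CPM) scaling followed by a low-expression filter. The count matrix is made up, and the threshold (CPM above 1 in at least 2 samples) is an assumed example value; the analysis itself may have used a different method, such as edgeR's TMM normalization.

```r
# Toy raw count matrix: 4 genes x 3 samples (made-up values).
counts <- matrix(
  c(500, 300, 200,
      0,   1,   0,
    250, 400, 300,
    250, 299, 500),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Counts per million: divide each column by its library size, then scale.
lib_sizes <- colSums(counts)
cpm <- sweep(counts, 2, lib_sizes, "/") * 1e6

# Assumed filter: keep genes with CPM > 1 in at least 2 samples.
keep <- rowSums(cpm > 1) >= 2
filtered <- counts[keep, , drop = FALSE]

colSums(cpm)        # every column sums to one million
rownames(filtered)  # genes that pass the filter
```

Checking that every CPM column sums to one million is a quick sanity test that the scaling was applied per sample and not per gene.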

Duration

Time estimated: 3 hours
Time taken: 5 hours
Date started: 2023-02-08
Completed: 2023-02-12

Conclusion

Different data sets were cleaned
Replicated gene expression rows were eliminated
Normalization was used to see the difference between unreplicated and replicated values
The normalized values make sense because they agree with the expected analysis result (the next part validates this)
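A minimal sketch of how replicated gene rows can be detected and removed in base R. The table is made up, and keeping the first occurrence of each identifier is an assumed policy; summing or averaging the duplicates would be equally reasonable, depending on why the rows are repeated.

```r
# Toy expression table with one repeated gene identifier (made-up values).
expr <- data.frame(
  gene_id = c("geneA", "geneB", "geneA", "geneC"),
  sample1 = c(10, 5, 12, 0),
  sample2 = c(8, 7, 9, 3),
  stringsAsFactors = FALSE
)

# Count how often each identifier appears and list the repeated ones.
id_counts <- table(expr$gene_id)
repeated_ids <- names(id_counts[id_counts > 1])

# Assumed policy: keep the first row for each identifier.
expr_unique <- expr[!duplicated(expr$gene_id), ]

repeated_ids
nrow(expr_unique)
```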

Part 3: Interpretation of the Expression Data

Objectives

To interpret the data and understand what the actual case study is investigating
To use the HUGO symbols provided by the dataset to sort and map the identifiers
To prevent any inconsistencies in the data that HUGO symbol conversion could introduce
To find the difference between the normalized and the converted data
To check whether the HUGO symbols make sense and whether they correspond to the previous symbols
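The mapping step can be sketched with a lookup table and `match()`. Both the identifiers and the symbol table here are placeholders invented for illustration (in practice the table would come from an annotation source such as Ensembl); the point is to surface unmapped identifiers, and symbols shared by several identifiers, before they cause inconsistencies.

```r
# Placeholder lookup table: identifier -> symbol (not real annotations).
id_to_symbol <- data.frame(
  gene_id = c("ID1", "ID2", "ID3", "ID4"),
  symbol  = c("SYM_A", "SYM_B", "SYM_B", "SYM_C"),
  stringsAsFactors = FALSE
)

# Identifiers found in the expression data; ID5 has no entry in the table.
expr_ids <- c("ID1", "ID2", "ID3", "ID5")

# match() returns NA where an identifier is missing from the table.
symbols <- id_to_symbol$symbol[match(expr_ids, id_to_symbol$gene_id)]

unmapped <- expr_ids[is.na(symbols)]
mapped_symbols <- symbols[!is.na(symbols)]
shared <- unique(mapped_symbols[duplicated(mapped_symbols)])

unmapped  # identifiers without a symbol
shared    # symbols assigned to more than one identifier
```

Reporting both lists, instead of silently dropping rows, makes it possible to decide case by case how to treat each problematic identifier.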

Duration

Time estimated: 4 hours
Time taken: 8 hours
Completed: around the time of submission
Date started: unknown (please time yourself next time!)

Conclusion

The data has been mapped to the correct identifiers with normalized values
Normalized and cleaned data has been visualized using different plots
The divergence and variance have been calculated and shown
Important: the HUGO symbols were validated by matching them against the HUGO symbol data, which allowed me to confirm the identity of the HUGO symbols I have at the end
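As a rough illustration of the variance calculation mentioned in this conclusion, the sketch below log-transforms a made-up normalized matrix and computes the per-gene variance across samples. The values and the pseudocount of 1 are assumptions for the example; they are not taken from the dataset.

```r
# Toy normalized expression matrix: 3 genes x 4 samples (made-up values).
norm_expr <- matrix(
  c(3, 3, 3,  3,
    1, 3, 7, 15,
    0, 1, 0,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("flat", "rising", "low"), paste0("sample", 1:4))
)

# log2 transform with an assumed pseudocount of 1 to avoid log2(0).
log_expr <- log2(norm_expr + 1)

# Per-gene variance across samples, sorted from most to least variable.
gene_var <- sort(apply(log_expr, 1, var), decreasing = TRUE)
gene_var

# Per-sample medians, the kind of summary a boxplot would display.
apply(log_expr, 2, median)
```

Genes with near-zero variance carry little information for distinguishing samples, which is why they are often inspected separately from the most variable ones.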

References

Adams, C. R., Htwe, H. H., Marsh, T., Wang, A. L., Montoya, M. L., Subbaraj, L., Tward, A. D., Bardeesy, N., & Perera, R. M. (2019). Transcriptional control of subtype switching ensures adaptation and growth of pancreatic cancer. eLife, 8. https://doi.org/10.7554/elife.45313

Bioconductor - home. (n.d.). Bioconductor.org. Retrieved February 13, 2023, from https://www.bioconductor.org/

EdgeR. (n.d.). Bioconductor. Retrieved February 13, 2023, from https://bioconductor.org/packages/release/bioc/html/edgeR.html

Ensembl genome browser 109. (n.d.). Ensembl.org. Retrieved February 12, 2023, from http://useast.ensembl.org/index.html

GEO overview. (n.d.). Nih.gov. Retrieved February 13, 2023, from https://www.ncbi.nlm.nih.gov/geo/info/overview.html

National center for biotechnology information. (n.d.). Nih.gov. Retrieved February 12, 2023, from https://www.ncbi.nlm.nih.gov/

Xie, Y., Allaire, J. J., & Grolemund, G. (2018). R markdown: The definitive guide. CRC Press.