Homeworks - bcb420-2023/Metyu_Melkonyan GitHub Wiki

Homework Assignment (Journal entry) - Docker

Objectives

  • Create your own Docker image built from the course base docker image
  • Add additional libraries to the image (DESeq2 and pheatmap, as installed in the Docker script in this entry)
  • Create a container with your Docker image
  • Use your imagination to create a basic R Notebook that does the following
    • Create a 5 by 10 matrix of random integers
    • Define column names as cond1, cond2, cond3, cond4, cond5, ctrl1, ctrl2, ctrl3, ctrl4, ctrl5
    • Define row names as gene1, gene2, gene3 ...
    • Compute the fold change for each gene.
  • Push your Docker file and your basic R Notebook to your GitHub repo
  • Log your progress in your journal
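
The notebook steps listed above can be sketched as follows. The notebook itself was written in R; this is only a minimal Python illustration of the same steps. It assumes that fold change means the mean of the cond columns divided by the mean of the ctrl columns, which the assignment text does not spell out. The seed and the value range of 1 to 100 are arbitrary choices for the example.

```python
import math
import random
from statistics import mean


def make_expression_table(n_genes=5, seed=420):
    """Build a gene-by-sample table of random integers (values are hypothetical)."""
    rng = random.Random(seed)
    columns = [f"cond{i}" for i in range(1, 6)] + [f"ctrl{i}" for i in range(1, 6)]
    rows = [f"gene{i}" for i in range(1, n_genes + 1)]
    # Values start at 1 so that a control mean of zero cannot occur.
    table = {gene: {col: rng.randint(1, 100) for col in columns} for gene in rows}
    return table, columns


def fold_change(table, columns):
    """Fold change per gene, assumed here to be mean(cond) / mean(ctrl)."""
    cond_cols = [c for c in columns if c.startswith("cond")]
    ctrl_cols = [c for c in columns if c.startswith("ctrl")]
    result = {}
    for gene, values in table.items():
        cond_mean = mean(values[c] for c in cond_cols)
        ctrl_mean = mean(values[c] for c in ctrl_cols)
        result[gene] = cond_mean / ctrl_mean
    return result


table, columns = make_expression_table()
fc = fold_change(table, columns)
for gene, value in fc.items():
    print(f"{gene}: fold change = {value:.2f}, log2 fold change = {math.log2(value):.2f}")
```

The log2 value is printed alongside the ratio because it is symmetric around zero, which makes up- and down-regulation easier to compare.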

Duration

Estimated time: ~ 1 hour | Time taken: ~ 1.5 hours | Date started: 2023-01-17 | Completed: 2023-01-20

Procedures

  1. Created a Docker image for running the R Notebook
  2. Built the image with docker build -t homework1 .
  3. Ran the container with docker run -e PASSWORD=secretroom -v ${PWD}:/home/rstudio/projects -p 8787:8787 risserlin/bcb420-base-image
  4. Created the R Notebook on the local RStudio server
  5. Composed the run script: docker run -e PASSWORD=changeit --rm -v "$(pwd)":/home/rstudio/projects -p 8787:8787 risserlin/bcb420-base-image:winter2022

Docker Script

# Building image from course base image
FROM risserlin/bcb420-base-image:latest

# Packages required for the analysis

RUN R -e "BiocManager::install(c('DESeq2', 'pheatmap'))"

Conclusion

  • I became more familiar with Docker images built for different platforms, such as arm64.
  • The homework allowed me to build an image and run it as a container.
  • I became more familiar with R Markdown syntax and with different command line approaches for running Docker on arm64.
  • The R Markdown knitting tool works on both the Windows and Mac Docker platforms.
  • I successfully downloaded the BCB420 Docker image.
  • The Docker image was successfully run as a container.
  • The script in the Dockerfile installed the required packages, DESeq2 and pheatmap.

Notes

  • I had problems pushing my repo to GitHub, which is why I missed the deadline. Sorry for the inconvenience.
  • The GitHub Desktop application did not allow me to push my original code to the repository. Please see the insight.

Homework Assignment (Journal entry) - Annotation sources: OncoKB

Objectives

  • Finding an annotation source through web searching and literature review.
  • Analyzing the database based on its publication date, its authors, and the information it contains.

Duration

Estimated time: ~ 2 hours | Time taken: ~ 2.5 hours | Date started: 2023-02-20 | Completed: 2023-02-25

Questions

  1. What sort of data is it? What sort of information does it offer us?

OncoKB is a standardized, curated cancer annotation resource for oncogenic effects and treatment implications. It was developed by Memorial Sloan Kettering Cancer Center (MSK). I will use this information to better annotate the functions of different oncogenes and their role in pancreatic cancer. The main things I will look for in this database are functional predictions for somatic mutations and their role in tumour initiation and progression. This comprehensive annotation source also lets me navigate to cBioPortal and COSMIC (Catalogue of Somatic Mutations in Cancer). An API has been developed, and there is also a web-based tool that I can navigate.

  2. When and where was it published? Was it published?

It was published in 2017 by Chakravarty et al. in JCO Precision Oncology. Several other papers describing its usage have since been published, such as those from MSK and others (https://ashpublications.org/blood/article/134/Supplement_1/2148/428130/Annotation-of-Somatic-Genomic-Variants-in). Other tools also use OncoKB, such as Genome Nexus (https://pubmed.ncbi.nlm.nih.gov/35148171/).

  3. Is this annotation set updated regularly or is it a static source?

Yes, it is updated regularly. At the time of writing, the last update was on February 10, 2023.

  4. Where can I find this data? (link to the download web address or site or publication where it can be found)

The cancer gene list can be found on the OncoKB website. The gene annotator has its own GitHub page.

  5. How is the data formatted and released? Does it exist in some sort of standard file format?

OncoKB annotations are available through the website. The data covers 5,983 tumour samples across 19 cancer types. These tumour samples were assessed by their level of resistance to particular treatments, such as FDA-approved drugs. Actionable alterations in the tumour are another factor that shapes the data format. Cancer types differ in their response to treatment, so different treatments are indicated for particular cancer types. Lastly, tumour-type characteristics determine the clinical implications. The clinical implications are then stored in cBioPortal or other databases.

  6. What identifiers are associated with these annotations?

OncoKB uses HGNC symbols as gene names, such as BRCA2, TP53, and CFTR. It also uses the Ensembl ID, the Ensembl Genome Browser location, the NCBI gene number, and the RefSeq accession number.

Homework Assignment (Journal entry) - g:Profiler

Objectives

  • Querying g:Profiler with the gene list to find enriched gene sets and terms
  • Getting accustomed to g:Profiler and trying parameters other than the recommended ones

Parameters

  • Data sources: Reactome, GO biological process, and WikiPathways
  • Multiple hypothesis testing: Benjamini-Hochberg correction
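
To illustrate what the Benjamini-Hochberg correction selected above does, here is a minimal Python sketch of the standard procedure. It is not g:Profiler's own implementation, and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four terms.
example = [0.01, 0.04, 0.03, 0.005]
for raw, adj in zip(example, benjamini_hochberg(example)):
    print(f"raw p = {raw:.3f}, adjusted p = {adj:.3f}")
```

Each raw p-value is scaled by the number of tests divided by its rank, which controls the expected proportion of false positives among the terms reported as significant.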

Duration

Estimated time: ~ 1 hour | Time taken: ~ 1.5 hours | Date started: 2023-03-06 | Completed: 2023-03-07

Questions

  1. What is the top term returned in each data source?
    1. GO biological process: immune system process, GO:0002376
    2. Reactome: REAC:R-HSA-168256
    3. WikiPathways: Allograft rejection, WP2328
  2. How many genes are in each of the above genesets returned? (hint, in the Detailed results tab of g:Profiler results if you click on the arrows next to the stats heading you will be able to see the number of genes in a term, the number of genes in your query and the number of genes in your query that are also in your term)
    1. GO biological process: 2683 genes in the gene set, 426 genes in the query, 287 genes shared between them
    2. Reactome: 2041 genes in the gene set, 330 genes in the query, 217 genes shared between them
    3. WikiPathways: 88 genes in the gene set, 289 genes in the query, 30 genes shared between them
  3. How many genes from our query are found in the above genesets?
    1. 287 out of 426 query genes belong to the gene set
    2. 217 out of 330 query genes belong to the gene set
    3. 30 out of 289 query genes belong to the gene set
  4. Change g:Profiler settings so that you limit the size of the returned genesets. Make sure the returned genesets are between 5 and 200 genes in size. Did that change the results?

Yes. After limiting the gene set size to between 5 and 200 genes, the top hit changed to WP3945 for WikiPathways, to REAC:R-HSA-198933 for Reactome, and to GO:0019882 for GO biological process.

  5. Which of the 4 ovarian cancer expression subtypes do you think this list represents?

All three data sources point to immunological gene associations with ovarian cancer, with a lot of detail on immune cell signalling. I think this dataset is useful for studying immune cell effects, or lymphoid cancer, in relation to the ovarian cancer expression subtypes. Most of the pathways are associated with immune cell functions, immune cell signalling, or other concepts related to the immune system. The Gene Ontology terms relate more to the role and function of immune cells, while the Reactome and WikiPathways terms relate more to the enzymatic reaction chains within these immune cells.

  6. Bonus: The top gene returned for this comparison is TFEC (Ensembl gene id: ENSG00000105967). Is it found annotated in any of the pathways returned by g:Profiler for our query? What terms is it associated with in g:Profiler?
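
The counts reported for the three top terms can be turned into overlap fractions with a few lines of Python. This is only an arithmetic sketch using the term sizes, query sizes, and shared gene counts from the Detailed results tab as recorded in this entry. The function and field names are made up for this example.

```python
def overlap_summary(term_size, query_size, overlap):
    """Fraction of the query and of the term covered by the shared genes."""
    return {
        "query_fraction": overlap / query_size,
        "term_fraction": overlap / term_size,
    }


# Counts as recorded in this journal entry for each top term.
top_terms = {
    "GO:0002376": {"term_size": 2683, "query_size": 426, "overlap": 287},
    "REAC:R-HSA-168256": {"term_size": 2041, "query_size": 330, "overlap": 217},
    "WP2328": {"term_size": 88, "query_size": 289, "overlap": 30},
}

for term_id, counts in top_terms.items():
    summary = overlap_summary(**counts)
    print(
        f"{term_id}: {summary['query_fraction']:.1%} of the query, "
        f"{summary['term_fraction']:.1%} of the term"
    )
```

Looking at both fractions shows why large terms can dominate the results: a broad term can cover most of the query while the shared genes make up only a small part of the term itself.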

[Image: newplot]

Homework Assignment (Journal entry) - GSEA

Objectives

  • Downloading the ranked gene list
  • Retrieving the Bader Lab gene set collection (gene symbols) released on 1 March 2021, containing GO biological process and all pathways, with no IEA annotations.
  • GSEA analysis by using GRE guide

Parameters

  • Maximum geneset size: 200
  • Minimum geneset size: 15
  • Number of permutations: 1000

Duration

Estimated time: ~ 1.5 hours | Time taken: ~ 2 hours | Date started: 2023-03-17 | Completed: 2023-03-19

Summary

Summary of the analysis until Assignment2

Thresholded vs non-thresholded analysis

GSEA enrichment analysis: the analysis used 1000 permutations, a maximum gene set size of 200, and a minimum gene set size of 15. These parameters control the stringency of the analysis.
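
The effect of the size limits can be sketched in Python as below. This is not GSEA's own code. It assumes that a gene set's size is counted after restricting it to the genes present in the ranked list. The gene and gene set names are toy values.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size=15, max_size=200):
    """Keep gene sets whose overlap with the ranked list is within the size limits."""
    universe = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        # Only genes that appear in the ranked list count towards the size.
        present = set(genes) & universe
        if min_size <= len(present) <= max_size:
            kept[name] = sorted(present)
    return kept


# Toy ranked list and gene sets (hypothetical names).
ranked_genes = [f"GENE{i}" for i in range(1, 301)]
toy_sets = {
    "TOO_SMALL": ranked_genes[:5],
    "IN_RANGE": ranked_genes[:50],
    "TOO_LARGE": ranked_genes[:250],
}

kept = filter_gene_sets(toy_sets, ranked_genes)
print(sorted(kept))
```

Only the gene set with 50 genes passes the filter, because the other two fall outside the 15 to 200 range.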

Questions

  1. Explain the reasons for using each of the above parameters.

GSEA can be less accurate for very large and very small gene sets. I used 200 and 15 as the size limits to narrow the scope of my gene enrichment search. Gene sets with fewer than 15 genes might not be well annotated. The number of gene set permutations is set to 1000. This value keeps the running time short while still giving specific results. To sum up, this set of parameters gives more specific results for both the mesenchymal subtype genes and the immunoreactive subtype genes.

  2. What is the top gene set returned for the Mesenchymal subtype? What is the top gene set returned for the Immunoreactive subtype? For each of the genesets answer the below questions:

For the Mesenchymal subtype: the top gene set returned is HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION. The ES is 0.86, the NES is 2.57, and the FDR q-value is 0.0. The leading edge size is 145 genes. The top gene is FBN1, which ranks 4th in the list of genes.

For the Immunoreactive subtype: the top gene set returned is HALLMARK_INTERFERON_ALPHA_RESPONSE. The ES is -0.86, the NES is -2.90, and the FDR q-value is 0.0. The top gene is PROCR, at rank 1960.

Reference List for the Homeworks

  • https://rmarkdown.rstudio.com/articles_intro.html

  • Chakravarty D, Gao J, Phillips S, Kundra R, Zhang H, Wang J, Rudolph JE, Yaeger R, Soumerai T, Nissan MH, et al. 2017. OncoKB: A precision oncology knowledge base. JCO Precis Oncol. (1):1–16. doi:10.1200/po.17.00011. http://dx.doi.org/10.1200/po.17.00011.

  • Gene Ontology Consortium. 2015. Gene Ontology Consortium: going forward. Nucleic Acids Res. 43(Database issue):D1049-56. doi:10.1093/nar/gku1179. http://dx.doi.org/10.1093/nar/gku1179.

  • Reimand J, Kull M, Peterson H, Hansen J, Vilo J. 2007. g:Profiler—a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 35(suppl_2):W193–W200. doi:10.1093/nar/gkm226. http://dx.doi.org/10.1093/nar/gkm226.

  • Srivastav P. A docker tutorial for beginners. A Docker Tutorial for Beginners. [accessed 2023 Mar 19]. https://docker-curriculum.com/.

  • Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 102(43):15545–15550. doi:10.1073/pnas.0506580102. http://dx.doi.org/10.1073/pnas.0506580102.

  • Home - Bader lab @ the University of Toronto. Baderlab.org. [accessed 2023 Mar 19]. https://baderlab.org/.