Homeworks - bcb420-2023/Metyu_Melkonyan GitHub Wiki

Homework assignment (Journal entry) - Docker


  • Create your own Docker image built from the course base docker image
  • Add additional libraries to the image:
  • Create a container with your Docker image
  • Use your image to create a basic R Notebook that does the following
    • Create a 5 by 10 matrix of random integers
    • Define column names as cond1, cond2, cond3, cond4, cond5, ctrl1, ctrl2, ctrl3, ctrl4, ctrl5
    • Define row names as gene1, gene2, gene3 ...
    • Compute the fold change for each gene.
  • Push your Dockerfile and your basic R Notebook to your GitHub repo
  • Log your progress in your journal
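The notebook steps listed above can be sketched outside R as well. The block below is only an illustration in Python, not the R Notebook that was submitted. It assumes that "fold change" means the mean of the cond columns divided by the mean of the ctrl columns, and the function names, the seed and the 1 to 100 value range are hypothetical choices made for this example.

```python
import math
import random


def make_expression_matrix(n_genes=5, seed=420):
    """Build an n_genes x 10 matrix of random positive integers with names."""
    rng = random.Random(seed)
    col_names = [f"cond{i}" for i in range(1, 6)] + [f"ctrl{i}" for i in range(1, 6)]
    row_names = [f"gene{i}" for i in range(1, n_genes + 1)]
    # Values start at 1 so that no group mean is zero and the ratio is defined.
    matrix = [[rng.randint(1, 100) for _ in col_names] for _ in row_names]
    return row_names, col_names, matrix


def fold_changes(row_names, col_names, matrix):
    """Return {gene: (fold_change, log2_fold_change)} as mean(cond) / mean(ctrl)."""
    cond_idx = [j for j, name in enumerate(col_names) if name.startswith("cond")]
    ctrl_idx = [j for j, name in enumerate(col_names) if name.startswith("ctrl")]
    result = {}
    for gene, row in zip(row_names, matrix):
        cond_mean = sum(row[j] for j in cond_idx) / len(cond_idx)
        ctrl_mean = sum(row[j] for j in ctrl_idx) / len(ctrl_idx)
        fc = cond_mean / ctrl_mean
        result[gene] = (fc, math.log2(fc))
    return result


rows, cols, mat = make_expression_matrix()
for gene, (fc, log2fc) in fold_changes(rows, cols, mat).items():
    print(f"{gene}: fold change = {fc:.2f}, log2 fold change = {log2fc:.2f}")
```

The log2 value is reported next to the raw ratio because it is symmetric around zero, so a doubling and a halving have the same magnitude.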


Estimated time: ~1 hour | Time taken: ~1.5 hours | Date started: 2023-01-17 | Completed: 2023-01-20


  1. Created an image with R Notebook
  2. Built the image with docker build -t homework1 .
  3. Ran the container with docker run -e PASSWORD=secretroom -v ${PWD}:/home/rstudio/projects -p 8787:8787 risserlin/bcb420-base-image
  4. Created an R Notebook on the local RStudio server


  • I became more familiar with Docker images for different platforms, such as arm64
  • The homework allowed me to build an image and run it as a container.
  • I became more familiar with R Markdown syntax and with different command line approaches to running arm64 Docker
  • The knit tool of R Markdown works on both the Windows and Mac Docker platforms
  • I successfully downloaded the BCB420 Docker image
  • The Docker image was run as a container successfully
  • The script in the Dockerfile downloaded the required DESeq2 packages.


  • I had problems pushing my repo to GitHub, which is why I missed the deadline. Sorry for the inconvenience.
  • The GitHub Desktop application did not allow me to push my original code to the repository.

Homework Assignment (Journal entry) - Annotation sources: OncoKB


  • Finding an annotation source through web searching and literature review.
  • Analyzing the database based on its publication date, its authors and the information it contains.


Estimated time: ~2 hours | Time taken: ~2.5 hours | Date started: 2023-02-20 | Completed: 2023-02-25


  1. What sort of data is it? What sort of information does it offer us?

OncoKB is a standardized, curated cancer annotation resource covering the oncogenic effects and treatment implications of genomic alterations. It was developed by Memorial Sloan Kettering Cancer Center (MSK). I will use this information to better annotate the function of different oncogenes and their role in pancreatic cancer. Functional predictions for somatic mutations, and their role in tumour initiation and progression, are the main points that I will be looking for in this database. This comprehensive annotation source also lets me navigate to cBioPortal and COSMIC (Catalogue of Somatic Mutations in Cancer). An API has been developed, and there is a web-based tool through which I can navigate the data.

  1. When and where was it published? Was it published?

It was published in 2017 by Chakravarty et al. in JCO Precision Oncology. Several other papers describing its usage have been published, by MSK and [others](https://ashpublications.org/blood/article/134/Supplement_1/2148/428130/Annotation-of-Somatic-Genomic-Variants-in). There are also other tools that use OncoKB, such as [Genome Nexus](https://pubmed.ncbi.nlm.nih.gov/35148171/).

  1. Is this annotation set updated regularly or is it a static source?

Yes, it is updated regularly. At the time of writing, the last update was on February 10, 2023.

  1. Where can I find this data? (link to the download web address or site or publication where it can be found)

The cancer gene list can be found here. There is also a GitHub page for the gene annotator.

  1. How is the data formatted and released? Does it exist in some sort of standard file format?

OncoKB annotations are available through the website. The data cover 5,983 tumour samples across 19 cancer types. These tumour samples were assessed by their level of resistance to the treatments used, such as FDA-approved drugs. Actionable alterations in the tumours are another factor that shapes the data format. Cancer types differ in their response to treatments, which can lead to different treatments for particular cancer types. Lastly, tumour type characteristics determine the clinical implications. The clinical implications are then stored in cBioPortal or other databases.

  1. What identifiers are associated with these annotations?

OncoKB uses HGNC symbols as gene names. It also uses the Ensembl ID, the Ensembl Genome Browser location, the NCBI gene number and the RefSeq accession number.

Homework Assignment (Journal entry) - g:Profiler


  • Querying g:Profiler with the gene list to get the associated genes and terms
  • Getting accustomed to g:Profiler and trying parameters other than the recommended ones

Parameters

  • Data sources: Reactome, GO biological process, and WikiPathways
  • Multiple hypothesis testing: Benjamini-Hochberg correction
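As a reminder of what the Benjamini-Hochberg correction does to a list of p-values, here is a minimal Python sketch of the standard step-up procedure. It is not g:Profiler's own code, and the function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that each
    # adjusted value is never larger than the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


example = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values
print(benjamini_hochberg(example))
```

Each raw p-value is scaled by the number of tests divided by its rank, so the smallest p-values are penalized the most, and the running minimum keeps the adjusted values monotone.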


Estimated time: ~1 hour | Time taken: ~1.5 hours | Date started: 2023-03-06 | Completed: 2023-03-07


1. What is the top term returned in each data source?

    1. GO biological process: Immune system process, GO:0002376
    2. Reactome: Immune System, REAC:R-HSA-168256
    3. WikiPathways: Allograft rejection, WP2328

2. How many genes are in each of the above genesets returned? (hint, in the Detailed results tab of g:Profiler results if you click on the arrows next to the stats heading you will be able to see the number of genes in a term, number of genes in your query and number of genes in your query that are also in your term)

    1. GO biological process: the geneset contains 2683 genes in total and the query contains 426 genes. The query and the geneset share 287 genes.
    2. Reactome: the geneset contains 2041 genes in total and the query contains 330 genes. The query and the geneset share 217 genes.
    3. WikiPathways: the geneset contains 88 genes in total and the query contains 289 genes. The query and the geneset share 30 genes.

3. How many genes from our query are found in the above genesets?

    1. GO biological process: 287 out of 426 query genes belong to the geneset
    2. Reactome: 217 out of 330 query genes belong to the geneset
    3. WikiPathways: 30 out of 289 query genes belong to the geneset
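The counts reported above can be turned into proportions with a few lines of arithmetic. The sketch below is in Python and uses only the term size, query size and shared gene count listed for each top term; the dictionary layout and function name are hypothetical.

```python
# Counts copied from the g:Profiler results reported above.
top_terms = {
    "GO:0002376": {"term_size": 2683, "query_size": 426, "overlap": 287},
    "REAC:R-HSA-168256": {"term_size": 2041, "query_size": 330, "overlap": 217},
    "WP2328": {"term_size": 88, "query_size": 289, "overlap": 30},
}


def overlap_fractions(entry):
    """Return (fraction of query genes in the term, fraction of term genes in the query)."""
    of_query = entry["overlap"] / entry["query_size"]
    of_term = entry["overlap"] / entry["term_size"]
    return of_query, of_term


for term_id, entry in top_terms.items():
    of_query, of_term = overlap_fractions(entry)
    print(f"{term_id}: {of_query:.1%} of the query, {of_term:.1%} of the term")
```

Looking at both fractions is useful because a large term can contain most of the query while the query still covers only a small part of the term.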

4. Change g:Profiler settings so that you limit the size of the returned genesets. Make sure the returned genesets are between 5 and 200 genes in size. Did that change the results?

Yes. After limiting the geneset size to between 5 and 200 genes, the top hit changed to WP3945 for WikiPathways, to REAC:R-HSA-198933 for Reactome, and to GO:0019882 for GO biological process.

5. Which of the 4 ovarian cancer expression subtypes do you think this list represents?

The top terms from all three data sources are associated with immunological genes in ovarian cancer, and there is a lot of detail about immune cell signalling. I feel this dataset is useful for studying immune cell effects, or lymphoid cancer associations, within the ovarian cancer expression subtypes. Most of the pathways are associated with immune cell functions, immune cell signalling or other concepts related to the immune system. The Gene Ontology terms relate more to the role and function of immune cells, whereas the Reactome and WikiPathways terms relate more to the chains of enzymatic reactions in these immune cells.

6. Bonus: The top gene returned for this comparison is TFEC (Ensembl gene id: ENSG00000105967). Is it found annotated in any of the pathways returned by g:Profiler for our query? What terms is it associated with in g:Profiler?

Homework Assignment (Journal entry) - GSEA


  • Downloading the ranked gene list
  • Data retrieval of the Bader lab geneset collection (gene symbols) published on 1 March 2021, containing GOBP and all pathways but no IEA annotations.
  • GSEA analysis by using GRE guide

Parameters

  • Maximum geneset size: 200
  • Minimum geneset size: 15
  • Number of permutations: 1000


Estimated time: ~1.5 hours | Time taken: ~2 hours | Date started: 2023-03-17 | Completed: 2023-03-19


  1. Explain the reasons for using each of the above parameters.

GSEA can be less accurate for very large and very small gene sets. I am using a maximum of 200 and a minimum of 15 genes to narrow the scope of the enrichment search. Gene sets with fewer than 15 genes might not be well annotated. The number of gene set permutations is set to 1000, which keeps the running time short while still giving specific results. To sum up, this set of parameters gives more specific results for both the Mesenchymal subtype genes and the Immunoreactive subtype genes.
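The size limits described above amount to a simple filter over the gene set collection. A minimal Python sketch follows, assuming the size is counted only over genes that also appear in the ranked list; the function name and the toy gene sets are hypothetical and this is not GSEA's own code.

```python
def filter_gene_sets(gene_sets, ranked_genes, min_size=15, max_size=200):
    """Keep gene sets whose overlap with the ranked list has min_size..max_size genes."""
    universe = set(ranked_genes)
    kept = {}
    for name, genes in gene_sets.items():
        # Only genes present in the ranked list count towards the size.
        in_data = set(genes) & universe
        if min_size <= len(in_data) <= max_size:
            kept[name] = in_data
    return kept


# Toy example with made-up gene names and small limits so the effect is visible.
ranked = [f"g{i}" for i in range(1, 21)]
toy_sets = {
    "TOO_SMALL": ["g1", "g2"],
    "JUST_RIGHT": ["g1", "g2", "g3", "g4", "not_in_data"],
    "TOO_LARGE": ranked,
}
print(sorted(filter_gene_sets(toy_sets, ranked, min_size=3, max_size=10)))
```

With the homework settings the call would use min_size=15 and max_size=200, which are the defaults in this sketch.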

  1. What is the top gene set returned for the Mesenchymal sub type? What is the top gene set returned for the Immunoreactive subtype? For each of the genesets answer the below questions:

For the Mesenchymal subtype: the top gene set returned is HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION. The ES is 0.86, the NES is 2.57 and the FDR q-val is 0.0. The leading edge contains 145 genes. The top gene is FBN1, which ranks 4th in the gene list.

For the Immunoreactive subtype: the top gene set is HALLMARK_INTERFERON_ALPHA_RESPONSE, with an ES of -0.86, an NES of -2.90 and an FDR q-val of 0.0. The top gene is PROCR, at rank 1960.
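To make the sign of the ES values above easier to read, here is a minimal Python sketch of the weighted running-sum enrichment score described by Subramanian et al. (2005). It is a simplified illustration and not the GSEA software itself: it does not compute the NES or the FDR, which need the permutations, and the function name and toy ranked list are hypothetical.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Running-sum enrichment score for a list of (gene, score) pairs.

    `ranked` must already be sorted from the highest to the lowest score.
    """
    hits = [gene in gene_set for gene, _ in ranked]
    n_miss = len(ranked) - sum(hits)
    hit_weights = [abs(score) ** weight if hit else 0.0
                   for (_, score), hit in zip(ranked, hits)]
    total = sum(hit_weights)
    if total == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, scored genes")
    running = 0.0
    best = 0.0
    for w, hit in zip(hit_weights, hits):
        # Step up on a hit (weighted by its score), step down on a miss.
        running += w / total if hit else -1.0 / n_miss
        # The ES is the largest deviation of the running sum from zero.
        if abs(running) > abs(best):
            best = running
    return best


toy_ranked = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0), ("e", -2.0), ("f", -3.0)]
print(enrichment_score(toy_ranked, {"a", "b"}))  # set at the top of the list
print(enrichment_score(toy_ranked, {"e", "f"}))  # set at the bottom of the list
```

A gene set concentrated at the top of the ranked list gives a positive ES, as for the Mesenchymal result, while one concentrated at the bottom gives a negative ES, as for the Immunoreactive result.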

Reference List for the Homeworks

  • https://rmarkdown.rstudio.com/articles_intro.html

  • Chakravarty D, Gao J, Phillips S, Kundra R, Zhang H, Wang J, Rudolph JE, Yaeger R, Soumerai T, Nissan MH, et al. 2017. OncoKB: A precision oncology knowledge base. JCO Precis Oncol. (1):1–16. doi:10.1200/po.17.00011. http://dx.doi.org/10.1200/po.17.00011.

  • Gene Ontology Consortium. 2015. Gene Ontology Consortium: going forward. Nucleic Acids Res. 43(Database issue):D1049-56. doi:10.1093/nar/gku1179. http://dx.doi.org/10.1093/nar/gku1179.

  • Reimand J, Kull M, Peterson H, Hansen J, Vilo J. 2007. g:Profiler—a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 35(suppl_2):W193–W200. doi:10.1093/nar/gkm226. http://dx.doi.org/10.1093/nar/gkm226.

  • Srivastav P. A docker tutorial for beginners. A Docker Tutorial for Beginners. [accessed 2023 Mar 19]. https://docker-curriculum.com/.

  • Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, et al. 2005. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 102(43):15545–15550. doi:10.1073/pnas.0506580102. http://dx.doi.org/10.1073/pnas.0506580102.

  • Home - Bader lab @ the University of Toronto. Baderlab.org. [accessed 2023 Mar 19]. https://baderlab.org/.