6. g:Profiler Assignment: HW 2 - bcb420-2022/Inika_Prasad GitHub Wiki

Use g:Profiler to query a dataset

Start time: 15:30, March 7 Estimated time: 2 hours End time:

Use this list of genes: genelist.txt as your query set and run a g:Profiler enrichment analysis with the following parameters:

  • Data sources: Reactome, GO Biological Process, and Wiki Pathways
  • Multiple hypothesis testing: Benjamini-Hochberg
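For reference, the same parameters could also be sent to g:Profiler programmatically. The following is only a sketch: the endpoint URL, the field names, the source codes (`GO:BP`, `REAC`, `WP`) and the `fdr` method name are assumptions based on the public g:Profiler REST API, the `hsapiens` organism and the 0.05 threshold are assumed here, and `build_gost_payload` is a hypothetical helper. The snippet only builds and prints the request body; the actual network call is left as a comment.

```python
import json

# Assumption: endpoint and field names follow the public g:Profiler REST API;
# check the current API documentation before relying on them.
GOST_URL = "https://biit.cs.ut.ee/gprofiler/api/gost/profile/"


def build_gost_payload(genes, all_results=False):
    """Build the JSON body for an enrichment query (hypothetical helper)."""
    return {
        "organism": "hsapiens",                  # assumed: human gene list
        "query": list(genes),
        "sources": ["GO:BP", "REAC", "WP"],      # GO BP, Reactome, Wiki Pathways
        "significance_threshold_method": "fdr",  # Benjamini-Hochberg
        "user_threshold": 0.05,                  # assumed significance cutoff
        "all_results": all_results,
    }


# Three genes mentioned on this page, standing in for genelist.txt.
payload = build_gost_payload(["HLA-DQA1", "SIGLEC5", "TFEC"])
print(json.dumps(payload, indent=2))

# To actually run the query (needs network access and the requests package):
#   import requests
#   results = requests.post(GOST_URL, json=payload).json()["result"]
```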

What should be done with HGNC symbols that have more than one Ensembl ID associated with them (for example, HLA-DQA1 and SIGLEC5)? Strategy: select the Ensembl ID with the most GO annotations.
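A minimal sketch of that strategy, assuming the number of GO annotations per candidate ID is already known. The function name `pick_ensembl_id`, the IDs and the counts below are placeholders for illustration, not real g:Profiler output. Ties are broken alphabetically, which is an arbitrary choice made only so the result is reproducible.

```python
def pick_ensembl_id(go_counts):
    """Return the Ensembl ID with the most GO annotations.

    go_counts maps a candidate Ensembl gene ID to its number of GO annotations.
    Ties are broken by taking the alphabetically first ID.
    """
    return min(go_counts, key=lambda ens_id: (-go_counts[ens_id], ens_id))


# Placeholder IDs and counts for the two ambiguous symbols.
candidates = {
    "HLA-DQA1": {"ENSG_A": 12, "ENSG_B": 40},
    "SIGLEC5": {"ENSG_C": 7, "ENSG_D": 7},
}

chosen = {symbol: pick_ensembl_id(ids) for symbol, ids in candidates.items()}
print(chosen)
```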

  1. What is the top term returned in each data source?
  • GO Biological Process: immune system process (Term ID GO:0002376)
  • Reactome: Immune System (REAC:R-HSA-168256)
  • Wiki Pathways: TYROBP causal network in microglia (WP:WP3945)
  2. How many genes are in each of the above genesets returned? (Hint: in the Detailed Results tab of the g:Profiler results, clicking the arrows next to the stats heading shows the number of genes in a term, the number of genes in your query, and the number of genes in your query that are also in the term.)
  • GO Biological Process Immune system process: 2748
  • Reactome Immune System: 2041
  • Wiki Pathways TYROBP causal network in microglia: 63

Why is the number of genes in my query different for each of the data sources? Because each data source annotates a different set of genes. Query genes that have no annotation in a given data source are excluded from the query count for that source.

  3. How many genes from our query are found in the above genesets? The following numbers of genes are present in both my query and the term.
  • GO Biological Process Immune system process: 287
  • Reactome Immune System: 218
  • Wiki Pathways TYROBP causal network in microglia: 27
  4. Change the g:Profiler settings so that you limit the size of the returned genesets. Make sure the returned genesets are between 5 and 200 genes in size. Did that change the results?

Restricting the term size to between 5 and 200 genes changed the top results for GO Biological Process and Reactome. This excludes umbrella terms that include many genes (immune system process is very general and contains 2748 genes). By limiting the term size we get more specific answers. The top Wiki Pathways term is unchanged because its 63 genes already fall within the range.

  • GO Biological Process: antigen processing and presentation (GO:0019882)
  • Reactome: Immunoregulatory interactions between a Lymphoid and non-Lymphoid cell (REAC:R-HSA-198933)
  • Wiki Pathways: TYROBP causal network in microglia (WP:WP3945)
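The effect of the size limit on the original top terms can be sketched as a simple filter. This assumes each result is a record with a `term_size` field, which is an assumption about the result layout. `filter_by_term_size` is a hypothetical helper, and the three sizes are the ones reported earlier on this page.

```python
def filter_by_term_size(terms, min_size=5, max_size=200):
    """Keep only terms whose gene count lies within [min_size, max_size]."""
    return [t for t in terms if min_size <= t["term_size"] <= max_size]


# Top terms from the unrestricted run, with the term sizes reported above.
terms = [
    {"source": "GO:BP", "name": "immune system process", "term_size": 2748},
    {"source": "REAC", "name": "Immune System", "term_size": 2041},
    {"source": "WP", "name": "TYROBP causal network in microglia", "term_size": 63},
]

kept = filter_by_term_size(terms)
print([t["name"] for t in kept])
# Only the Wiki Pathways term survives; the two umbrella terms are dropped.
```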
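  5. Which of the 4 ovarian cancer expression subtypes do you think this list represents? What are the 4 ovarian cancer subtypes? Immunoreactive, mesenchymal, proliferative, and differentiated. This list likely came from the immunoreactive subtype, since the top terms returned are immune-related.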

  6. Bonus: The top gene returned for this comparison is TFEC (Ensembl gene ID: ENSG00000105967). Is it found annotated in any of the pathways returned by g:Profiler for our query? What terms is it associated with in g:Profiler?

Downloading the results as a CSV (with "All results" not selected) and searching for TFEC returns no matches.
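The CSV search can also be scripted. The sketch below assumes the export has a `term_name` column and an `intersections` column holding a comma-separated list of query genes per term. Both column names are assumptions to check against the actual file header. `terms_containing` is a hypothetical helper and the sample rows are placeholders, not real results.

```python
import csv
import io


def terms_containing(csv_text, gene):
    """Return the names of terms whose intersection list contains the gene."""
    hits = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        genes = {g.strip().upper() for g in row["intersections"].split(",")}
        if gene.upper() in genes:
            hits.append(row["term_name"])
    return hits


# Placeholder rows standing in for a downloaded results file.
sample = (
    "source,term_name,term_id,intersections\n"
    'GO:BP,term A,ID1,"GENE1,GENE2"\n'
    'REAC,term B,ID2,"GENE2,GENE3"\n'
)

print(terms_containing(sample, "GENE2"))
print(terms_containing(sample, "TFEC"))
```

Matching on whole gene symbols rather than substrings avoids false hits from symbols that merely contain the search string.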

Based on the g:Profiler FAQ page: "For a small query of not directly related genes, no terms might show up as significant. If you still want to explore to which of the terms the input genes belong, then please pick the All results option from advanced options section. In this way you can explore all the terms where at least one input gene belongs to."

Searching for TFEC alone also returns no results unless the "All results" option is selected. With that option enabled, g:Profiler no longer restricts its output to terms that remain statistically significant after multiple testing correction, so we can obtain the full list of gene-term relationships or check particular terms. Source

  • Bonferroni correction is more stringent than the Benjamini-Hochberg FDR.
  • GEM: Generic Enrichment Map.
  • In R, gprofiler2 is the package and gost is the function used to query g:Profiler. There, Benjamini-Hochberg is called FDR.
  • Exclude KEGG, since its datasets cannot be downloaded without paying.
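The difference in stringency between the two corrections can be seen on a toy example. This is a plain-Python sketch of the textbook procedures on made-up p-values, not g:Profiler's own implementation.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p = [0.001, 0.01, 0.02, 0.04]  # made-up raw p-values
bonf = bonferroni(p)
bh = benjamini_hochberg(p)

print(sum(q < 0.05 for q in bonf), "terms pass Bonferroni at 0.05")
print(sum(q < 0.05 for q in bh), "terms pass Benjamini-Hochberg at 0.05")
```

On these four values, two terms stay below 0.05 after Bonferroni while all four stay below it after Benjamini-Hochberg.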