Enrichment analyses - uic-ric/uic-ric.github.io GitHub Wiki

The results from the differential analysis of any sort of -omics data may be a long list of significantly different features, e.g. genes, proteins or small molecules. If the list contains more than a small handful of features, e.g. > 20, it can be hard to decide where to begin or how to interpret all of these results. In this case, enrichment or pathway analyses can provide insight into how these differences may correspond to biologically relevant collections of features (genes, proteins, molecules) such as pathways.

There are two main modes or styles of enrichment analyses that are based on the statistical test being used.

  • Fisher's Exact Test - Uses Fisher's exact test to see if the ratio of significantly different features within a collection, e.g. pathway or ontology term, is higher (enriched) as compared to the overall ratio of significantly different features.

  • GSEA - The Gene Set Enrichment Analysis provides insight into possible up vs. down relationships in defined feature collections, e.g. pathways or ontology terms, when looking at pairwise comparisons of the experimental groups.

Fisher's Exact Test

This is the most common statistic used in enrichment/pathway analyses. The underlying concept is to compare a subset feature list, e.g. a list of significantly different genes, with a list of known/predefined groups/collections of features, e.g. pathways or ontology terms, and look for instances in which the ratio of selected (significantly different) features within a given collection of features (pathway) is significantly greater than the overall ratio of selected features among all possible features in the data. This type of analysis can also be called a hypergeometric test or over-representation analysis (ORA).
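The comparison described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the code RIC or any commercial tool runs; the function name `ora_p_value` and its parameter names are hypothetical, and the counts in the usage example are made up. It computes the one-sided (enrichment) p-value as the upper tail of the hypergeometric distribution, which is equivalent to a one-sided Fisher's exact test on the corresponding 2x2 table.

```python
from math import comb


def ora_p_value(overlap, selected_total, term_total, background_total):
    """One-sided over-representation p-value (hypothetical helper).

    Probability of seeing at least `overlap` selected features in a term
    of size `term_total` when `selected_total` features are drawn at
    random from `background_total` features.
    """
    max_overlap = min(selected_total, term_total)
    # Number of ways to choose the selected list from the background.
    denominator = comb(background_total, selected_total)
    # Sum the hypergeometric probabilities for overlaps >= the observed one.
    numerator = sum(
        comb(term_total, i)
        * comb(background_total - term_total, selected_total - i)
        for i in range(overlap, max_overlap + 1)
    )
    return numerator / denominator


# Made-up example: 500 selected genes out of 10,000 measured genes,
# and a pathway of 100 genes, 20 of which are in the selected list.
p = ora_p_value(overlap=20, selected_total=500,
                term_total=100, background_total=10000)
print(f"enrichment p-value: {p:.3g}")
```

In this made-up example 5% of all genes are selected but 20% of the pathway's genes are, so the p-value is small. In practice this calculation is repeated for every term and the p-values are then corrected for multiple testing.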

A key benefit of this mode of analysis is that you only need a list of selected features; how the list is selected does not affect the analysis itself. The selection process is, however, very important for the interpretation of the results. Some examples of feature selection might be...

  • Features significantly different in a particular comparison (pairwise or term test)
  • Features that are significantly different in one pairwise comparison, e.g. treatment vs. control in WT, that are not significant in a different comparison, e.g. treatment vs. control in KO.
  • Features in a particular cluster derived from a data driven clustering of the data.
  • From variant calling data, genes that had particular types of variants, e.g. synonymous vs. non-synonymous.
  • From epigenomics experiments (ChIP-seq, ATAC-seq), genes downstream of peaks of interest.

However, one main disadvantage of this type of analysis is that any sort of ranking or degree of significance is lost. It only matters whether the feature is in the selected list or not. If the selection includes all genes that were significantly different with q < 0.05, a gene with a very small q value (1e-15) would be treated the same as a gene with q = 0.042. Furthermore, the direction of the change in the data, e.g. increased in treatment vs. control, is also lost. One work-around can be to set up separate lists for features that increased vs. those that decreased.
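As a rough illustration of that work-around, the sketch below splits a differential-analysis result into separate "up" and "down" lists that could each be tested for enrichment on their own. The tuple layout (feature, log2 fold change, q-value), the function name `split_by_direction`, the q < 0.05 cutoff, and the gene names are all assumptions made for the example, not a format produced by any particular tool.

```python
def split_by_direction(results, q_cutoff=0.05):
    """Split significant features into increased and decreased lists.

    `results` is assumed to be an iterable of
    (feature, log2_fold_change, q_value) tuples.
    """
    up = [f for f, lfc, q in results if q < q_cutoff and lfc > 0]
    down = [f for f, lfc, q in results if q < q_cutoff and lfc < 0]
    return up, down


# Made-up differential results for treatment vs. control.
results = [
    ("GeneA", 2.1, 1e-15),   # strongly increased
    ("GeneB", 0.4, 0.042),   # barely significant, increased
    ("GeneC", -1.3, 0.003),  # decreased
    ("GeneD", 0.9, 0.30),    # not significant
]

up_list, down_list = split_by_direction(results)
print("increased:", up_list)
print("decreased:", down_list)
```

Note that GeneA and GeneB land in the same list despite very different q-values, which is exactly the loss of ranking information described above.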

Overview of results

This sort of analysis is commonly performed in most commercial pathway analysis tools, e.g. IPA and MetaCore. The format of the results can vary by tool and it is best to check any help guides for details on the analysis results.

For analyses performed by RIC you may receive an XLSX spreadsheet with the following columns:

  • Category - The category for the terms, e.g. GO or KEGG.
  • Term - The ID for the term, e.g. GO:0005739 or map00910.
  • Description - Description/name of this term.
  • Intersection with gene list - Number of features in this term that were in the input list of selected features
  • Total in gene list - Total number of features in the input list of selected features
  • Total in term list - Total number of features in this term.
  • Background total - Total number of features in your dataset.
  • Log2 Fold enrichment - The log2 of the enrichment, i.e. the log2 of the ratio of selected features in the term/group divided by the ratio of selected features in the entire dataset. A value of "1" would indicate that the ratio of selected features in this term is 2x higher than the overall ratio of selected features in the dataset. "2" is a 4x enrichment, "3" is an 8x enrichment, etc.
  • p-value - The nominal p value of the Fisher's Exact test.
  • q-value - The FDR corrected p value of the Fisher's Exact test.
  • Genes in intersection - List of the selected features present in this term.
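To make the relationship between these columns concrete, here is a small sketch that derives the Log2 Fold enrichment from the four count columns and converts a set of nominal p-values into q-values. The function names are hypothetical, the counts and p-values are made up, and the Benjamini-Hochberg procedure is an assumption: the column description only says "FDR corrected" and does not name the method used.

```python
from math import log2


def log2_fold_enrichment(intersection, total_in_gene_list,
                         total_in_term, background_total):
    """log2 of (selected ratio within the term) / (selected ratio overall)."""
    ratio_in_term = intersection / total_in_term
    ratio_overall = total_in_gene_list / background_total
    return log2(ratio_in_term / ratio_overall)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg FDR correction (assumed method, see lead-in)."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Made-up row: 20 of a term's 100 genes are among 500 selected genes,
# out of 10,000 genes in the dataset: (20/100) / (500/10000) = 4x.
lfe = log2_fold_enrichment(20, 500, 100, 10000)
print(f"log2 fold enrichment: {lfe:.2f}")

# Made-up nominal p-values for four terms.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print("q-values:", [round(v, 4) for v in q])
```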

GSEA - Gene Set Enrichment Analysis

GSEA is a statistic that looks for feature collections whose members are concentrated toward the top or bottom of a ranked list of features. In this type of analysis the results will inherently have an up vs. down direction due to the ranking of the features. Because the method works on a ranking, an initial differential analysis of the data is not required: one provides the information for all of the features, though one can use differential statistics to create a ranking metric for the features.
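The idea of scoring a collection against a ranked list can be sketched as a running sum. The code below is a simplified illustration of the weighted running-sum enrichment score from the original GSEA method, not the GSEA software itself: it omits the permutation testing and normalization that real analyses rely on for significance. The function name `enrichment_score`, the `weight` parameter default, and the toy gene names and ranking values are assumptions for the example.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Simplified GSEA-style running-sum enrichment score.

    `ranked` is a list of (feature, metric) pairs sorted from the most
    positive metric to the most negative. The running sum steps up at
    members of `gene_set` (weighted by |metric|) and down otherwise;
    the score is the largest deviation from zero.
    """
    in_set = [feature in gene_set for feature, _ in ranked]
    hit_weights = [abs(metric) ** weight if hit else 0.0
                   for (_, metric), hit in zip(ranked, in_set)]
    hit_total = sum(hit_weights)
    n_miss = in_set.count(False)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need set members with nonzero metric and non-members")

    running, best = 0.0, 0.0
    for hit, w in zip(in_set, hit_weights):
        running += w / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking, e.g. by a signed differential statistic.
ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0),
          ("D", -1.0), ("E", -2.0), ("F", -3.0)]

es_top = enrichment_score(ranked, {"A", "B"})     # members at the top
es_bottom = enrichment_score(ranked, {"E", "F"})  # members at the bottom
print("top set ES:", es_top)
print("bottom set ES:", es_bottom)
```

A positive score indicates that the collection's members sit toward the top of the ranking (e.g. increased), and a negative score indicates that they sit toward the bottom, which is where the up vs. down interpretation comes from.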

The main drawback for this type of analysis is that it only addresses pairwise comparisons, making it difficult to apply in multiple group comparisons, such as an ANOVA test or genes from a clustering result, or non-quantitative instances (e.g., a list of genes with damaging variants).