Community Detection of Functional Enrichment Networks - bcb420-2023/Jielin_Yang GitHub Wiki

Date: 2023-03-25

Typical functional enrichment methods, such as thresholded over-representation analysis (ORA) and gene set enrichment analysis (GSEA), produce a table of gene sets or pathways together with an enrichment score for each. Depending on the biological experiment, the number of enriched pathways can be large or small. In practice, however, several databases are usually searched at once, so the number of enriched pathways tends to be large. This makes it difficult to find the most significant pathways and, more importantly, the most biologically relevant ones. Furthermore, an enrichment score or a significance p-value alone is not enough to describe the relationships among the pathways and their genes. Such analyses are therefore typically followed by network visualization and analysis of the enrichment results.

In the EnrichmentMap pipeline, each gene set or pathway is a node, and the genes shared between two gene sets determine the connectivity between the two nodes. As mentioned above, the enrichment result can contain a large number of gene sets, which greatly complicates the network and makes it difficult for humans to interpret. It is therefore important to extract the major themes of the network and to visualize it in a way that is easy to interpret. This is where community detection algorithms come in.
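The node-and-edge construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the EnrichmentMap implementation: the gene set names, their members, the equal weighting of the Jaccard and overlap coefficients, and the 0.375 cutoff are all hypothetical values chosen for the example.

```python
from itertools import combinations

# Hypothetical enriched gene sets (names and members are illustrative only).
gene_sets = {
    "PATHWAY_A": {"G1", "G2", "G3", "G4"},
    "PATHWAY_B": {"G3", "G4", "G5", "G6"},
    "PATHWAY_C": {"G3", "G4"},
    "PATHWAY_D": {"G7", "G8", "G9"},
}


def jaccard(a, b):
    """Shared genes divided by the union of both gene sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap(a, b):
    """Shared genes divided by the size of the smaller gene set."""
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def build_edges(sets, cutoff=0.375):
    """Return (node1, node2, similarity) for pairs at or above the cutoff.

    Similarity is the mean of the Jaccard and overlap coefficients;
    both the weighting and the cutoff are assumptions for this sketch.
    """
    edges = []
    for (name1, genes1), (name2, genes2) in combinations(sorted(sets.items()), 2):
        score = 0.5 * jaccard(genes1, genes2) + 0.5 * overlap(genes1, genes2)
        if score >= cutoff:
            edges.append((name1, name2, round(score, 3)))
    return edges


edges = build_edges(gene_sets)
for edge in edges:
    print(edge)
```

In this toy input, PATHWAY_D shares no genes with the other sets, so it receives no edges and stays an isolated node, while the three sets that share G3 and G4 become connected to one another.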

A community detection algorithm can be thought of as a clustering algorithm that partitions the nodes into several non-overlapping (or potentially overlapping) sets that are densely connected within each set but sparsely connected between sets. For example, a common method called the Louvain method separates the nodes into non-overlapping sets by optimizing the modularity of the network, which measures the density of edges within each cluster relative to edges between clusters. The algorithm itself does not know what each node represents, but in the biological context the connectivity of the nodes is determined by the number and proportion of genes shared between the gene sets. A community is therefore expected to contain gene sets and pathways that share many genes and are thus biologically related.

However, there are still challenges in detecting biological communities. For example, in Cytoscape, a common outcome of running such a community detection algorithm to extract the themes of a network is that several communities remain highly interconnected, even when the thresholds for including nodes and edges are set to be very strict. This makes the network harder to interpret: the themes are not clear, because we cannot fully identify the gene sets that are strongly connected to two communities at once. The interpretation of the network therefore still relies on the user's knowledge of the biological context and on manual formatting of the network. The presentation of the network must also take the aim of the study into account, using biological knowledge of the relevant pathways either to explain the phenotypic changes at the gene level, or to reveal previously unidentified biological pathways that can be further investigated and validated.
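To make the modularity score mentioned above concrete, here is a minimal Python sketch that evaluates a given partition of a small unweighted network. It is not the Louvain method itself, which searches for a partition that maximizes this score; it only computes the quantity being optimized, using the standard definition Q = Σ_c [L_c / m − (d_c / 2m)²]. The node names and the toy network (two triangles joined by one bridging edge) are hypothetical.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2
    where m is the total number of edges, L_c the number of edges with
    both endpoints in c, and d_c the summed degree of the nodes in c.
    """
    m = len(edges)
    membership = {node: idx for idx, group in enumerate(communities) for node in group}
    internal = defaultdict(int)    # L_c per community
    degree_sum = defaultdict(int)  # d_c per community
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Hypothetical toy network: two tightly linked groups of gene sets
# joined by a single bridging edge (a3 - b1).
toy_edges = [
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),
    ("a3", "b1"),
]

q_split = modularity(toy_edges, [{"a1", "a2", "a3"}, {"b1", "b2", "b3"}])
q_merged = modularity(toy_edges, [{"a1", "a2", "a3", "b1", "b2", "b3"}])
print(f"two communities: {q_split:.3f}")
print(f"one community:   {q_merged:.3f}")
```

Splitting the toy network into its two triangles scores higher than lumping every node into one community, which is the behavior a modularity-optimizing method relies on. In practice an existing implementation would be used rather than hand-written code, for example `louvain_communities` in recent versions of NetworkX.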

References

Gaiteri C, Chen M, Szymanski B, et al. (2015) Identifying robust communities and multi-community nodes by combining top-down and bottom-up approaches to clustering. Sci Rep. 5:16361.

Reimand J, Isserlin R, Voisin V, et al. (2019) Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat Protoc. 14(2):482-517.

Harrington LX, Way GP, Doherty JA, Greene CS. (2018) Functional network community detection can disaggregate and filter multiple underlying pathways in enrichment analyses. Pac Symp Biocomput. 23:157-167.