Community detection algorithms: a comparative analysis
Year: 2010
Authors: Andrea Lancichinetti and Santo Fortunato
File: lancichinetti_2010.pdf
- 1. GN (Girvan and Newman) benchmark
- 2. LFR benchmark
- 3. The Algorithms
- 3.1 Algorithm of Girvan and Newman (GN)
- 3.2 Fast greedy modularity optimization by Clauset, Newman and Moore (Clauset et al.)
- 3.3 Exhaustive modularity optimization via simulated annealing (simulated annealing)
- 3.4 Fast modularity optimization by Blondel et al. (Blondel et al.)
- 3.5 Algorithm by Radicchi et al. (Radicchi et al.)
- 3.6 Cfinder
- 3.7 Markov Cluster Algorithm (MCL)
- 3.8 Structural algorithm by Rosvall and Bergstrom (Infomod)
- 3.9 Dynamic algorithm by Rosvall and Bergstrom (Infomap)
- 3.10 Spectral algorithm by Donetti and Muñoz (DM)
- 3.11 Expectation-maximization algorithm by Newman and Leicht (EM)
- 3.12 Potts model approach by Ronhovde and Nussinov (RN)
- 4. Tests on the GN benchmark
- 4.1 Algorithm of Girvan and Newman (GN)
- 4.2 Fast greedy modularity optimization by Clauset, Newman and Moore (Clauset et al.)
- 4.3 Exhaustive modularity optimization via simulated annealing (simulated annealing)
- 4.4 Fast modularity optimization by Blondel et al. (Blondel et al.)
- 4.5 Algorithm by Radicchi et al. (Radicchi et al.)
- 4.6 Cfinder
- 4.7 Markov Cluster Algorithm (MCL)
- 4.8 Structural algorithm by Rosvall and Bergstrom (Infomod)
- 4.9 Dynamic algorithm by Rosvall and Bergstrom (Infomap)
- 4.10 Spectral algorithm by Donetti and Muñoz (DM)
- 4.11 Potts model approach by Ronhovde and Nussinov (RN)
- 4.12 Approximate Ranking on the GN benchmark
- 5. Tests on the LFR benchmark
- 6. Summary
Many methods have been developed, using tools and techniques from disciplines like physics, biology, applied mathematics, computer and social sciences.
However, it is still not clear which algorithms are reliable and should be used in applications. The question of reliability is itself tricky, as it requires shared definitions of community and partition which are, at present, still missing.
This essentially means that, despite the huge literature on the topic, there is still no agreement among scholars on what a network with communities looks like.
Nevertheless, there has been a silent acceptance of a simple network model, the planted l-partition model, which is often used in the literature in various versions. In this model one "plants" a partition, consisting of a certain number of groups of nodes. Each node has a probability p_in of being connected to nodes of its own group and a probability p_out of being connected to nodes of different groups.
As long as p_in > p_out the groups are communities, whereas when p_in ≤ p_out the network is essentially a random graph, without community structure.
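A minimal sketch of the model, using networkx's planted_partition_graph (the parameter values here are illustrative, not taken from the paper):

```python
# Sketch: generate a planted l-partition graph with networkx.
import networkx as nx

l, k = 4, 32                 # l groups of k nodes each
p_in, p_out = 0.3, 0.05      # intra- and inter-group link probabilities

G = nx.planted_partition_graph(l, k, p_in, p_out, seed=42)
print(G.number_of_nodes(), G.number_of_edges())

# With p_in > p_out the planted groups are communities; as p_out
# approaches p_in the graph degenerates into a random graph.
```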
- The most popular version of the planted l-partition model was proposed by Girvan and Newman;
- the graph consists of 128 nodes, each with expected degree 16, which are divided into four groups of 32;
- The GN benchmark is regularly used to test algorithms for community detection. Indeed, algorithms can be compared based on their performance on this benchmark;
- However, the GN benchmark has two drawbacks:
- all nodes have the same expected degree;
- all communities have equal size;
These features are unrealistic, as complex networks are known to be characterized by heterogeneous distributions of degree and community sizes.
In recent papers, we have introduced a new class of benchmark graphs (the LFR benchmark), which generalizes the GN benchmark by introducing power-law distributions of degree and community size.
The new graphs are a real generalization, in that the GN benchmark is recovered in the limit case in which the exponents of the distributions of degree and community sizes go to infinity. Most community detection algorithms perform very well on the GN benchmark, due to the simplicity of its structure. The LFR benchmark, instead, poses a much harder test to algorithms, and makes it easier to disclose their limits.
Moreover, the LFR benchmark graphs can be built very quickly: the complexity of the construction algorithm is linear in the number of links of the graph, so one can perform tests on very large systems, provided the method under study is fast enough to analyze them.
The algorithm of Girvan and Newman is the first algorithm of the modern age of community detection in graphs. It is a hierarchical divisive algorithm, in which links are iteratively removed based on the value of their betweenness, which expresses the number of shortest paths between pairs of nodes that pass through the link.
In its most popular implementation, the procedure of link removal ends when the modularity of the resulting partition reaches a maximum. The modularity of Newman and Girvan is a well known quality function that estimates the goodness of a partition based on the comparison between the graph at hand and a null model, which is a class of random graphs with the same expected degree sequence as the original graph.
The algorithm has a complexity of O(N³) on a sparse graph. In the following we will refer to it as GN.
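A minimal sketch of the GN procedure, using networkx's girvan_newman together with its modularity function to pick the partition of maximal modularity:

```python
# Sketch: GN algorithm, stopping at the partition of maximal modularity.
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()

best_partition, best_q = None, float("-inf")
for partition in girvan_newman(G):   # each step removes the highest-betweenness link
    q = modularity(G, partition)
    if q > best_q:
        best_partition, best_q = partition, q

print(best_q, [sorted(c) for c in best_partition])
```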
The method of Clauset, Newman and Moore is essentially a fast implementation of a previous technique proposed by Newman. Starting from a set of isolated nodes, the links of the original graph are iteratively added so as to produce the largest possible increase of the modularity of Newman and Girvan at each step.
The fast version of Clauset, Newman and Moore, which uses more efficient data structures, has a complexity of O(N log² N) on sparse graphs.
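The CNM algorithm is available in networkx as greedy_modularity_communities; a minimal usage sketch:

```python
# Sketch: fast greedy modularity optimization (Clauset-Newman-Moore).
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()
communities = greedy_modularity_communities(G)   # agglomerative merges maximizing the modularity gain
print(modularity(G, communities), [sorted(c) for c in communities])
```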
Exhaustive modularity optimization via simulated annealing has the same goal as the previous algorithms, but the precision of the final estimate of the maximum is far higher, owing to the exhaustive optimization, at the expense of computational speed. The latter cannot be expressed in closed form, as in the cases above, since it depends on the parameters used for the optimization. We stick to the procedure used by Guimerà and Amaral.
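A minimal sketch of the idea (not the full Guimerà-Amaral scheme, which also uses collective moves such as merges and splits of communities): single-node moves accepted with the Metropolis rule.

```python
# Minimal sketch of modularity optimization via simulated annealing.
import math
import random
import networkx as nx
from networkx.algorithms.community import modularity

def sa_modularity(G, T0=1.0, cooling=0.999, steps=5000, seed=0):
    rng = random.Random(seed)
    nodes = list(G)
    labels = {v: i for i, v in enumerate(nodes)}  # start: one node per community

    def partition():
        groups = {}
        for v, c in labels.items():
            groups.setdefault(c, set()).add(v)
        return list(groups.values())

    q, T = modularity(G, partition()), T0
    for _ in range(steps):
        v = rng.choice(nodes)
        old = labels[v]
        labels[v] = rng.choice(list(labels.values()))  # move v into a random existing community
        new_q = modularity(G, partition())
        # Metropolis rule: always accept improvements; accept a worse
        # partition with probability exp(dQ / T).
        if new_q < q and rng.random() >= math.exp((new_q - q) / T):
            labels[v] = old                            # reject the move
        else:
            q = new_q
        T *= cooling
    return partition(), q

parts, q = sa_modularity(nx.karate_club_graph())
print(len(parts), q)
```

Recomputing modularity from scratch at every step keeps the sketch short but makes it slow; real implementations update the modularity change of a move incrementally.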
The method by Blondel et al. is a multistep technique based on a local optimization of Newman-Girvan modularity in the neighborhood of each node. After a partition is identified in this way, communities are replaced by supernodes, yielding a smaller weighted network.
The procedure is then iterated, until modularity (which is always computed with respect to the original graph) does not increase any further. This method offers a fair compromise between the accuracy of the estimate of the modularity maximum, which is better than that delivered by greedy techniques like the one by Clauset et al. above, and computational complexity, which is essentially linear in the number of links of the graph.
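This is the Louvain method; a minimal usage sketch (louvain_communities requires networkx ≥ 2.8):

```python
# Sketch: multilevel modularity optimization (Blondel et al. / Louvain).
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

G = nx.karate_club_graph()
communities = louvain_communities(G, seed=7)
print(modularity(G, communities))
```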
The algorithm by Radicchi et al. is in the spirit of that by Girvan and Newman above. In fact, it is a divisive hierarchical method, where links are iteratively removed based on the value of their edge clustering coefficient, which is defined as the ratio between the number of loops based on the link and the largest possible number of loops that can be based on the link.
The edge clustering coefficient is a local measure, so its computation is not as heavy as that of edge betweenness, which yields a significant improvement in complexity: the algorithm runs in O(N²) on a sparse graph. Another major difference from the GN algorithm is the stopping criterion of the procedure, which depends on the properties of the communities themselves and not on the values of a quality function like modularity.
Radicchi et al. considered two types of communities: strong communities are groups of nodes such that the internal degree of each node exceeds its external degree; weak communities are groups of nodes such that the total internal degree of the nodes of the group exceeds their total external degree.
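In formulas (following Radicchi et al., for the triangle version of the measure), the edge clustering coefficient of a link (i, j) and the two community definitions read:

```latex
% Edge clustering coefficient: z_{ij} is the number of triangles the
% link (i,j) belongs to, the denominator the largest possible number.
\tilde{C}^{(3)}_{ij} = \frac{z^{(3)}_{ij} + 1}{\min(k_i - 1,\, k_j - 1)}

% Strong community V: every node has more internal than external links.
k_i^{\mathrm{in}}(V) > k_i^{\mathrm{out}}(V) \quad \forall\, i \in V

% Weak community V: total internal degree exceeds total external degree.
\sum_{i \in V} k_i^{\mathrm{in}}(V) > \sum_{i \in V} k_i^{\mathrm{out}}(V)
```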
The Cfinder is a local algorithm proposed by Palla et al. that looks for communities that may overlap, i.e. share nodes. Theirs was the first paper in the physics literature on community detection to address this problem, which is important in many systems such as social networks.
Communities are defined as the largest possible subgraphs that can be explored by rolling k-cliques across the network, where a k-clique rolls by rotating about any of its component (k − 1)-cliques (which are links when k = 3).
The complexity of this procedure can be high, as the computational time needed to find all k-cliques of a graph is an exponentially growing function of the graph size, but in practical applications the method is rather fast, enabling one to analyze systems with up to 10⁵ nodes.
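Clique percolation is available in networkx as k_clique_communities; a minimal sketch (note that the returned communities may share nodes):

```python
# Sketch: clique percolation, the method behind the Cfinder.
import networkx as nx
from networkx.algorithms.community import k_clique_communities

G = nx.karate_club_graph()
communities = list(k_clique_communities(G, 3))   # k = 3: roll triangles
print([sorted(c) for c in communities])
```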
This is an algorithm developed by S. Van Dongen, which simulates a peculiar diffusion process on the graph. One starts from the right stochastic matrix (or diffusion matrix) of the graph, which is obtained from the adjacency matrix of the original graph by dividing the elements of each row by their sum.
Then one computes an integer power of this matrix (usually the square), which yields the probability matrix of a random walk after a number of steps equal to the power considered. This step is called expansion.
Next, each element of the matrix is raised to some power α, in order to enhance (artificially) the probability of the walker to be trapped within a community. This step is called inflation.
The expansion and inflation steps are iterated until one obtains the adjacency matrix of a forest (i.e. a disconnected tree), whose components are the communities. This method, widely used in bioinformatics, is strongly dependent on the choice of the parameter α.
Its complexity can be lowered to O(Nk²) if, after each inflation step, only the k largest elements of the resulting matrix are kept, whereas the others are set to zero. In the following we will refer to the method as MCL.
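A toy dense-matrix sketch of the expansion/inflation loop with numpy (real MCL implementations work on sparse matrices, usually add self-loops first, and prune small entries as described above):

```python
# Minimal sketch of the Markov Cluster Algorithm (MCL).
import numpy as np

def mcl(adj, power=2, alpha=2.0, iters=100, tol=1e-6):
    # adj: dense numpy adjacency matrix (float).
    # Right stochastic matrix, as in the text: divide each row by its sum.
    M = adj / adj.sum(axis=1, keepdims=True)
    for _ in range(iters):
        M_old = M
        M = np.linalg.matrix_power(M, power)    # expansion
        M = M ** alpha                          # inflation
        M = M / M.sum(axis=1, keepdims=True)    # renormalize rows
        if np.abs(M - M_old).max() < tol:
            break
    # At convergence each row keeps mass only on its community's attractors:
    # nodes whose rows are supported on the same columns share a community.
    support = [frozenset(np.nonzero(row > tol)[0]) for row in M]
    return {s: {i for i, t in enumerate(support) if t == s} for s in set(support)}
```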
Here the problem of finding the best cluster structure of a graph is turned into the problem of optimally compressing the information on the structure of the graph, so that one can recover as closely as possible the original structure when the compressed information is decoded.
This is achieved by computing the minimum of a function which expresses the best tradeoff between the minimal conditional information between the original and the compressed information (maximal faithfulness to the original information) and the maximal compression (least possible information to transmit).
The optimization of the function is carried out via simulated annealing, which makes the algorithm quite slow, although one could always go for a faster and less accurate optimization. In the following we will refer to the method as Infomod.
This technique is based on the same principle as the previous one. The difference is that, whereas before one compressed the information on the structure of the graph, here one compresses the information on a dynamic process taking place on the graph, namely a random walk.
The optimal compression is achieved again by optimizing a quality function, which is the Minimum Description Length of the random walk. Such optimization can be carried out rather quickly with a combination of greedy search and simulated annealing. In the following we will refer to the method as Infomap.
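A minimal usage sketch via python-igraph, whose community_infomap is an independent implementation of the Rosvall-Bergstrom method (assuming the python-igraph package is installed):

```python
# Sketch: running Infomap through python-igraph.
import igraph as ig

g = ig.Graph.Famous("Zachary")            # the karate club network
clustering = g.community_infomap(trials=10)
print(clustering.membership)              # community label per node
```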
This is a method based on spectral properties of the graph. The idea is that eigenvector components corresponding to nodes in the same community should have similar values, if communities are well identified. Donetti and Muñoz focused on the eigenvectors of the Laplacian matrix.
They considered a limited number of eigenvectors, say g, and represented each node of the graph as a geometric point in a Euclidean g-dimensional space, whose coordinates are the eigenvector components corresponding to the node. The points are then grouped with traditional hierarchical clustering techniques.
Of the resulting partitions, one picks the one that maximizes the modularity by Newman and Girvan. The method is rather quick when only a few eigenvectors are computed, which is usually the case, as this can be done via the Lanczos method. In the following we will refer to the method as DM.
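A simplified sketch of the DM pipeline with numpy/scipy (DM also experimented with other distance measures and linkage rules; this variant just embeds, clusters, and keeps the highest-modularity cut):

```python
# Sketch: spectral embedding via Laplacian eigenvectors, then
# hierarchical clustering, keeping the cut of maximal modularity.
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from networkx.algorithms.community import modularity

def dm_spectral(G, g=3):
    L = nx.laplacian_matrix(G).toarray().astype(float)
    vals, vecs = np.linalg.eigh(L)             # eigenvalues in ascending order
    X = vecs[:, 1:g + 1]                       # skip the trivial constant eigenvector
    Z = linkage(X, method="complete")          # hierarchical clustering of the embedding
    nodes = list(G)
    best, best_q = None, float("-inf")
    for k in range(2, len(nodes)):
        labels = fcluster(Z, k, criterion="maxclust")
        parts = [{nodes[i] for i in np.nonzero(labels == c)[0]} for c in set(labels)]
        q = modularity(G, parts)
        if q > best_q:
            best, best_q = parts, q
    return best, best_q

print(dm_spectral(nx.karate_club_graph())[1])
```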
Here Bayesian inference is used to deduce the best fit of a given model to the data represented by the actual graph structure. The goodness of the fit is expressed by a likelihood that is maximized by means of the expectation-maximization technique [40].
This leads to a system of self-consistent equations, which can be solved by iteration starting from suitable initial conditions. The equations can be solved rather quickly, and fairly large systems can be analyzed in this way (up to 10⁶ nodes).
A nice feature of the method is that it finds the most relevant group structure of the graph, whether the groups are communities or not (in graphs with multipartite structure the classes are rather anti-communities, as there are very few links inside the groups).
A drawback of the method is that one needs to supply the number of groups, which is usually not known a priori. In the following we will refer to the method as EM.
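Schematically, the self-consistent equations of the Newman-Leicht model (reconstructed from the original paper, not from the text above) read, with q_ir the probability that node i belongs to group r, π_r the expected fraction of nodes in group r, and θ_rj the probability that a link from a node of group r points to node j:

```latex
% E-step: posterior membership probabilities.
q_{ir} = \frac{\pi_r \prod_j \theta_{rj}^{A_{ij}}}
              {\sum_s \pi_s \prod_j \theta_{sj}^{A_{ij}}}

% M-step: parameter updates (n nodes, k_i the degree of node i).
\pi_r = \frac{1}{n} \sum_i q_{ir},
\qquad
\theta_{rj} = \frac{\sum_i A_{ij}\, q_{ir}}{\sum_i k_i\, q_{ir}}
```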
This method is based on the minimization of the Hamiltonian of a Potts-like spin model, where the spin state represents the membership of the node in a given community. A resolution parameter enables one to span several community scales, from very small to very large communities.
The relevant scales are identified by checking for the stability of the partitions obtained for given values of the resolution parameter. This is done by computing the similarity of partitions obtained for the same resolution parameter but starting from different initial conditions.
Peaks in the similarity spectrum correspond to stable/relevant partitions. The method is rather fast, its complexity is slightly superlinear in the number of links of the graph. In the following we will refer to the method as RN.
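Schematically, for an unweighted graph the Hamiltonian to be minimized has the following form (a hedged reconstruction of the Ronhovde-Nussinov "absolute Potts model"; σ_i is the spin/community label of node i and γ the resolution parameter):

```latex
% Links inside a community lower the energy; missing links inside a
% community raise it, with relative weight gamma.
H(\{\sigma\}) = -\sum_{i<j} \left[ A_{ij} - \gamma\,(1 - A_{ij}) \right]
                \delta(\sigma_i, \sigma_j)
```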
As we have explained, on the GN benchmark communities are well defined (in principle) up to the value 3/4 = 0.75 of the mixing parameter. We indicate the mixing parameter with the symbol μt to stress that it refers to topology.
Most methods perform rather well, although all of them start to fail much earlier than the expected threshold of 3/4.
The GN algorithm performs about as well as the MCL.
The MCL is better than the method by Radicchi et al., but is outperformed by modularity-based methods (simulated annealing, Clauset et al., Blondel et al.), which generally do quite well on the GN benchmark, something that was already known from the literature.
The method by Radicchi et al. does not have a remarkable performance either, as it also starts to fail for low values of μt, although it does better than the Cfinder.
The Cfinder fails to detect the communities even when μt ∼ 0, when they are very well identified. This is due to the fact that, even when μt is small, the probe clique that explores the system manages to pass from one group to the other and yields much larger groups, often spanning the whole graph.
Both methods by Rosvall and Bergstrom have a good performance. In fact, up to a sizable value of μt they always recover the planted partition in four clusters.
The DM and RN methods perform comparably to the exhaustive optimization of modularity via simulated annealing.
Approximate ranking of the algorithms on the GN benchmark, from best to worst:
1. Dynamic algorithm by Rosvall and Bergstrom (Infomap)
2. Structural algorithm by Rosvall and Bergstrom (Infomod)
3. Exhaustive modularity optimization via simulated annealing (simulated annealing)
4. Fast greedy modularity optimization by Clauset, Newman and Moore (Clauset et al.)
5. Fast modularity optimization by Blondel et al. (Blondel et al.)
6. Spectral algorithm by Donetti and Muñoz (DM)
7. Potts model approach by Ronhovde and Nussinov (RN)
8. Algorithm of Girvan and Newman (GN)
9. Markov Cluster Algorithm (MCL)
10. Algorithm by Radicchi et al. (Radicchi et al.)
11. Cfinder
The following input parameters are the same for all benchmark graphs used: the average degree is 20, the maximum degree 50, the exponent of the degree distribution is −2 and that of the community size distribution is −1.
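networkx ships an LFR generator; a minimal sketch with the parameters above (networkx requires both exponents to be strictly greater than 1, so the community-size exponent is nudged from 1 to 1.1 here, and generation can occasionally fail to converge for an unlucky seed):

```python
# Sketch: generating an LFR benchmark graph with networkx.
import networkx as nx

G = nx.LFR_benchmark_graph(
    n=1000, tau1=2.0, tau2=1.1, mu=0.3,
    average_degree=20, max_degree=50, seed=10,
)
# networkx stores the planted community of each node as a node attribute.
communities = {frozenset(G.nodes[v]["community"]) for v in G}
print(G.number_of_nodes(), len(communities))
```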
Modularity-based methods have a rather poor performance, which worsens for larger systems and smaller communities, due to the well known resolution limit of the measure. The only exception is represented by the algorithm by Blondel et al., whose performance is very good, probably because the estimated modularity maximum is not a very good approximation of the real one, which is more likely found by simulated annealing.
The Cfinder, the MCL and the method by Radicchi et al. do not have impressive performances either, and display a similar pattern: the performance is severely affected by the size of the communities (it is decent for small communities and worse for large ones), while looking rather insensitive to the size of the network.
The DM has a fair performance, but it gets worse if the network size increases.
The same trend is shown by Infomod, where the performance worsens considerably with the increase of the network size.
Infomap and RN have the best performances, with the same pattern with respect to the size of the network and of the communities: up to values of μt ∼ 1/2 both methods are capable of recovering the planted partition in 100% of cases.
We conclude that Infomap, the RN method and the method by Blondel et al. are the best performing algorithms on the LFR undirected and unweighted benchmark.
Since Infomap and the method by Blondel et al. are also very fast, essentially linear in the network size, we ask how well they perform on much larger graphs than those considered so far.
For this reason we carried out another set of tests of these two algorithms on the LFR benchmark, considering graphs with 50,000 and 100,000 nodes. We did so also because the tests found in the community detection literature typically use very small graphs, and performance can change considerably on large graphs.
Remarkably, the performance of the method by Blondel et al. is worse than on the smaller graphs, whereas that of Infomap is stable and does not seem to be affected.
Directedness is an essential feature of many real networks. Ignoring direction, as one often does or is forced to do, may considerably reduce the information that one can extract from the network structure.
In particular, neglecting link directedness when looking for communities may lead to partial, or even misleading, results.
For a long time there was no benchmark in the literature for directed graphs with communities. However, we have recently extended the LFR benchmark to directed networks, so we are in a position to evaluate the performance of community detection algorithms in this case.
The presence of directed links is a serious obstacle towards a generalization of an algorithm for community detection. Therefore, very few of the currently available algorithms are able to handle directed graphs. Of the methods we consider here, only five can also be used for directed networks:
- Fast greedy modularity optimization by Clauset, Newman and Moore (Clauset et al.)
- Exhaustive modularity optimization via simulated annealing (simulated annealing)
- Cfinder
- Dynamic algorithm by Rosvall and Bergstrom (Infomap)
- Expectation-maximization algorithm by Newman and Leicht (EM)
The EM method, in its original definition, actually has problems dealing with directed graphs. We present here a comparison of the performances of two methods: exhaustive modularity optimization via simulated annealing and Infomap.
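For reference, modularity generalizes to directed graphs (in the form due to Leicht and Newman; this formula is not spelled out in the text above) by replacing the degree product of the null model with an out/in-degree product. With A_ij = 1 if there is a link from i to j, and m the number of directed links:

```latex
Q_d = \frac{1}{m} \sum_{ij} \left[ A_{ij}
      - \frac{k_i^{\mathrm{out}} k_j^{\mathrm{in}}}{m} \right]
      \delta(c_i, c_j)
```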
Here the topological mixing parameter μt refers to the indegrees of the nodes, which are distributed according to a power law as in the original undirected benchmark, while the outdegree is kept constant for all nodes, a choice made to avoid an unnecessary proliferation of input parameters.
Again, we considered two different network sizes and ranges for the community size. The other input parameters for the benchmark are the same that we have given previously.
As expected, modularity optimization shows the same limits that emerged in the previous section. On the other hand, the performance of Infomap is still very good.
In this section we focus on undirected graphs with weighted links. Weights are also precious sources of information. Just as in the case of link directedness above, neglecting weights may imply a significant limitation of the information on a graph’s properties, concealing features of real systems which may be very important and not deducible from the mere topology.
Ideally, one should exploit the information from both topology and weights for a reliable analysis of a network. The LFR benchmark has been extended to weighted graphs as well.
Now there are two mixing parameters: one for topology, which is the same μt we have defined and used so far, and one for the weights, μw, which is the weighted counterpart of μt, i.e. it expresses the fraction of a node's strength that lies on links connecting the node to nodes outside its community.
Recall that the strength of a node is the sum of the weights of its links. Moreover, there is an additional parameter, the exponent of the strength distribution, which we have set to 1.5 for all realizations. All other parameters are the same as specified in the other sections.
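In symbols, per node i, with w_ij the weight of the link between i and j and C(i) the community of i:

```latex
% Strength of node i.
s_i = \sum_j w_{ij}

% Weighted mixing parameter: fraction of i's strength on links
% leaving its community (the benchmark fixes this on average).
\mu_i^{w} = \frac{\sum_{j \notin C(i)} w_{ij}}{s_i}
```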
We consider only three methods:
- Dynamic algorithm by Rosvall and Bergstrom (Infomap);
- Markov Cluster Algorithm (MCL);
- Exhaustive modularity optimization via simulated annealing (simulated annealing);
The other methods have no weighted counterpart or the code for the weighted version was not available.
Four parameter combinations are used: two values of the topological mixing parameter μt, combined with the two usual ranges of small (S) and big (B) communities that we have used so far. The network size is 5000 nodes in each case.
The Infomap by Rosvall and Bergstrom has, once more, a remarkable performance, although it worsens if communities are topologically more mixed (higher μt) and larger in size (B).
The MCL has a fair performance in only one case, for μt = 0.5 and small communities, whereas at the other extreme of strong topological mixing and big communities it fails for any value of μw. Modularity optimization seems to be more sensitive to the community size than to the other parameters.
The fact that communities in real systems often overlap has attracted a lot of attention in recent years, leading to the creation of new algorithms able to deal with this special circumstance, starting from the first work by Palla et al.
Meanwhile, a few methods have been developed, but none of them has been thoroughly tested, except on a handful of specific real-world networks. Indeed, there were no suitable benchmark graphs with overlapping community structure until recently. In particular, the LFR benchmark has been extended to the case of overlapping communities, and we use it here.
Of our set of algorithms, only the Cfinder is able to find overlapping communities. In principle the EM method also assigns to each node the probability that it belongs to any community, but one would then need a criterion to decide which of these probability values are significant and should be kept, and which should be neglected. For this reason we report the results of tests carried out with the Cfinder only.
The topological mixing parameter μt is fixed and one varies the fraction of overlapping nodes between communities. We have run the Cfinder for different types of k-cliques (k indicates the number of nodes of the clique), with k = 3, 4, 5, 6.
In general, we notice that triangles (k = 3) yield the worst performance, whereas 4- and 5-cliques give better results. The algorithm performs better when communities are (on average) smaller and networks larger in size.
We have carried out a comparative analysis of the performances of algorithms for community detection on various graphs: the GN and LFR benchmarks and random graphs.
Link direction, weights and the possibility for communities to overlap have been taken into account in dedicated tests.
We conclude that the Infomap method by Rosvall and Bergstrom is the best performing on the set of benchmarks we have examined here. In particular, its results on the LFR benchmark graphs, which pose a much harder test than the GN benchmark graphs, are encouraging about the reliability of the method in applications to real graphs. Among other things, the method can also be applied to weighted and directed graphs, with excellent performance, so it has a large spectrum of potential applications.
The algorithms by Blondel et al. and by Ronhovde and Nussinov (RN) also look very good from our analysis and could be used as well. In fact, for a study of the community structure in real graphs, one could think of using all three methods, to be able to extract some algorithm-independent information.
Furthermore, these methods have a low computational complexity, so one could use them on graphs with millions of nodes and links. On the other hand, the algorithms are not able to account for overlapping communities, so they need to be properly refined to deal with this possibility, which is common in many real systems.
One may object that, despite the features planted in the LFR benchmark, i.e. the fat-tailed distributions of degree and community size, which are actually observed in real networks, our artificial graphs are still different from real systems.
For instance, the clustering coefficient of the LFR benchmark is very low, due to the very small number of triangles, whereas real networks are characterized by many triangles and consequently a high clustering coefficient.
On the one hand, the GN benchmark also has very few triangles and a low clustering coefficient (the LFR benchmark is just a generalization of the GN benchmark); nevertheless, it has been used extensively for testing algorithms.
On the other hand, nothing forbids modifying the building mechanism of the LFR benchmark so that it does include triangles. This is actually a potentially interesting improvement of the benchmark that deserves attention in the future.
Another important remark is in order. Our whole analysis has made use of graphs with a “flat” community structure, without hierarchy. Many real networks instead have a hierarchical community structure, with communities inside other communities.
Good methods must be able to understand when a network has no communities, a flat or a hierarchical community structure. For an analysis of this kind we would need hierarchical benchmarks.
A hierarchical version of the GN benchmark actually exists, but not yet one of the LFR benchmark, which is sorely needed. Methods to find communities in multipartite graphs also remain to be tested.