Community Detection (Wikipedia) - SergiuTripon/msc-thesis-na-epsrc GitHub Wiki
Home ▸ Literature Survey ▸ Community Detection (Wikipedia)
URL: https://en.wikipedia.org/wiki/Community_structure
- 1. Properties
- 2. Applications
- 3. Algorithms for finding communities
- 4. Testing methods of finding communities algorithms
A network is said to have community structure if the nodes of the network can be easily grouped into (potentially overlapping) sets of nodes such that each set of nodes is densely connected internally.
In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed.
The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities.
1. Properties
In the context of networks, community structure refers to the occurrence of groups of nodes in a network that are more densely connected internally than with the rest of the network. This inhomogeneity of connections suggests that the network has certain natural divisions within it.
Communities are often defined in terms of the partition of the set of vertices, that is each node is put into one and only one community. This is a useful simplification and most community detection methods find this type of community structure.
However, in some cases a better representation could be one where vertices are in more than one community. This might happen in a social network where each vertex represents a person, and the communities represent the different groups of friends: one community for family, another community for co-workers, one for friends in the same sports club, and so on.
Some networks may not have any meaningful community structure. Many basic network models, for example, such as the random graph and the Barabási-Albert model, do not display community structure.
2. Applications
Community structures are quite common in real networks. Social networks include community groups (the origin of the term, in fact) based on common location, interests, occupation, etc. Metabolic networks have communities based on functional groupings. Citation networks form communities by research topic.
Finding communities within an arbitrary network can be a computationally difficult task. The number of communities, if any, within the network is typically unknown and the communities are often of unequal size and/or density. Despite these difficulties, however, several methods for community finding have been developed and employed with varying levels of success.
- one of the oldest algorithms for dividing networks into parts;
- variants include ratio cut and normalized cut;
- used in load balancing for parallel computing in order to minimize communication between processor nodes
- divides the network into a predetermined number of parts, usually of approximately the same size, chosen such that the number of edges between groups is minimized;
- works well in many of the applications for which it was originally intended but is less than ideal for finding community structure in general networks since it will find communities regardless of whether they are implicit in the structure, and it will find only a fixed number of them;
- a similarity measure is defined which quantifies some (usually topological) type of similarity between node pairs;
- commonly used measures include the cosine similarity, the Jaccard index, and the Hamming distance between rows of the adjacency matrix;
- similar nodes are grouped into communities according to this measure
- several common schemes for performing the grouping, the two simplest being single-linkage and complete linkage clustering;
- single-linkage clustering: two groups are considered separate communities if and only if all pairs of nodes in different groups have similarity lower than a given threshold;
- complete linkage clustering: all nodes within every group have similarity greater than a threshold;
- novel approach involves the use of various similarity or dissimilarity measures, combined through convex sums, which has greatly improved the performance of this kind of methodology;
- identifies edges in a network that lie between communities and then removes them, leaving behind just the communities themselves;
- identification is performed by employing the graph-theoretic measure betweenness centrality, which assigns a number to each edge which is large if the edge lies "between" many pairs of nodes;
- returns results of reasonable quality and is popular because it has been implemented in a number of standard software packages;
- it also runs slowly, taking time O(m2n) on a network of n vertices and m edges, making it impractical for networks of more than a few thousand nodes;
- uses the Modularity benefit function;
- modularity measures the quality of a particular division of a network into communities;
- detects communities by searching over possible divisions of a network for one or more that have particularly high modularity;
- exhaustive search over all possible divisions is usually intractable, practical algorithms are based on approximate optimization methods such as greedy algorithms, simulated annealing, or spectral optimization, with different approaches offering different balances between speed and accuracy;
- a popular modularity maximization approach is the Louvain method, which iteratively optimizes local communities until global modularity can no longer be improved given perturbations to the current community state;
- currently, the best modularity maximization algorithm (winner of the 10th DIMACS Implementation Challenge) is an iterative ensemble algorithm;
- usefulness of modularity optimization is questionable, as it has been shown that modularity optimization often fails to detect clusters smaller than some scale, depending on the size of the network (resolution limit);
- on the other hand the landscape of modularity values is characterized by a huge degeneracy of partitions with high modularity, close to the absolute maximum, which may be very different from each other;
- attempts to fit a generative model to the network data, which encodes the community structure;
- overall advantage of this approach compared to the alternatives is its more principled nature, and the capacity to inherently address issues of statistical significance;
- most methods in the literature are based on the stochastic block model as well as variants including mixed membership, degree-correction and hierarchical structures;
- model selection can be performed using principled approaches such as minimum description length and Bayesian model selection;
- currently many algorithms exist to perform efficient inference of stochastic block models, including belief propagation and agglomerative Monte Carlo;
- differently from approaches which attempt to cluster the network given an ad-hoc quality function, this class of methods is based on generative models which not only serve as a description of the large-scale structure of the network, but also can be used to generalize the data, and predict the occurrence of missing or spurious links in the network;
- cliques are subgraphs in which every node is connected to every other node in the clique;
- as nodes can not be more tightly connected than this, it is not surprising that there are many approaches to community detection in networks based on the detection of cliques in a graph and the analysis of how these overlap;
- as a node can be a member of more than one clique, a node can be a member of more than one community in these methods giving an overlapping community structure;
- one approach is to find the maximal cliques, that is find the cliques which are not the subgraph of any other clique;
- the classic algorithm to find these is the Bron-Kerbosch algorithm;
- the overlap of these can be used to define communities in several ways;
- the simplest is to consider only maximal cliques bigger than a minimum size (number of nodes);
- the union of these cliques then defines a subgraph whose components (disconnected parts) then define communities;
- the alternative approach to is to use cliques of fixed size, k;
- the overlap of these can be used to define a type of k-regular hypergraph or a structure which is a generalisation of the line graph (the case when k=2) known as a Clique graph;
The evaluation of algorithms, to detect which are better at detecting community structure, is still an open question. It must be based on analyses of networks of known structure.
- divides the network into four equally-sized groups (usually of 32 nodes each) and the probabilities of connection within and between groups varied to create more or less challenging structures for the detection algorithm;
- such benchmark graphs are a special case of the planted l-partition model of Condon and Karp, or more generally of "stochastic block models," a general class of random network models containing community structure;
4.2 LFR benchmark
- proposed by Lancichinetti et al.;
- allows for varying group sizes and nontrivial degree distributions;
- an extension of the "four groups" benchmark that includes heterogeneous distributions of node degree and community size, making it a more severe test of community detection methods;
- start with a network of well-defined communities;
- then, this structure is degraded by rewiring or removing links and it gets harder and harder for the algorithms to detect the original partition;
- at the end, the network reaches a point where it is essentially random;
- this kind of benchmark may be called "open";
- the performance on these benchmarks is evaluated by measures such as normalized mutual information or variation of information;
- they compare the solution obtained by an algorithm with the original community structure, evaluating the similarity of both partitions;