Community structure in social and biological networks - SergiuTripon/msc-thesis-na-epsrc GitHub Wiki

HomeLiterature Survey ▸ Community structure in social and biological networks


Year: 2002
Authors: M. Girvan, M. E. J. Newman
File: girvan_newman_2002.pdf


Contents

Authors propose a method for detecting community structure and apply it to the study of a number of different social and biological networks.

The ability to detect community structure in a network could clearly have practical applications.

  • communities in a social network might represent real social groupings, perhaps by interest or background;
  • communities in a citation network might represent related papers on a single topic;
  • communities in a metabolic network might represent cycles and other functional groupings;
  • communities on the web might represent pages on related topics

1. Detecting Community Structure

1.1 Hierarchical clustering

  • One first calculates a weight Wij for every pair i,j of vertices in the network, which represents in some sense how closely connected the vertices are;
  • Then, one takes the n vertices in the network, with no edges between them, and adds edges between pairs one by one in order of their weights, starting with the pair with the strongest weight and progressing to the weakest;
  • As edges are added, the resulting graph shows a nested set of increasingly large components (connected subsets of vertices), which are taken to be the communities;

Back to Top

2. Edge "Betweenness" and Community Structure

Instead of trying to construct a measure that tells us which edges are most central to communities, we focus instead on those edges that are least central, the edges that are most "between" communities.

Rather than constructing communities by adding the strongest edges to an initially empty vertex set, we construct them by progressively removing edges from the original graph.

  • edge betweenness of an edge is defined as the number of shortest paths between pairs of vertices that run along it;
  • if there is more than one shortest path between a pair of vertices, each path is given equal weight such that the total weight of all of the paths is unity;
  • if a network contains communities or groups that are only loosely connected by a few intergroup edges, then all shortest paths between different communities must go along one of these few edges;
  • thus, the edges connecting communities will have high edge betweenness;
  • by removing these edges, we separate groups from one another and so reveal the underlying community structure of the graph;

The algorithm proposed for identifying communities is simply stated as follows:

  1. Calculate the betweenness for all edges in the network;
  2. Remove the edge with the highest betweenness;
  3. Recalculate betweennesses for all edges affected by the removal;
  4. Repeat from step 2 until no edges remain;

The algorithm proposed has been implemented as follows:

  • uses the fast algorithm of Newman to calculate the betweennesses, which calculates betweenness for all m edges in a graph of n vertices in time O(mn);
  • because this calculation has to be repeated once for the removal of each edge, the entire algorithm runs in worst-case time O(m2n);
  • however, after the removal of each edge, we only have to recalculate the betweennesses of those edges that were affected by the removal, which is at most only those in the same component as the removed edge.
  • this means that running time may be better than worst-case for networks with strong community structure (those that rapidly break up into separate components after the first few iterations of the algorithm);

To try to reduce the running time of the algorithm further, one might be tempted to calculate the betweennesses of all edges only once and then remove them in order of decreasing betweenness.

We find, however, that this strategy does not work well, because if two communities are connected by more than one edge, then there is no guarantee that all of those edges will have high betweenness-we only know that at least one of them will.

By recalculating betweennesses after the removal of each edge we ensure that at least one of the remaining edges between two communities will always have a high value.

Back to Top

3. Tests of the Method

The authors present a number of tests of the algorithm on computer-generated graphs and on real-world networks for which the community structure is already known. In each case, they find that the algorithm reliably detects the known structure.

3.1 Computer-Generated Graphs

  • algorithm applied to a large set of artificial, computer-generated graphs;
  • each graph was constructed with 128 vertices divided into four communities of 32 vertices each;
  • edges were placed between vertex pairs independently at random, with probability Pin for vertices belonging to the same community and Pout for vertices in different communities, with Pout < Pin;
  • the probabilities were chosen so as to keep the average degree z of a vertex equal to 16;

The algorithm performs nearly perfectly when zout < 6, classifying 90% or more of the vertices correctly. Only for zout >= 6 does the fraction correctly classified start to fall off substantially. In other words, the algorithm performs very well almost to the point at which each vertex has as many intercommunity as intracommunity connections.

Back to Top


3.2 Zachary’s Karate Club Study

  • 34 members of a karate club were observed over a period of 2 years;
  • during the course of the study, a disagreement developed between the administrator of the club and the club’s instructor;
  • this ultimately resulted in the instructor’s leaving and starting a new club, taking about a half of the original club’s members with him;
  • Zachary constructed a network of friendships between members of the club, using a variety of measures to estimate the strength of ties between individuals;
  • this test used a simple unweighted version of his network and the algorithm was applied to it in an attempt to identify the factions involved in the split of club;

The most fundamental split in the network is the first one at the top of the tree, which divides the network into two groups of roughly equal size. This split corresponds almost perfectly with the actual division of the club members after the break-up, as revealed by which club they attended afterwards.

Only one node, node 3, is classified incorrectly. In other words, the application of the algorithm to the empirically observed network of friendships is a good predictor of the subsequent social evolution of the group.

For comparison we also have performed a traditional hierarchical clustering based on edge-independent paths for the karate club network. This method correctly identifies the core vertex sets {1,2,3} and {33,34} of the two communities, but otherwise there appears to be little correlation with the actual split of the club, indicating once again that this method is significantly more accurate and sensitive than the standard method.

Back to Top


3.3 College Football

  • network is a representation of the schedule of Division I games for the 2000 season;
  • vertices in the graph represent teams (identified by their college names);
  • edges represent regular-season games between the two teams they connect;
  • network incorporates a known community structure;
  • the teams are divided into conferences containing around 8–12 teams each. Games are more frequent between members of the same conference than between members of different conferences, with teams playing an average of about seven intraconference games and four interconference games in the 2000 season;
  • interconference play is not uniformly distributed;
  • teams that are geographically close to one another but belong to different conferences are more likely to play one another than teams separated by large geographic distances.

Applying the algorithm to this network, we find that it identifies the conference structure with a high degree of success.

Almost all teams are correctly grouped with the other teams in their conference. There are a few independent teams that do not belong to any conference-these tend to be grouped with the conference with which they are most closely associated.

The few cases in which the algorithm seems to fail actually correspond to nuances in the scheduling of games. For example, the Sunbelt Conference is broken into two pieces and grouped with members of the Western Athletic Conference. This happens because the Sunbelt teams played nearly as many games against Western Athletic teams as they did against teams in their own conference. They also played quite a large fraction of their interconference games against Mid-American teams.

Naturally, our algorithm fails in cases like this where the network structure genuinely does not correspond to the conference structure. In all other respects, however, it performs remarkably well.

Back to Top

4. Applications

4.1 Collaboration Network

  • the community-finding method applied to a collaboration network of scientists at the Santa Fe Institute, an interdisciplinary research center in Santa Fe, New Mexico;
  • 271 vertices in this network represent scientists in residence at the Santa Fe Institute during any part of calendar year 1999 or 2000 and their collaborators;
  • an edge is drawn between a pair of scientists if they co-authored one or more articles during the same time period;
  • the network includes all journal and book publications by the scientists involved, along with all papers that appeared in the institute’s technical reports series;
  • on average, each scientist coauthored articles with approximately five others;

We find that the algorithm splits the network into a few strong communities, with the divisions running principally along disciplinary lines. The community indicated by diamonds is the least well defined and represents a group of scientists using agent-based models to study problems in economics and traffic flow. The algorithm further divides this group into smaller components that correspond roughly with the split between economics and traffic.

Back to Top


4.2 Food Web

  • the algorithm was also applied to a food web of marine organisms living in the Chesapeake Bay, a large estuary on the east coast of the United States;
  • the network was originally compiled by Baird and Ulanowicz and contains 33 vertices representing the ecosystem’s most prominent taxa;
  • most taxa are represented at the species or genus level, although some vertices represent larger groups of related species;
  • edges between taxa indicate trophic relationships - one taxon feeding on another;
  • although relationships of this kind are inherently directed, we here ignore direction and consider the network to be undirected;

Applying our algorithm to this network, we find two well defined communities of roughly equal size, plus a small number of vertices that belong to neither community.

The split between the two large communities corresponds quite closely with the division between pelagic organisms (those that dwell principally near the surface or in the middle depths of the bay) and benthic organisms (those that dwell near the bottom).

Interestingly, the algorithm includes within each group organisms from a variety of different trophic levels. This finding contrasts with other techniques that have been used to analyze food webs, which tend to cluster taxa according to trophic level rather than habitat. Our results seem to imply that pelagic and benthic organisms in the Chesapeake Bay can be separated into reasonably self-contained ecological subsystems.

The separation is not perfect: a small number of benthic organisms find their way into the pelagic community, presumably indicating that these species play a substantial role in the food chains of their surface-dwelling colleagues. This finding suggests that the simple traditional division of taxa into pelagic or benthic may not be an ideal classification in this case.

Back to Top

⚠️ **GitHub.com Fallback** ⚠️