Home - othercriteria/StochasticBlockmodel GitHub Wiki

Much of the most interesting Big Data out there is relational. That is, it describes both a collection of entities and the connections between them. Graph theory gives us a natural language for talking about such data; entities are represented as nodes and their connections as edges. For example:

  • Bibliometrics: Co-authorship networks represent authors as nodes and place an undirected edge between authors who have collaborated on a paper. Citation networks represent papers as nodes and place a directed edge from each paper to every paper it cites.
  • Connectome: Neurons are represented as nodes, and presynaptic neurons send directed edges to postsynaptic neurons, mirroring the flow of information. A human connectome is a long way off, but we already have the C. elegans connectome.
  • Social networks: People (actors in sociological jargon) are represented as nodes. The edges between them (links) can represent anything from mere interaction to self-reported friendship to professional collaboration. The networks can be real or fictional.

As these examples suggest, there is more to relational data than just the collection of nodes and the collection of edges. We don't need to consult a co-authorship network to know that there is no edge between Paul Erdős and Leonhard Euler, no matter how prolific each was; we know the years in which each was writing papers, and we can check that they do not overlap. We may know the location of neurons in 2-d or 3-d coordinates, or, more coarsely, just to the Brodmann area. A grade school student social network might exhibit most of its edges between students who spend their day in the same classroom. Boys and girls might have different patterns of friendship.

I'll be calling data such as the active years of authors, the locations of neurons, the gender of students, etc., covariates (such data also get called metadata in some contexts, but this seems vaguely improper when applied to people). Covariates seem to enter naturally at the node level. It is difficult to come up with edge-level covariates that don't arise as a function of node-level covariates evaluated for every (directed) pair of nodes. A contrived counterexample would be a randomized edge-level intervention, e.g., I randomly pick pairs of my friends to introduce to each other.

Much of network analysis has focused on descriptive tasks: characterizing the largest connected component, calculating metrics like the diameter or average clustering coefficient, looking for power laws in the degree distribution, searching for motifs that appear more often than expected by chance, etc. However, the presence of covariates that at least partially explain the pattern of edges observed in a network opens up the possibility of prediction, with its associated inferential challenges.

Notation

Let Aij for 1 ≤ i, j ≤ N be the directed adjacency matrix of a network with N nodes. Explicitly, Aij equals 1 if there is an edge from node i to node j and equals 0 otherwise. Suppose there are B edge covariates, indexed as xijb for 1 ≤ b ≤ B. The notation xij is used to denote the B × 1 column vector of covariates associated with the edge from node i to node j. Abusing notation somewhat, I write xib to denote a node covariate used to produce an edge covariate via

[ x_{ijb} = f_b(x_{ib}, x_{jb}). ]
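For instance, here is a minimal NumPy sketch of this construction; the covariates and the choices of fb are invented for illustration:

```python
import numpy as np

# Hypothetical node covariates for a 4-node network
gender = np.array([0, 0, 1, 1])              # e.g., student gender
year = np.array([1735, 1745, 1930, 1950])    # e.g., midpoint of active years

# f_b as an indicator of agreement: x_ijb = 1 if x_ib == x_jb
same_gender = (gender[:, np.newaxis] == gender[np.newaxis, :]).astype(float)

# f_b as a (negated) absolute difference: close years raise the log-odds
year_gap = -np.abs(year[:, np.newaxis] - year[np.newaxis, :])

print(same_gender.shape)   # (4, 4): one N x N matrix per edge covariate
```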

The models I consider have the general form

[ A_{ij} \stackrel{\textrm{ind}}{\sim} \textrm{Bernoulli}(P_{ij}). ]

This might strike the reader as extremely restrictive compared to the exponential random graph models (ERGMs) more commonly associated with the analysis of network data. However, there are serious challenges in using ERGMs for prediction: a model fit on a subnetwork cannot be extrapolated to the entire network except under a dyadic independence assumption that is only slightly weaker than the one above.

I will generally suppose that Pij is generated by a model of the following form

[ \textrm{logit}\,P_{ij} = \Theta_{z_i,z_j} + \stackrel{\rightharpoonup}{\alpha}_i + \stackrel{\leftharpoonup}{\alpha}_j + \beta \cdot x_{ij} + \kappa, ]

where logit p = log(p/(1-p)) is the function that maps probabilities to log-odds and logit⁻¹(x) = 1/(1+exp(-x)) is its inverse. These parameters and variables have reasonably clear meanings (a small numerical sketch follows the list below).

  • Θcd for 1 ≤ c, d ≤ K describes block-level effects, where K is the number of blocks, and zi gives the block assignment of node i.
  • ( \stackrel{\rightharpoonup}{\alpha}_i ) describes the tendency of node i to send out edges; ( \stackrel{\leftharpoonup}{\alpha}_j ) describes the tendency of node j to receive edges. It is sometimes useful to think of these as unobserved/latent node covariates. The model is called stationary when ( \stackrel{\rightharpoonup}{\alpha} = \stackrel{\leftharpoonup}{\alpha} = 0 ) and nonstationary (or Rasch) otherwise. (I'll also use the notation αout and αin, respectively, for inline expressions.)
  • βb describes the effect of covariate b.
  • κ is the network level baseline propensity for edges to occur.
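Putting these pieces together, here is a minimal numerical sketch (NumPy; all parameter values are invented for illustration) of computing Pij and sampling A:

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 5, 1

Theta = np.zeros((N, N))                 # Theta_{z_i z_j}; zero here for simplicity
alpha_out = rng.normal(0, 0.5, size=N)   # per-node propensity to send edges
alpha_in = rng.normal(0, 0.5, size=N)    # per-node propensity to receive edges
x = rng.normal(size=(N, N, B))           # edge covariates
beta = np.array([1.0])                   # covariate effects
kappa = -1.0                             # network-level baseline log-odds

logit_P = (Theta
           + alpha_out[:, np.newaxis]
           + alpha_in[np.newaxis, :]
           + x @ beta
           + kappa)
P = 1.0 / (1.0 + np.exp(-logit_P))        # apply logit^{-1}

A = (rng.random((N, N)) < P).astype(int)  # independent Bernoulli(P_ij) draws
```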

From this point on, I'll assume that Θ = 0 and will not bother to specify z; that is, I'll assume there is no block structure present (which is somewhat unfortunate given the name of the project). Surprisingly, this is not much of a loss of generality. One simple method of learning the Θ ≠ 0 model is a stochastic EM procedure that alternates between sampling z conditioned on all the other parameters and maximizing those parameters by convex optimization, as in the sketch below.
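A toy version of that loop for the plain blockmodel (block effects only; a simplified stand-in for illustration, not the project's implementation):

```python
import numpy as np

def stochastic_em_sbm(A, K, n_iters=50, eps=1e-3, seed=0):
    """Alternate (1) closed-form maximization of block probabilities
    given z with (2) Gibbs-style resampling of z given those
    probabilities. Self-edges are handled sloppily for brevity."""
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    z = rng.integers(K, size=N)                  # random initial assignment
    P_block = np.full((K, K), 0.5)
    for _ in range(n_iters):
        # M-like step: the MLE of each block-pair probability is the
        # empirical edge density between the two blocks.
        for c in range(K):
            for d in range(K):
                mask = np.outer(z == c, z == d)
                if mask.any():
                    P_block[c, d] = np.clip(A[mask].mean(), eps, 1 - eps)
        # E-like step: resample each z_i from its conditional given z_{-i}.
        for i in range(N):
            ll = np.zeros(K)
            for c in range(K):
                p_out, p_in = P_block[c, z], P_block[z, c]
                ll[c] = (np.sum(A[i] * np.log(p_out)
                                + (1 - A[i]) * np.log(1 - p_out))
                         + np.sum(A[:, i] * np.log(p_in)
                                  + (1 - A[:, i]) * np.log(1 - p_in)))
            w = np.exp(ll - ll.max())
            z[i] = rng.choice(K, p=w / w.sum())
    return z, P_block

# Illustrative usage on a random graph
rng = np.random.default_rng(1)
A = (rng.random((30, 30)) < 0.1).astype(int)
z_hat, P_hat = stochastic_em_sbm(A, K=2, n_iters=20)
```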

Inference

Four approaches are considered: convex optimization, convex optimization with an analytic gradient, logistic regression, and regularized logistic regression. The memory costs of the two main families are sketched below.

Convex optimization

Requires O(N²) memory to represent Pij in the negative log-likelihood calculation. In principle, only O(1) memory is needed, but implementing this would require switching from vectorized to much slower iterative computation.
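A sketch of the vectorized objective (and the analytic gradient used by the "with gradient" variant) for the stationary logistic model, under an assumed parameter layout; this is illustrative, not the project's actual objective:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, A, x):
    """Stationary logistic model: logit P_ij = beta . x_ij + kappa.
    The dense N x N array of log-odds below is the O(N^2) memory cost."""
    B = x.shape[2]
    beta, kappa = params[:B], params[B]
    logit_P = x @ beta + kappa                    # dense N x N
    # sum of log(1 + exp(t)) - A * t, computed stably with logaddexp
    return np.sum(np.logaddexp(0.0, logit_P) - A * logit_P)

def gradient(params, A, x):
    """Analytic gradient of the negative log-likelihood."""
    B = x.shape[2]
    beta, kappa = params[:B], params[B]
    resid = 1.0 / (1.0 + np.exp(-(x @ beta + kappa))) - A   # P - A
    return np.append(np.tensordot(resid, x, axes=([0, 1], [0, 1])),
                     resid.sum())

# Illustrative fit on random data
rng = np.random.default_rng(0)
N, B = 100, 2
x = rng.normal(size=(N, N, B))
A = (rng.random((N, N)) < 0.2).astype(float)
fit = minimize(neg_log_likelihood, np.zeros(B + 1), args=(A, x), jac=gradient)
print(fit.x)   # estimated (beta, kappa)
```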

Logistic regression

Requires O(N²B) memory for the stationary model and O(N²(B + 2N)) memory for the nonstationary model to represent the design matrix (the extra 2N columns are indicators for the sender and receiver effects), making this impractical even for moderately large problems. It may be possible to bring the nonstationary case back to the stationary case's footprint by using a sparse representation of the design matrix (and assuming sparse covariates), but the available logistic regression packages don't support this.
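To see where the memory goes, here is a sketch of the flattened design matrix for the stationary model, using scikit-learn for the fit (illustrative; not the project's code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, B = 100, 2
x = rng.normal(size=(N, N, B))
A = (rng.random((N, N)) < 0.2).astype(int)

# Stationary model: one row per ordered pair (i, j), so the design
# matrix has N^2 rows and B columns -- O(N^2 B) memory when dense.
X = x.reshape(N * N, B)
y = A.ravel()
clf = LogisticRegression(C=1e6).fit(X, y)   # large C ~= unregularized
print(clf.coef_, clf.intercept_)            # roughly (beta, kappa)

# The nonstationary model needs 2N additional indicator columns, one per
# sender effect and one per receiver effect, hence the much larger cost.
```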

Design principles

My current goal is to scale these tools to big problems (N ∼ 10000) but not to Big problems (N > 100000, many dense covariates). This allows me to keep at least one dense N × N matrix in memory (at N = 10000, a matrix of 8-byte floats occupies 800 MB), which is essential for working directly with Pij, since Pij is dense even for stationary models with sparse covariates.

Module Organization

This is intended not as a full API specification but rather as a demonstration of just enough of the design to see the logic of the organization.

Network

The Network class in the Network module is responsible for representing the nodes, edges, and covariates of a network.

Network

Upon instantiation, a Network instance has data attributes N giving the network size, names giving the node names, and network giving the (sparse) adjacency data. It also has data attributes node_covariates and edge_covariates giving the mapping from covariate names to NodeCovariate and EdgeCovariate objects, respectively.

A Network instance is created empty and is typically populated with nodes and edges from a GEXF file with the network_from_file_gexf(path) method, from an edge list with the network_from_edges(edges) method, or from an IndependentBernoulli model instance with the generate(model, *opts) method.

The primary convenience a Network instance offers is the subnetwork(inds) method, which returns the subnetwork induced by an array of indices.

The methods new_node_covariate(name) and new_edge_covariate(name) return references to their respective newly created NodeCovariate and EdgeCovariate instances.
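Taken together, a typical session might look like this (a sketch based only on the method names above; the file path and argument details are assumed):

```python
import numpy as np
from Network import Network

net = Network()
net.network_from_file_gexf('data/collaboration.gexf')  # hypothetical file
print(net.N, net.names[:5])

# Attach a new node covariate, then take an induced subnetwork.
age = net.new_node_covariate('age')
sub = net.subnetwork(np.arange(100))
```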

Covariate

One reason for the apparent overengineering of this code is the heterogeneity of the sources of network data. Adjacency information may come from a different source than covariate information, so some care must be taken to ensure that a particular covariate component is associated with the right node. The approach taken here is to initialize NodeCovariate and EdgeCovariate instances with the array of node names from the Network they are attached to. The order of the internal representation of the covariate is then matched to this order. This introduces some overhead when a covariate is populated, but none during ordinary use.

NodeCovariate

NodeCovariate uses a dense internal representation, as this only requires O(N) memory. An instance is typically populated by the from_pairs(names, values) method, which takes parallel lists of names and covariate values.
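The alignment semantics might be sketched as follows (illustrative; the actual class does more):

```python
import numpy as np

class NodeCovariateSketch:
    """Dense node covariate keyed to a fixed node-name ordering."""
    def __init__(self, names):
        self.names = list(names)
        self.values = np.zeros(len(self.names))

    def from_pairs(self, names, values):
        # Match each supplied (name, value) pair to the network's node
        # order, so covariates and adjacency data stay aligned even
        # when they come from different sources.
        index = {name: i for i, name in enumerate(self.names)}
        for name, value in zip(names, values):
            if name in index:
                self.values[index[name]] = value

cov = NodeCovariateSketch(['alice', 'bob', 'carol'])
cov.from_pairs(['carol', 'alice'], [3.0, 1.0])
print(cov.values)   # [1. 0. 3.]
```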

EdgeCovariate

EdgeCovariate uses a sparse internal representation, as a dense one would require O(N²) memory.

Models

IndependentBernoulli

Stationary(IndependentBernoulli)

StationaryLogistic(Stationary)

StationaryLogisticMargins(StationaryLogistic)

NonstationaryLogistic(StationaryLogistic)
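The headings above trace the inheritance chain; as a skeleton (class names from this page; the one-line descriptions are my reading of the model section, not documented behavior):

```python
class IndependentBernoulli:
    """Base model: A_ij drawn independently as Bernoulli(P_ij)."""

class Stationary(IndependentBernoulli):
    """Restricts to the stationary case (no sender/receiver effects)."""

class StationaryLogistic(Stationary):
    """Adds covariate effects beta and baseline kappa via the logit link."""

class StationaryLogisticMargins(StationaryLogistic):
    """Presumably conditions on fixed degree margins (cf. BinaryMatrix)."""

class NonstationaryLogistic(StationaryLogistic):
    """Adds the sender and receiver effects alpha_out and alpha_in."""
```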

Experiment

RandomSubnetwork

BinaryMatrix

Although this is a utility module, it contains a fast approximate sampler for binary matrices with fixed margins, approximate_from_margins_weights(r, c, w), that may be of general interest.
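Assuming r and c are the target row and column margins and w an N × N weight matrix, usage would look something like this (a guess at the calling convention, not documented behavior):

```python
import numpy as np
from BinaryMatrix import approximate_from_margins_weights

r = np.array([2, 1, 1])     # target row sums (out-degrees)
c = np.array([1, 2, 1])     # target column sums (in-degrees)
w = np.ones((3, 3))         # uniform weights over admissible matrices

sample = approximate_from_margins_weights(r, c, w)
# The margins should hold exactly; the approximation is to the
# distribution over matrices, not to the margins themselves.
```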

Results

Applications

Simple example of the code in action

Consistency experiments

SCP Series network analysis
