Factorie - cllu/.rc GitHub Wiki

The factorie package written by Andrew MacCallum is a great tool for research on factor graph. However, like most package coming from research community, the codes are not well documented. This post will log down my exploration of this package.

A simple example

Well, not that simple under the ground. The Gaussian class is a DirectedFamily3 subclass. Each Gaussian variable depends on the mean variable and variance variable. This makes things pretty nifty. See the example in GaussianDemo.sacla:

implicit val model = DirectedModel()
implicit val random = new scala.util.Random(0)
val mean = new DoubleVariable(10)
val variance = new DoubleVariable(1.0)
val data = for (i<-1 to 1000) yield new DoubleVariable :~ Gaussian(mean, variance)

These three lines will automatically generate 1000 Gaussian variables, together with the factors. What that means is you can use the generated data like:

MaximizeGaussianMean(mean, model)

Factors

Factorie implements the factors in a very flexible way, taking advantage of scala's trait functionality. A factor should have three parts: the neighboring variables, the score and the statistics for the neighboring variables.

Take the factor Factor3 as example. A factor of type Factor3 has three neighbors, and can return a score(un-normalized) representing the compatibility of the current world. So one thing is how to compute the score from the three neighbors. Of course, you can directly write codes to do the computation. However, there are some common cases that can be abstracted.

  • If each neighbors can return a Tensor, then you can use TensorFactor3. Two methods need to be implemented: statistics return a Tensor based on the three tensors returned by three neighbors; and statisticsScore compute the final score based on the statistics
    • Based on the TensorFactor3, if the statistics can be computed as the outer product of the three tensors, then you can use TensorFactorStatistics3. Only statisticsScore need to be implemented.
    • Based on the TensorFactor3 again, if the statisticsScore can be computed as dot product between statistics and a given weights, then you can use DotFactor3. In this way, you need to define weights and statistics for the factor.
      • Combine the TensorFactorStatistics3 and DotFactor3, you get DotFactorWithStatistics3

Template

To facilitate the generation of factors, factorie have defined Template as factor template. Corresponding to Factor3, we have Family3 to generate Factor3. One difference here is we now have a TupleFamily3 trait.

  • TensorFamily3 --> TensorFactor3
    • TensorFamilyWithStatistics3 --> TensorFactor3
    • DotFamily3 --> DotFactor3
      • DotFamilyWithStatistics3 --> DotFactorWithStatistics3
  • TupleFamily3, here the statistics is defined as a tuple, whose types are three neighbors' Values.
    • TupleFamilyWithStatistics3 make things even further, the statistics's value will be three neighbors' value.

Use the factor

There are some requirements to define your desired factor graph. Suppose the node in you factor graph is of class Node, during the inference, some fields of the Node instance will change. These fields should be defined as a subclass of *Variable, like BooleanVariable. All these class inherits from trait MutableVar, which defines a function set(newValue:Value)(implicit d:DiffList). The sampling process relies on the DiffList to score proposals and do inference. So what is a DiffList?

DiffList is a list of Diff, which stores the variable it refers to and redo/undo functions. The DiffList tracks all the changes of variables in a proposal, thus we can compute the model score before the proposed changes and after the proposed changes. The DiffList class provides a scoreAndUndo(model:Model) methods, which will compute the model score with and without proposal, and return the score difference. Wait, why is the DiffList responsible for computing the model score and how does it compute? The Model class, which as you may guess is a collection a factors, knows which factor touch(contain) a given variable, and the Diff contains the pointer to the variable, so we can tell the Model to compute the score given all the variables involved in a DiffList.

Notice this raises a problem when creating factors using factor family. The Family class generate one or more Factors using the unroll method, based on the type of involved variable. Each neighbor has the corresponding unroll method, like unroll1, unroll2, unroll3 in a Family3 class. If any of the neighbor variable changes in a proposal, it will generate one factor. Some proposal may change multiple neighbor variables at the same time, thus generates the same factor multiple times. This is solved by defining the touching factors in a Model as a LinkedHashSet and implement the equals method for Factor. In this way, factorie ensures there is no duplicate factors for a proposal.