# API Use Cases
## Table of Contents

- Incorporate statically-generated weights in facts, features
- Adapt duplicate facts checker to larger databases
- Write outputs to disk as they are finished, not as they were read
- Trace feature weight behavior during training
## Use Cases
### Incorporate statically-generated weights in facts, features
This is useful for tasks like text categorization, where you want ProPPR to respond to a statically-computed score such as TF-IDF on the link between a document and a word.
Here's an example, which you can also see in `examples/walkthru/textcat`:

```
# Simple program for text classification, illustrating how to attach a
# classifier to a ProPPR rule.
# Pick a label Y for X, and decide if it's a good classification by
# calling ab_classify.
predict(X,Y) :- isLabel(Y), ab_classify(X,Y).
# "abduce" (i.e., guess at) a classification Y for document X. The
# antecedent of the rule is empty, so it always succeeds, but the
# weight for this rule will be based on features generated by the
# annotation { f(W,Y): hasWord(X,W) } -- ie, the words in document X
# will be paired with a label, and used as features. Note that the
# weight of this rule will compete with the weight of the implicit
# 'reset' rule.
ab_classify(X,Y) :- { f#(Word,Y,Weight): hasWord#(X,Word,Weight) }.
```

The `#` denotes a weighted relation. In a weighted relation, the last argument stands in for the weight of the fact and does not count toward the final arity, so your learned-params file will contain a single `f(house,pos)` rather than a thousand entries like `f(house,pos,23.12344356)` with different weights.
The weights are specified in a db file. In this case, it's a graph file:
```
$ cat toycorpus.graph
hasWord train00001 a 0.0589
hasWord train00001 house 0.5005
hasWord train00001 pricy 0.5005
hasWord train00001 doll 0.7040
hasWord train00002 a 0.0510
hasWord train00002 fire 0.6097
hasWord train00002 little 0.4334
hasWord train00002 truck 0.6097
...
```
ProPPR knows that the last element on each line must be a weight, since `.graph` files only support arity-2 relations.

You can also specify weights on facts of arbitrary arity in a `.facts` file, but you have to tell ProPPR explicitly that the last element is a weight and not part of the relation. For simplicity, we use the same syntax as in the rules file, adding a `#` to the end of the functor:
```
$ cat toycorpus.facts
hasWord# train00001 a 0.0589
hasWord# train00001 house 0.5005
hasWord# train00001 pricy 0.5005
hasWord# train00001 doll 0.7040
hasWord# train00002 a 0.0510
hasWord# train00002 fire 0.6097
hasWord# train00002 little 0.4334
hasWord# train00002 truck 0.6097
...
```
Weights are supported in the facts, graph, and sparsegraph plugins as of commit ce79246cd.
### Adapt duplicate facts checker to larger databases
Duplicate lines in a fact file show up in proof graphs as edges with increased weight, which is usually not what you meant. By default, FactsPlugin (`*.facts`) and GraphPlugin (`*.graph`) add each fact to a Bloom filter with a false-positive probability of 0.00001 and an estimated size of 1,000,000. This lets them cheaply check whether a fact has been seen before, without storing each fact in a lookup-friendly format. If a duplicate fact is detected, the plugin prints an error message and skips the fact. In theory, this means a false positive can result in an incomplete database, but in practice we've yet to run into this issue.
If you load more than 1M facts, ProPPR prints a message warning you about the increased likelihood of false positives. You can grow the Bloom filter to better accommodate a large fact database using the `--duplicateCheck` option: just set it to the number of lines in your fact file. Turn off duplicate checking entirely by setting `--duplicateCheck -1`.
### Write outputs to disk as they are finished, not as they were read
By default, QueryAnswerer and Grounder write output files so that the nth example in the query file is the nth example in the solutions/grounded file. If your examples vary in complexity, this can leave a pile of finished examples sitting in memory while they wait for a slow example to finish. For example, if query 1 is very slow to compute and queries 2-16 are very fast, a 16-thread run will begin work on all 16 examples; 15 of the threads will finish and move on to queries later in the file, but nothing will be written to disk until query 1 finishes. If you don't need your output to be in the same order as the query file, you can save memory by enabling reordering.
Command line option:

```
--order reorder
```
How it works:

The multithreading harness uses the Futures pattern to manage tasks. Each query generates a "transformer" task and a "cleanup" task, where the output of the transformer is the input to the cleanup. The transformer pool uses the number of threads specified by `--threads`, and the cleanup pool uses only one thread, since letting multiple threads write to disk is problematic. Java guarantees that tasks will be picked up in the order in which they were added to the pool. By default, the cleanup task blocks until its transformer has finished. With `--order reorder`, the cleanup task waits a maximum of 20ms for the transformer to finish; if it hasn't, it resubmits the cleanup job, placing it at the end of the queue. This frees the cleanup thread to proceed to the next example.
### Trace feature weight behavior during training
Command line option:

```
--srw "traceFeature=db(LightweightGraphPlugin,webkb.graph)[:...]"
```
log4j spec:

```
log4j.logger.edu.cmu.ml.proppr.learn.SRW=INFO
log4j.logger.edu.cmu.ml.proppr.learn.PosNegLossTrainedSRW=INFO
```
Example log4j appender:

```
log4j.appender.consoleout.layout.ConversionPattern=%5p [%c{1}] %t %m%n
```

where `%5p` is the log level, `%c` the class, `%t` the thread, `%m` the message, and `%n` a newline.
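Putting it together, a training run with tracing enabled might look like the sketch below. The `--train` and `--params` flag names and the `conf/` directory holding `log4j.properties` are assumptions; the traced feature matches the sample output that follows.

```
$ java -cp proppr.jar:conf/ edu.cmu.ml.proppr.Trainer \
      --train train_no_cornell.data.grounded \
      --params cornell.params \
      --srw "traceFeature=db(LightweightGraphPlugin,webkb.graph)" \
      --threads 16
```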
Output syntax:

- `reg`: the regularization component of the gradient for the traced feature.
  ```
  INFO [PosNegLossTrainedSRW] transformer-1 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
  ```
- `logP`: the positive-label log-loss component of the gradient for the traced feature.
  ```
  INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logP -7.394501070610791E-4 / 0.022347119637303438 = -0.0330892803664386
  ```
- `logN`: the negative-label log-loss component of the gradient for the traced feature.
  ```
  INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 7.389713595330534E-4 / 0.9776673487199024 = 7.558515281303167E-4
  ```
- `was`: the old parameter value for the traced feature.
  ```
  INFO [SRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) was 1.0060924029682254
  ```
- `+=`: the total gradient and the new parameter value for the traced feature.
  ```
  INFO [SRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) += 0.02653948565577576 = 1.0326318886240011
  ```
Sample output:
```
INFO [Trainer] main
edu.cmu.ml.proppr.util.ModuleConfiguration
queries file: train_no_cornell.data.grounded
params file: cornell.params
Walker: edu.cmu.ml.proppr.learn.L2PosNegLossTrainedSRW
Weighting Scheme: edu.cmu.ml.proppr.learn.tools.ReLUWeightingScheme
Alpha: 0.1
Epsilon: 1.0E-4
Max depth: 5
Strategy: exception
INFO [Trainer] main Training model parameters on train_no_cornell.data.grounded...
INFO [Trainer] main epoch 1 ...
INFO [PosNegLossTrainedSRW] transformer-1 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-2 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-3 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-4 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-5 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-7 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-8 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-9 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-11 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-10 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-12 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-14 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-16 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-15 trace db(LightweightGraphPlugin,webkb.graph) reg 0.002012184805936451
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logP -7.394501070610791E-4 / 0.022347119637303438 = -0.0330892803664386
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 7.389713595330534E-4 / 0.9776673487199024 = 7.558515281303167E-4
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 7.395496387465339E-4 / 0.9776498723889857 = 7.56456538923664E-4
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 7.391368182232194E-4 / 0.9776623483485334 = 7.560246331177311E-4
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 7.393253306708433E-4 / 0.9776566512633886 = 7.562218593976129E-4
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 7.395527564385326E-4 / 0.9776497781683812 = 7.564598007929574E-4
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 7.396824994997896E-4 / 0.9776458571685945 = 7.565955443641099E-4
INFO [SRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) was 1.0060924029682254
INFO [SRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) += 0.02653948565577576 = 1.0326318886240011
INFO [PosNegLossTrainedSRW] transformer-13 trace db(LightweightGraphPlugin,webkb.graph) reg 8.571428571428571E-4
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) logP -6.585123213126203E-4 / 0.022450301786658716 = -0.029332003087100898
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 6.582897112534067E-4 / 0.9775573496978436 = 6.734026514729593E-4
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 6.566761957809852E-4 / 0.9775676082696231 = 6.71745043745217E-4
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 6.580624930792297E-4 / 0.9775542233663602 = 6.731723697260384E-4
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 6.580843757838968E-4 / 0.9775641812563434 = 6.731878974310839E-4
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 6.565016397595402E-4 / 0.9775735812671842 = 6.715623788733601E-4
INFO [PosNegLossTrainedSRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) logN 1.0 * 6.567307835602201E-4 / 0.9775657452003786 = 6.718021644934022E-4
INFO [SRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) was 0.42857142857142855
INFO [SRW] transformer-6 trace db(LightweightGraphPlugin,webkb.graph) += 0.02328494577542239 = 0.45185637434685094
```