> **Warning:** Converting this tutorial to use `proppr` macros instead of running gradients by hand seems to have lost us ~20 points of MAP on some runs. It seems to be sensitive to the SRW random seed. Regardless, the differences shake out if you use the iterated structure learner (`sl-train`) instead of the single-shot version. -katie
# Introduction to Structure Learning in ProPPR
In this tutorial you will learn:
- how to be Really Clever with ProPPR features
- `proppr` syntax for structure learning tasks
This tutorial assumes you are familiar with ProPPR terminology and file types. If you're brand new to ProPPR, you will want to look over the `textcat` and `labelprop` tutorials.
## Table of Contents
- Step 0 - Install ProPPR and set up your path
- Step 1 - Defining the task
- Step 2 - Building the database
  - Step 2.1 - Generating the base KB
  - Step 2.2 - Generating incomplete KBs
- Step 3 - Understanding the standard structure-learning rules file
- Step 4 - Using structure learning to generate new rules
- Step 5 - Using the generated rules to answer queries
- Concluding Remarks
## Step 0: Install ProPPR and set up your path
If you haven't already,

```
$ git clone git@github.com:TeamCohen/ProPPR.git
$ cd ProPPR
$ ant clean build
```

& then if you're using a new shell,

```
$ cd ProPPR
$ . init.sh
```
We'll be working out of the `tutorials/structureLearning` directory.
## Step 1: Defining the task
We'll be building a ProPPR dataset for automatically extracting structural patterns in a knowledge base, using the method described in Structure Learning via Parameter Learning (Wang, Mazaitis, & Cohen, CIKM'14). This tutorial will focus on a simple knowledge base completion task. Our database will describe a family tree where some relations have been removed. These relations will then be queried, with ProPPR recovering the data through other structural information in the knowledge base.
The particular way we express a structure learning task in ProPPR is a bit abstract if you've never encountered abductive second-order logic before (and perhaps even if you have), so we'll go slow.
## Step 2: Building the database
### Step 2.1: Generating the base KB
First, we'll generate the complete family record. We don't know which relations might be easier or harder to recover yet, so we'll leave that to a later script.
The database for this task is extremely simple: each fact expresses a relationship between two people in a family tree. We restrict our vocabulary to the 12 relations wife, husband, mother, father, daughter, son, sister, brother, aunt, uncle, niece, and nephew. We'll generate 2 families and use one for training and the other for testing.
Our training family looks like this:
```
aunt jennifer charlotte
aunt jennifer colin
aunt margaret charlotte
aunt margaret colin
brother arthur victoria
brother colin charlotte
brother james jennifer
daughter charlotte james
daughter charlotte victoria
daughter jennifer andrew
...
```
Our testing family looks like this:
```
aunt angela alfonso
aunt gina alfonso
aunt gina sophia
brother alfonso sophia
brother emilio lucia
brother marco angela
daughter angela pierro
daughter lucia maria
daughter lucia roberto
daughter sophia lucia
...
```
These relations should be read as "arg1 is the relation of arg2", so that Angela is the aunt of Alfonso and not the other way around.
We've put the facts for both families in the file `kinship.cfacts`.
Structure learning is a 2nd-order procedure, meaning that we want to reason over the relations themselves: what does "aunt" mean, and can we express it in terms of other relations? etc. To do this in ProPPR, we need to lift the family facts into the 2nd-order space. Thankfully, the `proppr` utility will do this for us:
```
$ proppr sl-lift kinship.cfacts kinship-2nd.cfacts
INFO:root:ProPPR v2
INFO:root:calling: python ${PROPPR}/scripts/proppr-helpers/structured-gradient.py --com lift --src kinship.cfacts --dst kinship-2nd.cfacts +
INFO:root:second-order version of facts from kinship.cfacts stored in kinship-2nd.cfacts
splitting by [] opts []
targetArgs []
otherArgs []
$ head kinship-2nd.cfacts
rel aunt jennifer charlotte
rel aunt jennifer colin
rel aunt margaret charlotte
rel aunt margaret colin
rel brother arthur victoria
rel brother colin charlotte
rel brother james jennifer
rel daughter charlotte james
rel daughter charlotte victoria
rel daughter jennifer andrew
```
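Under the hood, the lift is a purely mechanical rewrite: each fact gets a `rel` functor prepended, so the relation name becomes an ordinary argument. Here's a minimal sketch of the idea in Python (our own illustration, assuming tab-separated `.cfacts` lines; this is not the actual `sl-lift` implementation):

```python
# Sketch: lift 1st-order facts ("aunt<TAB>jennifer<TAB>charlotte")
# into 2nd-order facts ("rel<TAB>aunt<TAB>jennifer<TAB>charlotte").
def lift(src, dst):
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            line = line.strip()
            if line:  # skip blank lines
                fout.write("rel\t" + line + "\n")

lift("kinship.cfacts", "kinship-2nd.cfacts")
```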
We've split off the 2nd-order training and test facts into `kinship-train.cfacts` and `kinship-test.cfacts` respectively.
So what do the queries look like? For the training and testing examples, we'll include a query for each (relation,person) pair in the database, with positive labels for the correct facts. For negative labels, we take every person the query's person is related to (as first argument) by some relation other than the query's relation, and add those pairs to the negative set.
For example, let's generate the labels for a query which asks of whom Jennifer is the aunt. We'll use the `grep` utility to print out only those facts that Jennifer participates in, to make it easier to identify (by hand) just the facts for which Jennifer is the first argument.
$ grep "jennifer" kinship-train.cfacts
rel aunt jennifer charlotte
rel aunt jennifer colin
rel brother james jennifer
rel daughter jennifer andrew
rel daughter jennifer christine
rel father andrew jennifer
rel husband charles jennifer
rel mother christine jennifer
rel nephew colin jennifer
rel niece charlotte jennifer
rel wife jennifer charles
The positive labels, then, should be for Y=charlotte and colin, since Jennifer is their aunt. The negative labels should be for Y=andrew, christine, and charles, since Jennifer is something to them, but not their aunt.
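If you wanted to script this labeling step yourself, the logic might look like the following sketch (our own illustration, not part of ProPPR; it assumes the tab-separated 2nd-order facts above and the 12-relation vocabulary):

```python
from collections import defaultdict

RELATIONS = ["wife", "husband", "mother", "father", "daughter", "son",
             "sister", "brother", "aunt", "uncle", "niece", "nephew"]

# facts[x] = set of (relation, y) pairs in which x is the first argument
facts = defaultdict(set)
with open("kinship-train.cfacts") as f:
    for line in f:
        _, rel, x, y = line.strip().split("\t")
        facts[x].add((rel, y))

# One query per (relation, person): positives are the true facts, and
# negatives are the people x relates to by some *other* relation.
with open("kinship-train.examples", "w") as out:
    for x in sorted(facts):
        for rel in RELATIONS:
            pos = sorted(y for r, y in facts[x] if r == rel)
            neg = sorted(set(y for r, y in facts[x] if r != rel) - set(pos))
            labels = ["+interp(i_%s,%s,%s)" % (rel, x, y) for y in pos] \
                   + ["-interp(i_%s,%s,%s)" % (rel, x, y) for y in neg]
            out.write("interp(i_%s,%s,Y)\t%s\n" % (rel, x, "\t".join(labels)))
```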
If we proceed that way for every (relation,person) pair in the training set, we end up with the file `kinship-train.examples`. The lines are a bit long for a true sample, but hopefully you get the idea:
```
interp(i_aunt,andrew,Y) -interp(i_aunt,andrew,christine) -interp(i_aunt,andrew,james)...
interp(i_aunt,arthur,Y) -interp(i_aunt,arthur,charlotte) -interp(i_aunt,arthur,penelope)...
interp(i_aunt,charles,Y) -interp(i_aunt,charles,charlotte) -interp(i_aunt,charles,jennifer)...
interp(i_aunt,charlotte,Y) -interp(i_aunt,charlotte,james) -interp(i_aunt,charlotte,charles)...
interp(i_aunt,christine,Y) -interp(i_aunt,christine,james) -interp(i_aunt,christine,jennifer)...
interp(i_aunt,christopher,Y) -interp(i_aunt,christopher,arthur)...
interp(i_aunt,colin,Y) -interp(i_aunt,colin,charlotte) -interp(i_aunt,colin,james)...
interp(i_aunt,james,Y) -interp(i_aunt,james,charlotte) -interp(i_aunt,james,christine)...
interp(i_aunt,jennifer,Y) +interp(i_aunt,jennifer,charlotte) +interp(i_aunt,jennifer,colin)...
interp(i_aunt,margaret,Y) +interp(i_aunt,margaret,charlotte) +interp(i_aunt,margaret,colin)...
...
```
First, you'll notice that instead of the raw relation names like `aunt` we've used extended names like `i_aunt`. This is to make it easier to trace how learned structures progress later on.
Second, many of the queries have no positive examples; some of these cases are common gender-exclusions (Charles is no one's wife) and some are relations that just happen to be empty for this set (Colin has no children). Older versions of ProPPR had a problem with queries without both positive and negative examples; support for singly-flavored labeled examples started with v2.0.
We perform the same procedure for the test family, resulting in the file `kinship-test.examples`.
### Step 2.2: Generating incomplete KBs
We happen to know from experiments that competing tools are particularly bad at recovering complementary pairs of relations when both are missing, such as husband/wife, sister/brother, etc. For this tutorial, we will remove husband/wife; you can easily repeat the experiment for other pairs (or any relation or set of relations).
First we can generate database files which are missing the husband/wife information:
```
$ grep -v -e "husband" -e "wife" kinship-train.cfacts > k_spouse-train.cfacts
$ grep -v -e "husband" -e "wife" kinship-test.cfacts > k_spouse-test.cfacts
```
Then we'll generate examples files which include only the husband/wife queries:
```
$ grep -e "husband" -e "wife" kinship-train.examples > k_spouse-train.examples
$ grep -e "husband" -e "wife" kinship-test.examples > k_spouse-test.examples
```
## Step 3: Understanding the standard structure-learning rules file
We're going to use an abductive program to learn alternate pathways through the KB that express the missing husband/wife relationship. To make that work, the features of our logic program must themselves represent first-order clauses. During the learning step, we will use the gradient of each of those features as an indicator of which clauses are the most useful in expressing the missing relationships.
We support three first-order clauses in the features of our program:
- `if(P,R)`, meaning `P(X,Y) :- R(X,Y)`
- `ifInv(P,R)`, meaning `P(X,Y) :- R(Y,X)`
- `chain(P,R1,R2)`, meaning `P(X,Y) :- R1(X,Z),R2(Z,Y)`
Each of these clauses corresponds to a pair of rules for answering a query of the form `interp(someP,someX,Yunknown)`:
```
interp(P,X,Y) :- rel(R,X,Y), abduce_if(P,R) {fixedWeight}.
abduce_if(P,R) :- { if(P,R) }.
interp(P,X,Y) :- rel(R,Y,X), abduce_ifInv(P,R) {fixedWeight}.
abduce_ifInv(P,R) :- { ifInv(P,R) }.
interp(P,X,Y) :- rel(R1,X,Z), rel(R2,Z,Y), abduce_chain(P,R1,R2) {fixedWeight}.
abduce_chain(P,R1,R2) :- { chain(P,R1,R2) }.
```
Here we're using the `fixedWeight` feature keyword to tell ProPPR not to train the links generated by the `interp` rules.
The general procedure for each pair is to use `rel` lookups to find the possible values for `Y`, that is, anyone directly related to `X` regardless of the relationship. Then we keep the relationship value(s) `R` and train whether that clause successfully simulates `P`.
These six rules have been stored in the file `sg-interp-train.ppr`. The `proppr` utility will automatically give you a copy when you use the structure-learning toolkit.
## Step 4: Using structure learning to generate new rules
Next, we want to run the program above, and convert the top-ranked features to rules. This used to be a fussy and complicated process, but now we have a macro that does it for you. `sl-train1` will make sure you have a copy of `sg-interp-train.ppr`, compile the program, ground the queries, compute the gradient, take the features with the largest gradient, and output a new rules file. Whew!
```
$ proppr sl-train1 k_spouse-train.examples k_spouse-train.cfacts
INFO:root:ProPPR v2
INFO:root:calling: python ${PROPPR}/scripts/proppr-helpers/structured-gradient.py --com sg-train --src k_spouse-train.examples --dst k_spouse-train-learned1.ppr --src2 k_spouse-train.cfacts --stem k_spouse-train +
INFO:root:copied ${PROPPR}/scripts/proppr-helpers/sg-interp-train.ppr to current directory
INFO:root:calling: ${PROPPR}/scripts/proppr compile sg-interp-train.ppr
INFO:root:ProPPR v2
INFO:root:subprocess call options: {'stdout': <open file 'sg-interp-train.wam', mode 'w' at 0x7f47e9a3e150>}
INFO:root:calling: python ${PROPPR}/src/scripts/compiler.py serialize sg-interp-train.ppr
INFO:root:compiled sg-interp-train.ppr to sg-interp-train.wam
INFO:root:calling: ${PROPPR}/scripts/proppr ground k_spouse-train.examples k_spouse-train.examples.grounded --programFiles sg-interp-train.wam:k_spouse-train.cfacts --ternaryIndex true
INFO:root:ProPPR v2
INFO:root:calling: java -cp .:${PROPPR}/conf/:${PROPPR}/bin:${PROPPR}/lib/* edu.cmu.ml.proppr.Grounder --queries k_spouse-train.examples --grounded k_spouse-train.examples.grounded --ternaryIndex true --programFiles sg-interp-train.wam:k_spouse-train.cfacts --fixedWeight fixedWeight
Unrecognized option: --fixedWeight
INFO [Grounder] Resetting grounding statistics...
...
```
When it's done, the resulting generated rules will be in a file called `k_spouse-train-learned1.ppr`:
```
$ head k_spouse-train-learned1.ppr
interp(i_wife,X,Y) :- rel(mother,X,Z), rel(daughter,Z,Y) {lr_chain(i_wife,mother,daughter)}.
interp(i_husband,X,Y) :- rel(father,X,Z), rel(son,Z,Y) {lr_chain(i_husband,father,son)}.
interp(i_husband,X,Y) :- rel(father,X,Z), rel(daughter,Z,Y) {lr_chain(i_husband,father,daughter)}.
interp(i_husband,X,Y) :- rel(uncle,X,Z), rel(niece,Z,Y) {lr_chain(i_husband,uncle,niece)}.
interp(i_wife,X,Y) :- rel(aunt,X,Z), rel(niece,Z,Y) {lr_chain(i_wife,aunt,niece)}.
interp(i_wife,X,Y) :- rel(aunt,X,Z), rel(nephew,Z,Y) {lr_chain(i_wife,aunt,nephew)}.
interp(i_husband,X,Y) :- rel(uncle,X,Z), rel(nephew,Z,Y) {lr_chain(i_husband,uncle,nephew)}.
interp(i_wife,X,Y) :- rel(mother,X,Z), rel(son,Z,Y) {lr_chain(i_wife,mother,son)}.
interp(i_husband,X,Y) :- rel(mother,X,Z), rel(aunt,Z,Y) {lr_chain(i_husband,mother,aunt)}.
interp(i_wife,X,Y) :- rel(father,X,Z), rel(aunt,Z,Y) {lr_chain(i_wife,father,aunt)}.
```
If we take a look at the first generated rule:

```
interp(i_wife,X,Y) :- rel(mother,X,Z), rel(daughter,Z,Y) {lr_chain(i_wife,mother,daughter)}.
```
It should be read as, "Interpret X to be the 'wife' of Y if X is the mother of Y's daughter," which is pretty good!
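The translation from a high-gradient feature back to a first-order rule is entirely mechanical, since each feature encodes one of the three clause templates from Step 3. Here's a sketch of that conversion (a hypothetical helper of our own, not ProPPR's actual code):

```python
import re

# Map an abduced feature back to the rule it stands for, e.g.
#   chain(i_wife,mother,daughter)  ->
#   interp(i_wife,X,Y) :- rel(mother,X,Z), rel(daughter,Z,Y)
#       {lr_chain(i_wife,mother,daughter)}.
def feature_to_rule(feature):
    kind, args = re.match(r"(\w+)\((.*)\)", feature).groups()
    args = args.split(",")
    if kind == "if":        # P(X,Y) :- R(X,Y)
        body = "rel(%s,X,Y)" % args[1]
    elif kind == "ifInv":   # P(X,Y) :- R(Y,X)
        body = "rel(%s,Y,X)" % args[1]
    else:                   # chain: P(X,Y) :- R1(X,Z),R2(Z,Y)
        body = "rel(%s,X,Z), rel(%s,Z,Y)" % (args[1], args[2])
    return "interp(%s,X,Y) :- %s {lr_%s}." % (args[0], body, feature)

print(feature_to_rule("chain(i_wife,mother,daughter)"))
# interp(i_wife,X,Y) :- rel(mother,X,Z), rel(daughter,Z,Y) {lr_chain(i_wife,mother,daughter)}.
```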
## Step 5: Using the generated rules to answer queries
Now let's see how well our generated rules work in practice. `proppr` has another macro for this, which makes sure the program is compiled and takes care of some edge case rules:
```
$ proppr sl-answer k_spouse-train.examples k_spouse-train.cfacts k_spouse-train-learned1.ppr
INFO:root:ProPPR v2
INFO:root:calling: ${PROPPR}/scripts/proppr compile k_spouse-train-learned1.ppr
INFO:root:ProPPR v2
INFO:root:subprocess call options: {'stdout': <open file 'k_spouse-train-learned1.wam', mode 'w' at 0x7fb1b9b6b150>}
INFO:root:calling: python ${PROPPR}/src/scripts/compiler.py serialize k_spouse-train-learned1.ppr
INFO:root:compiled k_spouse-train-learned1.ppr to k_spouse-train-learned1.wam
INFO:root:calling: ${PROPPR}/scripts/proppr answer k_spouse-train.examples k_spouse-train.solutions.txt --programFiles k_spouse-train-learned1.wam:k_spouse-train.cfacts --ternaryIndex true
INFO:root:ProPPR v2
INFO:root:calling: java -cp .:${PROPPR}/conf/:${PROPPR}/bin:${PROPPR}/lib/* edu.cmu.ml.proppr.QueryAnswerer --queries k_spouse-train.examples --solutions k_spouse-train.solutions.txt --ternaryIndex true --programFiles k_spouse-train-learned1.wam:k_spouse-train.cfacts --fixedWeight fixedWeight
Unrecognized option: --fixedWeight
edu.cmu.ml.proppr.QueryAnswerer.QueryAnswererConfiguration
	queries file: k_spouse-train.examples
	solutions file: k_spouse-train.solutions.txt
	Duplicate checking: up to 1000000
	threads: -1
	Prover: edu.cmu.ml.proppr.prove.DprProver
	Squashing function: edu.cmu.ml.proppr.learn.tools.ClippedExp
	APR Alpha: 0.1
	APR Epsilon: 1.0E-4
	APR Depth: 20
INFO [QueryAnswerer] Running queries from k_spouse-train.examples; saving results to k_spouse-train.solutions.txt
INFO [QueryAnswerer] Executing Multithreading job: streamer: edu.cmu.ml.proppr.QueryAnswerer.QueryStreamer transformer: null throttle: -1
INFO [QueryAnswerer] Total items: 24
Query-answering time: 175
INFO:root:answers in k_spouse-train.solutions.txt
invokeMyself {'dryRun': False, 'out': ''} ('compile', 'k_spouse-train-learned1.ppr')
invokeMyself {'dryRun': False, 'out': ''} ('answer', 'k_spouse-train.examples', 'k_spouse-train.solutions.txt', '--programFiles', 'k_spouse-train-learned1.wam:k_spouse-train.cfacts', '--ternaryIndex', 'true')
$ proppr eval k_spouse-train
INFO:root:ProPPR v2
INFO:root:calling: python ${PROPPR}/scripts/answermetrics.py --data k_spouse-train.examples --answers k_spouse-train.solutions.txt --metric map
queries 24 answers 52 labeled answers 10
==============================================================================
metric map (MAP): The average precision after each relevant solution is retrieved
. micro: 1.0
. macro: 0.946781305115
```
Not bad.
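For reference, the macro MAP reported above is just the mean, over queries, of each query's average precision. A quick sketch of average precision for a single ranked answer list, to pin down the definition:

```python
# Average precision for one ranked answer list: at each relevant
# (positively labeled) answer, take the precision so far, then average.
def average_precision(ranked_relevance):
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / float(rank)
    return total / hits if hits else 0.0

# e.g. a query whose 1st and 3rd ranked answers are labeled correct:
print(average_precision([True, False, True]))  # (1/1 + 2/3) / 2 ~= 0.83
```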
## Concluding Remarks
There's a lot more you can do with structure learning and ProPPR. The paper covers an iterative version, where you feed the generated rules back into the structure learner program and compute another gradient after training for an additional epoch. You can repeat this process, always training the gradient finder for one epoch further than last time, until the gradient doesn't return any more qualifying features. The iterated structure gradient is necessary for more complex datasets than our toy kinship one, and is covered by a separate macro, `sl-train` (as opposed to `sl-train1`).
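In outline, that loop looks something like the sketch below. Both helper functions are hypothetical stand-ins for what the macro does internally (compiling, grounding, and running the gradient step); this is only meant to make the control flow concrete, not to show a real ProPPR API:

```python
def compute_structured_gradient(rules, epochs):
    # Hypothetical stand-in: fold `rules` back into the abductive
    # program, train for `epochs` epochs, and return a mapping from
    # abduced features to their gradients.
    return {}

def extract_rules(gradient):
    # Hypothetical stand-in: keep the features with the largest
    # gradients and convert them to rules (cf. feature_to_rule above).
    return []

rules, epochs = [], 1
while True:
    new_rules = extract_rules(compute_structured_gradient(rules, epochs))
    if not new_rules:
        break  # the gradient returned no more qualifying features
    rules.extend(new_rules)
    epochs += 1  # train one epoch further than last time
```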
You can also take the generated program, train it on the training data using `sl-tune`, and evaluate it on the testing examples. This lets ProPPR determine which generated rules are the most reliable.
If you're at CMU, you can see an example of both of these techniques in the structureLearning regression test. Ask Katie for the path to the group share.