Troubleshooting - TeamCohen/ProPPR GitHub Wiki
If I have train.cfacts, and then after training, I have:
db(FactsPlugin,train.cfacts) 3.5
During testing, we only use test.cfacts in the program, and train.cfacts is not observed.
In this case, will the system understand that the db rule has a weight of 3.5?
Alas, no. If the output params file has
db(FactsPlugin,train.cfacts) 3.5
and you want to use that params file on a dataset with a different file in place of train.cfacts, you'll want to edit the params file. I usually duplicate the feature, so it has
db(FactsPlugin,train.cfacts) 3.5
db(FactsPlugin,test.cfacts) 3.5
and the params file can still be run on the original training set.
Here is a script that will duplicate the training feature to testing, as above, and print to stdout so you can check it:
$ sed '/train.cfacts/ {p;s/train/test/}' params.wts | less
This one will make the same change and save it back to params.wts:
$ sed -i '/train.cfacts/ {p;s/train/test/}' params.wts
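If sed isn't handy, the same edit can be sketched in a few lines of Python. The function name `duplicate_train_features` is made up for this example; it mirrors the sed command above, including replacing only the first occurrence of "train" on each matching line:

```python
def duplicate_train_features(lines):
    """For each params line naming train.cfacts, emit it followed by a copy
    with 'train' replaced by 'test' (first occurrence only, like sed's s///)."""
    out = []
    for line in lines:
        out.append(line)
        if "train.cfacts" in line:
            out.append(line.replace("train", "test", 1))
    return out

params = ["db(FactsPlugin,train.cfacts) 3.5"]
for line in duplicate_train_features(params):
    print(line)  # prints the train line, then its test.cfacts copy
```

Read your real params file into a list of lines, run it through this, and write the result back out once you've checked it.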
The reason ProPPR doesn't just automatically do this is that it learns a different weight for each database file. This lets you learn a different confidence level for different databases, for example if you keep a file for each provenance source, or a file for each relation. Inarguably useful, but it does make distinguishing between training and test database files awkward.
Why do I get this "Skipping duplicate fact at context_dep_bow.cfacts:4791862: hasCloseContext s_155766_15_part10 righttok=accompany"?
I have checked, and there is only one line of
hasCloseContext s_155766_15_part10 righttok=accompany
in the .cfacts file.
We use a bloom filter to check for duplicate facts, since unintentional duplicates can cause unexpected results (their weight in the graph gets doubled, or tripled, or...). Since ProPPR does this with a bloom filter and not an exact set-membership check, sometimes we can get false positives (hash collisions). The best thing to do in this case is just turn off duplicate checking.
You can turn off duplicate checking using --duplicateCheck -1
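To see why a Bloom filter can report a duplicate that isn't there, here is a minimal sketch (not ProPPR's actual implementation; the class and sizes are invented for illustration). With a deliberately tiny bit array, fresh facts can hash onto bits already set by earlier facts:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per item over an m-bit array."""
    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k positions from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # Always True for added items; may also be True for items never added.
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter(m=64, k=3)   # deliberately tiny, so collisions are likely
facts = [f"hasCloseContext s_{i} righttok=w{i}" for i in range(40)]
for fact in facts:            # every fact here is unique...
    if bf.might_contain(fact):
        print("Skipping duplicate fact (false positive):", fact)
    bf.add(fact)
```

A Bloom filter never gives a false negative (a real duplicate always triggers the check), which is why the trade-off is usually acceptable; when the false positives bother you, `--duplicateCheck -1` disables the check entirely.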
The following features are reserved for use by the ProPPR provers:
id(restart)
id(trueLoop)
id(trueLoopRestart)
id(alphaBooster)
QueryAnswerer and Grounder do similar things (prover-based inference) but for different purposes. QueryAnswerer lists all the solutions to a query, labelled or not. This output can be used for things like sampling negative examples and computing accuracy metrics. Grounder lists the proof graph for a query, including markers for positive-labeled and negative-labeled solution nodes. This output is generally only used for training. A query that doesn't hit any of the labeled solutions won't have any effect on the gradient at training time, so Grounder skips such queries. QueryAnswerer doesn't care about labels, so it does print output for these queries.
edu.cmu.ml.praprolog.prove.LogicProgramException: Error converting features of rule ...
This happens because ProPPR couldn't bind the variables used in the features of a rule. Remember that the variables in a feature must be bound in the head of the rule -- the body is too late.
Let's say we want to be able to call associatedWith(bear,X) and learn whether the word "bear" is associated with each of our categories, which will be substituted in for X. We have a facts file that lists all the categories under an isCategory predicate.
This rule won't work:
associatedWith(Word,Category) :- isCategory(Category) # f(Word,Category) .
...because Category isn't being bound until the isCategory lookup in the body of the rule.
To fix it: move the feature down a level.
associatedWith(Word,Category) :- isCategory(Category),learn(Word,Category) .
learn(Word,Category) :- # f(Word,Category) .
Now when we call associatedWith(bear,X), X is bound by isCategory, and then we call e.g. learn(bear,zoology) and can use those values for the feature f.
edu.cmu.ml.praprolog.prove.MinAlphaException: minAlpha too high! Did you remember to set alpha in logic program components? dpr minAlpha =0.1 localAlpha=0.08333333333333333 for state ...
This can happen if (1) you're using the default prover and (2) your graph has a fanout of 10 or more.
Background: alpha is the probability of reset, and minAlpha is the bound on that probability that makes our pagerank approximation possible. Unfortunately, if you get this error it means our pagerank approximation is no longer accurate, and any results you get are basically garbage.
The default minAlpha is 0.1, or 1/10. This means if any node in the graph has more than 10 outgoing edges, the reset probability is going to fall below 1/10, and violate the constraint.
To fix it: set minalpha according to your graph in your prover specification. If you know the maximum fanout of your graph (i.e. the maximum out-degree of a node), you can use 1/fanout for minalpha. Otherwise, look at the localAlpha part of the error message and pick a number lower than that for minalpha. In the example above, 0.08 would be fine. It's a good idea to use the same minalpha value for all your examples, so if you see multiple errors like this, base your new minalpha on the lowest localAlpha value.
To set minalpha on a dpr prover (the default), use --prover dpr:{epsilon}:{minalpha} on the command line. [epsilon is the error bound on the approximation; if you're not sure what to put there, use 1e-5]
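The rule of thumb above is simple enough to script. Here is a hedged sketch (the helper name and the example out-degrees are invented; substitute the out-degrees of your own grounded graph). Note that a node with 12 outgoing edges gives localAlpha = 1/12 ≈ 0.0833, which is exactly the value in the error message above:

```python
def safe_minalpha(out_degrees):
    """Rule of thumb: choose minalpha no larger than 1 / (maximum fanout)."""
    return 1.0 / max(out_degrees)

# Hypothetical per-node out-degrees; the max fanout here is 12.
out_degrees = [3, 7, 12, 5]
minalpha = safe_minalpha(out_degrees)
print(f"--prover dpr:1e-5:{minalpha}")
```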