datalog_prov - IITDBGroup/gprom GitHub Wiki

Datalog Provenance

GProM supports the generation of provenance graphs that explain why a tuple exists / is missing in the result of a Datalog query. Explanations for Datalog and how to run scripts are available at the lang_datalog page (https://github.com/IITDBGroup/gprom/wiki/lang_datalog). In the following, we show some example programs with provenance requests (more in the examples/datalogAndProvGraphs directory), and corresponding provenance graphs (more in the examples/datalogAndProvGraphs/graphs) over the example relation HOP.

Graph Types

  • Provenance Graph justifies the existence (absence) of a result as the success (failure) to derive the result through the rules of the query. Furthermore, it explains how the existence (absence) of tuples in the database caused the derivation to succeed (fail). There are three types of nodes in the graph: rule nodes (boxes labeled with a rule identifier and the constant arguments of a rule derivation), goal nodes (rounded boxes labeled with a rule identifier and the goal’s position in the rule’s body), and tuple nodes (ovals). The nodes are either colored as green (successful/existing) or red (failed/missing).

  • Provenance Game models the evaluation of a given query (input program) P over an instance I as a 2-player game in a way that resembles SLD(NF) resolution. If the position (a node in the game graph) corresponding to a tuple t is won (the player starting in this position has a winning strategy), then t is in the result of P and if the position is lost, then t is not in the result. There are four types of nodes: rule nodes (boxes labeled with a rule identifier and the constant arguments of a rule derivation), goal nodes (rounded boxes labeled with a rule identifier and the goal’s position in the rule’s body), IDB tuple nodes (ovals), and EDB tuple nodes (boxes labeled with EDB relation and constant of a tuple). More detailed information about provenance games can be seen from our technical report (http://cs.iit.edu/%7edbgroup/pdfpubls/LW15a.pdf).

Example Relation HOP

Examples for Why Provenance



The explanation is shown as a provenance game on the left, which explains c can be reachable from a indirectly, but not directly. The person claiming this (the top most node) has a winning strategy through a successful rule derivation, e.g., b allows both first goal (first derivation from the right side in the graph) and second goal (first one from the left) are successful as corresponding tuples (a,b) and (b,c) exist in the database. The third goal in the example program is also successful since no direct connection from a to c (the one in the middle) exists in the database.



The provenance graph on the left is shown as an explanation for the input program, i.e., c can be reached from a through a (self-loop) and b as 2 intermediate values. Every node in the graph is in green, because only a successful derivation exists and tuples to explain this derivation (the leaf nodes with (a,a), (a,b), and (b,c)) is in the database.




The explanation is shown as provenance graph on the left which explains why a can be together with a and b, but not with c in the result. Since the query for the provenance question, called answer relation (e.g., XwithYnotZ), contains a sub-query (the IDB head predicate of another Datalog query, i.e., Q in this example), the graph has another derivation under Q (in this example, the derivation is failed as Q is negated in the query for the answer relation), which explains why there is no connection from a to c (colored in red as no tuple (a,c) exists).


Examples for Why-Not Questions



The provenance game shown on the left explains why a cannot be reached from c within 2 hops. The game presents all the possible ways of failed rule derivations, e.g., there exist a failed rule derivation with b (the first derivation from the right in the graph) since no tuple (c,b) exists in the database. The failed rule nodes are still in green, because the person claiming this failed derivation would win the game over this query evaluation.


The explanation represented as a provenance graph above justifies the query evaluation, i.e., why c cannot reach a within 3 hops. All the failed derivations are part of the graph and nodes are colored in red as failed. For example, the very first rule derivation from the left shows it is one of the ways to be failed since there do not exist tuples (c,b) and (b,b) (but a tuple (b,a) exists so that it cannot be part of the graph).



The provenance graph on the left explains c cannot be paired with b in the result. Only one failed derivation is shown, because there is no tuple (c,b) in the database so as failed rule derivation in red. The negated goal cannot be part of the explanation since it is successful as no tuple (c,a) exists in the database.

⚠️ **GitHub.com Fallback** ⚠️