Comparisons Satisfiability Check - ConstantB/ontop-spatial GitHub Wiki

The code for this feature resides in the packages ontop-reformulation-core, mainly in the class it.unibz.krdb.obda.owlrefplatform.core.unfolding.ComparisonsSatisfiability, and in ontop-quest-owlapi3, where there are some test cases.

The unfolding

More precisely, the changes are located in the it.unibz.krdb.obda.owlrefplatform.core.unfolding package; this package implements the generation of an SQL query for a given SPARQL query. As of now (Ontop 1.15), the unfolding Java code incorporates the following steps:

translation to Datalog;
remove irrelevant rules (non-answer queries);
unfolding

*compute partial evaluation*: resolve data atoms, unfold Joins and LeftJoins;

*enforce equalities*; * *flattening*: flatten the query by applying resolution steps to every atom against the rules in the unfolding program; * *replacing equivalences*: remove all references to class/properties with equivalents and replace them by the equivalents; * *normalization*: expand Joins; * *rewriting*: optimize the query with respect to *Sigma rules*; * *partial evaluation*: resolve atoms with unifiable rules in the unfolding program;

again *unfolding*;
again *remove irrelevant rules*;
*evaluate boolean expressions*;

The satisfiability of comparisons is checked during this last step, in particular in the function evaluateExpressions(CQIE), called as a last resort through ComparisonsSatisfiability.unsatisfiable(CQIE).

The comparisons satisfiability check

In order to understand the optimization, let's see an example. Let's consider the two mapping assertions in the Datalog format:

A(X) :- T(X), X > 3
B(X) :- T(X), X < 1

and the SPARQL query

SELECT * WHERE {?x a :A. ?x a :B .} .

The intermediate answer (Datalog rule)

ans1(X) :- T(X), X > 3, X < 1

is clearly unsatisfiable, given the contradicting comparisons regarding the variable X. This means that the answer is always empty, and that this rule could be eliminated, but currently Ontop is not able to acknowledge this unsatisfiability. This implies that the SQL query is executed anyway, resulting in a loss of time.

This article is the starting point for the algorithm that was implemented. In that paper, different cases are considered, according to the kind of arithmetical comparisons and to the datatypes that are allowed. It turns out that the problem of deciding satisfiability of the comparisons when the variables range over natural numbers is much more difficult (NP-complete on the number of comparisons) than the one that concerns variables that range over real numbers (linear).

We decided to implement at first the version for real numbers, since it possesses a feasible computational complexity.

The rough steps executed by the algorithm are the following:

eliminate trivial inequalities;
scan atoms in the body;
encode the minimum ranges in the graph;
encode the constants' ordering in the graph;
compute the strongly connected components of the graph;
inspect the components for violations of the neq constraints.

Given a conjunctive query, the first step is to eliminate trivial inequalities, that is inequalities between constants: this task is not performed in the class ComparisonsSatisfiability, but before, directly during evaluation. The elimination of atoms of the form const1 (op) const2, where (op) is an inequality operation, is done in the functions evalEqNeq, evalGtLt and evalGteLte; the elimination of atoms like var = var or var != var in the function evalEqNeq. In order to achieve so, I had to improve the existing functions, and create from scratch evalGtLt and evalGteLte.

As the next step, all the atoms in the body are scanned in sequence. When an and predicate is encountered, it is flattened.

In case of ors, both the possibilities in the disjunction should be considered. Currently, when a disjunction occurs, the algorithm naively duplicates the current state of the algorithm (the graph built up to now, the body of the query, the disequality constraints, etc.) and runs recursively two satisfiability checks, one for each possibility among the arguments of the disjunction. This is exactly as in the case of an arbitrary propositional logic formula, when satisfiability is checked by exhaustively/recursively splitting/evaluating every disjunction. Of course this algorithm can take time exponential in the size of the length of the query, but usually the nesting and the occurrence of ors is limited.

For each variable occurring in the body, the corresponding minumum range is built: by "minimum range" we mean the interval (at the beginning, the trivial one) that is consecutively shrinked while the constraints on the corresponding variable are scanned. As an example, at first the minimum range for the variable x is [-inf, +inf]; when the atom x =< 1 is found in the body, the range is replaced by [-inf, 1]; and so forth. In order to represent minimum ranges in Java, the specific class DoubleInterval was created.

Each variable is mapped to the list of variables or constants that it should be different to: for example, if x != 9 or x < 9 is encountered, then 9 is appended to the list neq_constraints[x].

A directed graph (the gteGraph) is considered, that has a node for each variable and constant, and an edge v1 --> v2 for every comparison atom of the form v1 >= v2 or v1 > v2 in the body.

Once all the atoms in the body are scanned, and the minimum ranges are built, we can encode the information contained in the latter into the greater-than-or-equal graph. For example, if the minimum range for the variable x is the interval [lowerBound, upperBound], we have to add to the gteGraph the edges x --> lowerBound and upperBound --> x.

We also have to encode in the graph the information about the linear order of the constants, that are converted to real numbers (Doubles). This is because otherwise there would be no information in the graph about the semantics of numbers. We order the constants that occurred in the minimum ranges, and once we know x1 >= x2 >= ... >= xn, we add to the graph the edges x1 --> x2, ..., x(n-1) --> xn.

The next step is to compute the strongly connected components of the geqGraph: these components are sets of nodes that are necessarily to coincide, given the comparisons in the body. This is because every node in the component is both greater-or-equal-than and smaller-or-equal-then any other node in the component, and this means it must be actually equal to it.

Once we computed the strongly connected components, we have to check two things: the first is whether a component contains two different constants, since this would be a contradiction, meaning that the query is unsatisfiable because it requires different numbers to be equal. The second thing is to check the neq_constraints map that was constructed earlier. For every variable x, if the component containing x contains some element of neq_constraints[x] as well, then the query is unsatisfiable.

If the query is found to be unsatisfiable, it is eliminated during evaluation as if it evaluated to the false atom (OBDAVocabulary.FALSE); if instead it is satisfiable, or if this test cannot show the query not to be (the algorithm is not complete), the original query is kept.

This algorithm is not complete, since for example it treats any number occurring in the query as a real number (Double). This means that the Datalog query

ans1(X) :- A(X), B(Y), 0 =< X =< 1, 0 =< Y =< 1, X != Y.

would be classified as not unsatisfiable, even if the comparisons are not satisfiable if the variables range only over natural numbers.

Test cases

The test cases use an ontology (OWL file) that declares concepts as Lt1or2 or Lt1orGt7 : the intuition is that an individual should belong to these concepts if it corresponds to some data that is respectively less-than-one-or-less-than-2, and less-than-1-or-greater-than-7. These concepts are useful since they allow to have queries with disjunctions occurring in them. In the ontology there are also the predicates Gt, Lt, etc. They are used at the same time as roles and as (data) properties. This means that in a query can occur in the right hand side both a "materialized" individual (ex: :obj1 :Gt :obj2), meaning that the two individuals in that relation correspond to two pieces of data that are in the corresponding arithmetical relation; in the right hand side can also occur a value (ex: :obj1 :Gt '2'^^xsd:int), and the intended meaning from the mapping is similar.

The schema for the data is a dummy SQL table with one dummy column COL.

As usual the ontology, that can be found in /ontop-quest-owlapi3/src/test/resources/comparisons/in.owl, is loaded through OWLOntologyManager, while the OBDA data, that contains the mapping declarations, can be found in /ontop-quest-owlapi3/src/test/resources/comparisons/in.obda; here the mappings in the Ontop mapping language consist of a source (an SQL query), and a target (RDF triples).

In order to run the tests, an instance of the QuestOWL reasoner is created, and then new QuestOWLStatements to execute the SPARQL queries that we want to unfold. If the SPARQL query is found to be unsatisfiable, then the unfolded SQL query will be the empty string.

References

[1] Sha Guo et al: “Solving Satisfiability and Implication Problems in Database Systems”