QLever's IDs for IRIs and literals - ad-freiburg/qlever GitHub Wiki

In QLever, each IRI or literal has a unique ID. The literals are divided into several groups, depending on their data type (string, integer, decimal, date, etc.). The IRIs are in a group of their own. Each group has its own range (interval) of IDs, and these ranges do not overlap. For each group, the IDs have the following important properties:

1. For the IRIs, the order of the IDs corresponds to the lexicographic order of the IRIs. That is, for two IRIs i1 and i2, ID(i1) < ID(i2) if and only if i1 < i2. [TODO: explain details of the lexicographic order and how it can be configured via the settings.json file]
2. For each literal type, the order of the IDs corresponds to the "natural" order of the values of this literal type (where "corresponds" is meant in exactly the same sense as explained for the previous item). For example, for string literals, the natural order is the lexicographic order. For integer literals, the natural order is the order by integer value. For date literals, it is the chronological order. And so on.
3. For the IRIs, the IDs are of two kinds. The first kind are called "milestone IDs", which are multiples of some configurable constant. The translation of milestone IDs to IRIs resides in memory; this is QLever's so-called internal vocabulary. The translation of all other IDs to IRIs resides on disk; this is QLever's so-called external vocabulary. The internal vocabulary has faster access, but limited size. Which IDs/IRIs reside in the internal vocabulary is configurable. Note that while all IRI IDs are contained in their own interval (disjoint from the intervals for the literal IDs), they are not necessarily contiguous. In fact, they cannot be if we want the "internal" IDs to be multiples of a constant and the order property from item 1 to hold at the same time.
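
To see why the IRI IDs cannot be contiguous, here is a minimal sketch in Python of one way to assign IDs that satisfies both the order property and the milestone rule. It is not QLever's actual implementation: the function name `assign_iri_ids`, the `milestone` parameter, and the use of Python's default string comparison (instead of QLever's configurable order) are all assumptions made for illustration.

```python
def assign_iri_ids(iris, internal, milestone=4):
    """Assign order-preserving IDs to IRIs (illustrative sketch only).

    IRIs in `internal` get milestone IDs (multiples of `milestone`);
    all other IRIs get IDs that are not multiples of `milestone`.
    """
    ids = {}
    next_id = 0
    # Hypothetical stand-in for QLever's lexicographic order.
    for iri in sorted(iris):
        if iri in internal:
            # Round up to the next multiple of `milestone`.
            next_id = -(-next_id // milestone) * milestone
        elif next_id % milestone == 0:
            # Multiples are reserved for internal IRIs, so skip this one.
            next_id += 1
        ids[iri] = next_id
        next_id += 1
    return ids


example_ids = assign_iri_ids(
    ["<a>", "<b>", "<c>", "<d>", "<e>"], internal={"<b>", "<d>"}
)
# example_ids == {"<a>": 1, "<b>": 4, "<c>": 5, "<d>": 8, "<e>": 9}
```

In this toy example the IDs 2, 3, 6 and 7 remain unused: whenever an internal IRI follows an external one, the ID has to jump to the next multiple of the constant, which leaves a gap.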

Current version of QLever

In the current version of QLever (which will soon be updated), internal and external IDs are in disjoint intervals. More specifically, all external IDs are larger than all internal IDs. This is easy to implement, but leads to wrong results in special cases. For example, the following query outputs a list of names of countries containing a capital A: first all the English names (these are in the internal vocabulary) in lexicographic order, followed by all the German names (these are in the external vocabulary) in lexicographic order. The correct result would interleave the names, so that the whole list is in lexicographic order.

Try query on QLever

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name WHERE {
  { ?country wdt:P31 wd:Q6256 . ?country rdfs:label ?name . FILTER (LANG(?name) = "de") }
  UNION
  { ?country wdt:P31 wd:Q6256 . ?country rdfs:label ?name . FILTER (LANG(?name) = "en") }
  FILTER REGEX(?name, "A")
}
ORDER BY ASC(?name)
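
The effect can be reproduced in miniature. The following Python sketch assumes, as described above, that all external IDs are larger than all internal IDs, and that results are sorted by ID. The function name `assign_disjoint_ids` and the sample names are hypothetical and chosen only for illustration; this is not QLever code.

```python
def assign_disjoint_ids(internal_words, external_words):
    """Give all external words larger IDs than all internal words.

    Within each vocabulary, IDs follow the lexicographic order.
    Assumes the two vocabularies share no entries.
    """
    ids = {}
    for i, word in enumerate(sorted(internal_words)):
        ids[word] = i
    offset = len(internal_words)
    for i, word in enumerate(sorted(external_words)):
        ids[word] = offset + i
    return ids


english = ["Albania", "Argentina", "Austria"]       # internal vocabulary
german = ["Albanien", "Argentinien", "Australien"]  # external vocabulary

ids = assign_disjoint_ids(english, german)

# Sorting by ID yields two separately sorted runs.
by_id = sorted(ids, key=ids.get)
# Sorting by value yields the correct, interleaved order.
by_value = sorted(ids)
```

Here `by_id` lists the three English names followed by the three German names, whereas `by_value` interleaves them (Albania, Albanien, Argentina, Argentinien, Australien, Austria), which is what `ORDER BY ASC(?name)` should return.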