QLever support for SPARQL 1.1 Update

This page is outdated. In the meantime, there is a proof-of-concept implementation of SPARQL Update in https://github.com/ad-freiburg/qlever/pull/916 . It will be integrated into QLever over the coming months.

QLever currently does not support SPARQL 1.1 Update operations. However, we plan to add this support, and this document outlines the implementation plan. To follow it, you should be familiar with QLever's basic architecture, in particular the basic index data structures (QLever's "permutations") and how QLever internally maps each IRI or literal to a unique ID (QLever's "internal and external vocabulary"). As it turns out, the following ideas are also relevant for an efficient implementation of SPARQL 1.1 Federated Query.

Two indexes A and B

The main idea behind this implementation plan is to build and maintain two fully-functional indexes. One index, called A in the following, is built for a snapshot of the set of RDF triples at a certain point in time T. Another index, called B, is built for all the RDF triples that are added or deleted after time T.
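The following minimal sketch illustrates the two-index layout under a deliberately simplified data model (plain 64-bit IDs, fully sorted in-memory triple vectors). All names are illustrative and not QLever's actual classes.

```cpp
#include <cstdint>
#include <vector>

using Id = uint64_t;

// A triple of internal IDs (subject, predicate, object).
struct Triple {
  Id s, p, o;
};

// An entry of index B: a triple plus a flag that marks deletions of
// triples contained in index A.
struct DeltaTriple {
  Triple triple;
  bool deleted;  // true = the triple exists in A and was deleted after time T
};

// Index A: snapshot of all triples at time T, rebuilt e.g. once per day.
// Index B: all additions and deletions after time T, rebuilt every few
// seconds or minutes.
struct TwoIndexStore {
  std::vector<Triple> indexA;
  std::vector<DeltaTriple> indexB;
};
```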

Index A is periodically rebuilt from scratch. For a complete dump of Wikidata, an index build currently takes around 14 hours (on a standard PC without special hardware), so index A could be rebuilt from scratch every day.

Index B is also periodically rebuilt from scratch, but at a much higher frequency. For Wikidata, the number of edits that accumulate over a few days is in the millions. QLever can build an index from scratch at a speed of around half a million triples per second, so even a few million triples take only a handful of seconds to index. Index B could therefore be rebuilt once every few seconds or every minute.

When a triple that was contained in the data at time T (and is hence contained in A) is deleted after time T, we also insert it into B, but with a "deleted" tag. When a triple that is already contained in A is added again, we simply ignore it. For any sequence of update operations on the same triple, we can trivially compute during the construction of B whether the triple's eventual status is in or out. Without loss of generality, B therefore contains only two kinds of triples: triples that are not contained in A and were added after time T, and triples that are contained in A and were deleted after time T (stored in B with the "deleted" tag).
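To make the collapsing concrete, here is a small sketch of how a sequence of insert and delete operations on the same triple reduces to at most one entry in B. Only the last operation and the triple's membership in A matter; the function and type names are hypothetical.

```cpp
#include <optional>
#include <vector>

enum class Op { Insert, Delete };

// What (if anything) index B stores for one triple.
struct DeltaEntry {
  bool deleted;  // true = triple is in A and must be suppressed
};

// Returns the entry to store in B, or std::nullopt if B needs no entry
// because the triple's final state already agrees with index A.
std::optional<DeltaEntry> collapseUpdates(bool containedInA,
                                          const std::vector<Op>& ops) {
  if (ops.empty()) return std::nullopt;
  bool finallyPresent = (ops.back() == Op::Insert);
  if (finallyPresent == containedInA) return std::nullopt;  // no change vs. A
  if (finallyPresent) return DeltaEntry{false};  // new triple, not in A
  return DeltaEntry{true};                       // triple from A was deleted
}
```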

It is important how we assign IDs to the IRIs and literals of the triples in B. IRIs or literals that already exist in A must get the same ID as in A. IRIs or literals that do not exist in A must get an ID that is larger than all IDs in A. Along with each such new ID in B, we also store the largest ID in A whose IRI or literal is <= the new IRI or literal in the respective natural order. Given the small number of triples in B, this is easy to compute and store.
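The following sketch shows one way to implement this ID assignment, assuming A's vocabulary is a sorted array of words whose position equals its ID (QLever's real vocabulary is partly on disk and more involved). Words that occur in A reuse A's ID; new words get an ID above all of A's IDs, together with the ID of their predecessor in A.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

using Id = uint64_t;

struct LocatedWord {
  Id id;              // ID used for this word in index B
  bool isNew;         // true if the word does not occur in A
  Id predecessorInA;  // for new words: largest A-ID whose word is <= this word
};

LocatedWord locateInA(const std::vector<std::string>& vocabA,
                      const std::string& word, Id nextNewId) {
  auto it = std::lower_bound(vocabA.begin(), vocabA.end(), word);
  if (it != vocabA.end() && *it == word) {
    Id idInA = static_cast<Id>(it - vocabA.begin());
    return {idInA, false, idInA};
  }
  // The word is new: its predecessor in A is the entry just before the
  // insertion position (a word smaller than all of A's words has no
  // predecessor; this corner case is glossed over here).
  Id pred = (it == vocabA.begin()) ? 0 : static_cast<Id>(it - vocabA.begin()) - 1;
  return {nextNewId, true, pred};
}
```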

Query processing with two indexes

When processing a query, we have two types of operations: index scans (the leaves of the query execution tree) and all other operations (the inner nodes of the query execution tree). [TODO: Write a document about QLever's query execution tree. In a nutshell, it's simply the tree of operations carried out by QLever to compute the result for a query.]

For an index scan, we have to read from both A and B. In the current version of QLever (with just one index), an index scan just reads a segment from a file on disk and writes it into memory. With two indexes, we read the corresponding segments from A and B, merge them, and write the merged result into memory. The end result has the same form as in the current version of QLever. When B is small (which is the case in our target scenario, where the number of updates is small compared to the size of the whole data), the performance penalty is small, because the merge mostly copies long runs of data from A.

During this merge, we can easily take care of triples that were added to B with the "deleted" tag. Namely, during the merge we encounter such a triple right next to its corresponding triple from A, and we can then drop both, that is, write neither to the merged result (like matter and anti-matter colliding and annihilating).
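Here is a sketch of such a merged index scan under the simplified data model from above. It is a standard merge of two sorted triple lists, where an equal pair of triples from A and B (which, by construction, marks a deletion) is skipped entirely; the code is illustrative, not QLever's actual scan implementation.

```cpp
#include <cstdint>
#include <tuple>
#include <vector>

using Id = uint64_t;
struct Triple { Id s, p, o; };
struct DeltaTriple { Triple triple; bool deleted; };

static auto key(const Triple& t) { return std::tie(t.s, t.p, t.o); }

// Merges the sorted triples of a scan over A with the sorted triples of the
// corresponding scan over B; a "deleted" entry in B cancels its A counterpart.
std::vector<Triple> mergedScan(const std::vector<Triple>& fromA,
                               const std::vector<DeltaTriple>& fromB) {
  std::vector<Triple> result;
  size_t i = 0, j = 0;
  while (i < fromA.size() && j < fromB.size()) {
    if (key(fromA[i]) < key(fromB[j].triple)) {
      result.push_back(fromA[i++]);          // long runs of A are just copied
    } else if (key(fromB[j].triple) < key(fromA[i])) {
      result.push_back(fromB[j++].triple);   // a new triple from B
    } else {
      ++i;  // equal triples: by construction a deletion, write neither
      ++j;
    }
  }
  for (; i < fromA.size(); ++i) result.push_back(fromA[i]);
  for (; j < fromB.size(); ++j) result.push_back(fromB[j].triple);
  return result;
}
```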

For an intermediate operation, we can proceed almost as before. The input tables have the same form as in the single-index case, with one difference: if an input table is sorted by a particular column, the order of the IDs in that column no longer fully corresponds to the "natural" order of the respective IRIs or literals. However, since the new IDs from B are all larger than the IDs from A, the table consists of a top part (with only IDs from A in the sorted column) and a bottom part (with only new IDs from B in that column), and the "natural-order" property holds within each part. It is easy to keep track of the dividing line (row index) between the two parts.
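Since all new IDs from B are larger than every ID from A, the dividing row can be found with a single binary search, as in the following sketch (the parameter firstIdOfB, one past the largest ID in A, is an assumption of this sketch).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Id = uint64_t;

// Returns the row index at which the "B part" of a sorted column starts
// (equal to the column size if the column contains no new IDs from B).
size_t findSplitRow(const std::vector<Id>& sortedColumn, Id firstIdOfB) {
  return static_cast<size_t>(
      std::lower_bound(sortedColumn.begin(), sortedColumn.end(), firstIdOfB) -
      sortedColumn.begin());
}
```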

For many operations, the "natural-order" property is not important. In particular, it is not important for JOIN operations: all we need there is that the IDs are sorted. For some operations, the "natural-order" property is important; in particular, it is needed to efficiently process FILTER operations involving a prefix regular expression and ORDER BY operations. Whenever the "natural-order" property is needed, we can simply carry out the respective operation on both parts of the input table (for each of which the property holds, see above) and then merge the results. For the merge, the IDs from B require the precomputed mapping (see above) that tells us where they fit in among the IDs from A.
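The following sketch shows such a merge of the two parts of a column back into natural order. It assumes the precomputed mapping predecessorInA from each new B-ID to the largest A-ID whose word precedes it, and it assumes that the new IDs in B were assigned in the natural order of B's own vocabulary (so that each part is already in natural order by itself).

```cpp
#include <cstdint>
#include <functional>
#include <tuple>
#include <vector>

using Id = uint64_t;

std::vector<Id> mergeByNaturalOrder(
    const std::vector<Id>& partA,  // IDs from A, in natural order
    const std::vector<Id>& partB,  // new IDs from B, in natural order
    const std::function<Id(Id)>& predecessorInA) {
  // Natural-order key: an A-ID sorts at its own position, a B-ID sorts right
  // after its predecessor in A (ties among B-IDs broken by the B-ID itself).
  auto keyA = [](Id a) { return std::make_tuple(a, 0, Id{0}); };
  auto keyB = [&](Id b) { return std::make_tuple(predecessorInA(b), 1, b); };

  std::vector<Id> result;
  result.reserve(partA.size() + partB.size());
  size_t i = 0, j = 0;
  while (i < partA.size() && j < partB.size()) {
    if (keyA(partA[i]) <= keyB(partB[j])) {
      result.push_back(partA[i++]);
    } else {
      result.push_back(partB[j++]);
    }
  }
  result.insert(result.end(), partA.begin() + i, partA.end());
  result.insert(result.end(), partB.begin() + j, partB.end());
  return result;
}
```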

Optimizations

A large part of the cost of building index B is locating each IRI or literal of B in the large vocabulary of A (finding either its exact ID or its nearest predecessor). For a sequence of rebuilds of B relative to the same index A, we can reuse the results of the previous locating step. If, for example, we rebuild B once every minute, we only need to locate the IRIs and literals of the triples added or deleted in the last minute.
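A minimal sketch of this reuse, assuming a hypothetical expensive locate function (such as the locateInA sketch above): the results of previous locating steps are cached by word, so that each rebuild of B only pays for words it has not seen before.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

using Id = uint64_t;

struct Location {
  Id idOrPredecessor;  // exact A-ID, or the predecessor A-ID for new words
  bool existsInA;
};

class LocationCache {
 public:
  // Falls back to the expensive search (binary search over A's large
  // vocabulary) only on a cache miss.
  template <typename ExpensiveLocate>
  Location locate(const std::string& word, ExpensiveLocate&& expensiveLocate) {
    auto it = cache_.find(word);
    if (it != cache_.end()) return it->second;
    Location loc = expensiveLocate(word);
    cache_.emplace(word, loc);
    return loc;
  }

 private:
  std::unordered_map<std::string, Location> cache_;
};
```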