QLever performance evaluation and comparison to other SPARQL engines

Here are the results of a simple performance evaluation and comparison of QLever, Virtuoso, Blazegraph, GraphDB, Stardog, Apache Jena, and Oxigraph on a moderately sized dataset (DBLP) and on Wikidata (for those engines that provide a public endpoint). More engines and more datasets will be added in the future. However, since all of the metrics below scale essentially linearly with the size of the dataset (at least for QLever), the results on this one dataset already say a lot.

All evaluations (of all engines) were run on an AMD Ryzen 9 7950X with 16 cores, 128 GB of RAM, and 7.1 TB of NVMe SSD. This is high-quality but affordable consumer hardware (as opposed to typical server hardware), with a total cost of around 2500 €.

Evaluation and comparison on the DBLP dataset (390 M triples)

The dataset used was the RDF dump of DBLP, version 02.04.2024 (1.8 GB compressed, 390 million triples, 68 distinct predicates; see the SPARQL endpoint at https://qlever.cs.uni-freiburg.de/dblp).

The following table compares loading time (in seconds), loading speed (in million triples per second), and index size (in GB). The next-to-last column shows the average query time for the small benchmark detailed in the next section. The last column provides a subjective assessment of how easy or hard it was to build the index and run queries (Blazegraph requires explicit chunking to load larger datasets; GraphDB's normal load takes forever; Virtuoso is old and error-prone, with unusual interfaces; the setup for Stardog was by far the most complicated of all; see the section "Command lines ..." below).

| SPARQL engine | Code | Loading time | Loading speed | Index size | Avg. query time | Ease of setup |
|---------------|------|--------------|---------------|------------|-----------------|---------------|
| Oxigraph | Rust | 640s | 0.6 M/s | 67 GB | 93s | very easy |
| Apache Jena | Java | 2392s | 0.2 M/s | 42 GB | 69s | very easy |
| Stardog | Java | 724s | 0.5 M/s | 28 GB | 17s | many hurdles |
| GraphDB | Java | 1066s | 0.4 M/s | 28 GB | 16s | some hurdles |
| Blazegraph | Java | 6326s | <0.1 M/s | 67 GB | 4.3s | some hurdles |
| Virtuoso | C | 561s | 0.7 M/s | 13 GB | 2.2s | many hurdles |
| QLever | C++ | 231s | 1.7 M/s | 8 GB | 0.7s | very easy |

The following table compares query processing times on six queries from the "Examples" of https://qlever.cs.uni-freiburg.de/dblp. The queries were selected for their variety (see the "Comment" column), not to make a particular engine look particularly good or bad. For each engine, the query times were measured after emptying the disk cache with sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches" and starting the respective server from scratch. For QLever, its internal cache was additionally cleared after each query (this puts QLever at a disadvantage). For the other engines, no such precautions were taken. There was no significant (IO-heavy or CPU-heavy) activity on the machine during the evaluation. The > in one of the table cells below indicates that Virtuoso, due to an internal limitation, downloaded only 1,048,576 of the roughly 7 million results for the respective query.

| Query | Result shape | Oxigraph | Jena | Stardog | GraphDB | Blazegraph | Virtuoso | QLever | Comment |
|-------|--------------|----------|------|---------|---------|------------|----------|--------|---------|
| All papers published in SIGIR | 6264 x 3 | 1.6s | 0.3s | 0.52s | 0.17s | 0.47s | 0.54s | 0.02s | Two simple joins, nothing special |
| Number of papers by venue | 19954 x 2 | 2.6s | 28s | 2.0s | 3.1s | 1.2s | 1.0s | 0.02s | Scan of a single predicate with GROUP BY and ORDER BY |
| Author names matching REGEX | 513 x 3 | 5.6s | 4.8s | 0.61s | 0.29s | 0.27s | 0.98s | 0.05s | Joins, GROUP BY, ORDER BY, FILTER REGEX |
| All papers in DBLP until 1940 | 70 x 4 | 313s | 50s | 16s | 0.04s | 5.9s | 0.08s | 0.11s | Three joins, a FILTER, and an ORDER BY |
| All papers with their title | 7167122 x 2 | 132s | 54s | 44s | 20s | 18s | >9.1s | 4.2s | Simple, but must materialize large result (problematic for many SPARQL engines) |
| All predicates ordered by size | 68 x 3 | 106s | 279s | 37s | 72s | 0.05s | 1.48s | 0.01s | Conceptually requires a scan over all triples, but huge optimization potential |
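
As mentioned above, QLever's internal cache was cleared after each query. One way to do this is via the qlever script; the clear-cache command shown here is an assumption on my part and not necessarily what was used for the measurements above:

qlever clear-cache   [assumed command; asks the running QLever server to clear its query result cache]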

Performance comparison of four Wikidata endpoints

The following is a performance comparison of four SPARQL endpoints for Wikidata on 298 example queries from the Wikidata Query Service. Column 6 provides the average query time only for those queries that did not fail; this gives an undue advantage to engines where many queries fail, so this number should be taken with a grain of salt. Note that MillenniumDB does not appear in the performance evaluation for DBLP above because even for that simple benchmark, with only six relatively straightforward queries, half of the queries fail.

| Wikidata SPARQL endpoint | query time <= 1.0s | (1.0s, 5.0s] | > 5.0s | failed | avg. query time | median query time |
|--------------------------|--------------------|--------------|--------|--------|-----------------|-------------------|
| Wikidata Query Service | 36% of all queries | 20% | 23% | 21% | 6.98s | 2.47s |
| QLever | 78% of all queries | 11% | 9% | 2% | 1.38s | 0.24s |
| Virtuoso | 54% of all queries | 15% | 20% | 11% | 4.11s | 0.74s |
| MillenniumDB | 12% of all queries | 22% | 11% | 55% | 6.05s | > 50% failed |

Results were obtained using the following command line, which launches one query after the other and records the time and result size (or whether the query failed):

qlever example-queries --get-queries-cmd "cat wikidata-queries.tsv" --download-or-count download --accept application/sparql-results+json --sparql-endpoint <URL of SPARQL endpoint>

The numbers for the table were obtained from the respective outputs using the following command line:

for RESULTS in wikidata-queries.*-results.txt; do
  printf "%-45s  %4d  %4d  %4d  %4d  %7.2fs  %6s\n" $RESULTS \
    $(cat $RESULTS | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | awk '$1 <= 1.0' | wc -l) \
    $(cat $RESULTS | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | awk '$1 > 1.0 && $1 <= 5.0' | wc -l) \
    $(cat $RESULTS | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | awk '$1 > 5.0' | wc -l) \
    $(cat $RESULTS | \grep -o "  failed  " | wc -l) \
    $(\grep ^AVERAGE $RESULTS | \grep -o ' [0-9]\+\.[0-9][0-9] s' | sed 's/[ s]//g') \
    $(cat $RESULTS | \grep "^ *[0-9]\+ " | sed 's/  failed  /  60.00 s  /g' | \grep -o " [0-9]\+\.[0-9][0-9] s" | sed 's/[ s]//g' | datamash median 1 | xargs printf "%6.2fs" | sed 's/60.00s/failed/')
done

Command lines for producing the results for DBLP above (loading and queries)

For each engine, we created a folder containing only the input file dblp.ttl.gz and a file queries.tsv, obtained via curl -s https://qlever.cs.uni-freiburg.de/api/examples/dblp | sed -n '3p;4p;5p;6p;10p;15p' > queries.tsv (see below for its contents). For Virtuoso, the folder also contained the config file virtuoso.ini (with generous settings regarding memory consumption); for QLever, the config file Qleverfile (with standard settings).

Oxigraph

git clone --recursive git@github.com:oxigraph/oxigraph.git
cd oxigraph/cli && cargo build --release && export PATH=$PATH:<oxigraph dir>/target/release

oxigraph load -f dblp.ttl.gz -l .
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
oxigraph serve-read-only -l . -b localhost:8015
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:8015/query

Apache Jena

wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-5.0.0.zip
unzip apache-jena-fuseki-5.0.0.zip && rm -f $_
wget https://dlcdn.apache.org/jena/binaries/apache-jena-5.0.0.zip
unzip apache-jena-5.0.0.zip && rm -f $_
sudo apt update && sudo apt install -y openjdk-21-jdk
sudo update-alternatives --config java  ->  select JDK 21 (auto mode)

apache-jena-5.0.0/bin/tdb2.xloader --loc data dblp.ttl.gz
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
java -jar apache-jena-fuseki-5.0.0/fuseki-server.jar --port 8015 --loc data /dblp
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:8015/dblp

Stardog

sudo apt install gnupg
curl http://packages.stardog.com/stardog.gpg.pub | sudo apt-key add -
echo "deb http://packages.stardog.com/deb/ stable main" | sudo tee -a /etc/apt/sources.list
sudo apt-get update
sudo apt-get install -y stardog=10.0.1
sudo apt-get install bash-completion
source /opt/stardog/bin/stardog-completion.sh
sed -i 's/UseParallelOldGC/UseParallelGC/' /opt/stardog/bin/helpers.sh
export STARDOG_SERVER_JAVA_ARGS="-Xms20g -Xmx20g"
export STARDOG_PROPERTIES=$(pwd) && echo "memory.mode = bulk_load" > stardog.properties

stardog-admin server start
stardog-admin db create -n dblp dblp.ttl.gz
stardog-admin server stop
rm -f stardog.properties
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
stardog-admin server start --disable-security
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:5820/dblp/query

GraphDB

Fill out the form on https://www.ontotext.com/products/graphdb/download/
Click on the link "Platform-independent distribution" in the confirmation mail and download graphdb-10.6.2-dist.zip
unzip graphdb-10.6.2-dist.zip && rm -f $_

graphdb-10.6.2/bin/console
> create graphdb   [ID = dblp, rest = default]
> quit
graphdb-10.6.2/bin/importrdf preload -f -i dblp dblp.ttl.gz
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
graphdb-10.6.2/bin/graphdb
curl -s localhost:7200/repositories/dblp --data-urlencode 'query=SELECT * { ?s ?p ?o } LIMIT 1'   [minimal warmup]
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:7200/repositories/dblp

Blazegraph

wget https://github.com/blazegraph/database/releases/download/BLAZEGRAPH_2_1_6_RC/blazegraph.jar

java -server -Xmx20g -jar blazegraph.jar &
docker run -it --rm -v $(pwd):/data stain/jena riot --output=NT /data/dblp.ttl.gz | split -a 3 --numeric-suffixes=1 --additional-suffix=.nt -l 1000000  --filter='gzip > $FILE.gz' - dblp-
for CHUNK in dblp-???.nt.gz; do curl -s localhost:9999/blazegraph/namespace/kb/sparql --data-binary update="LOAD <file://$(pwd)/${CHUNK}>"; done
kill %1
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
java -server -Xmx20g -jar blazegraph.jar &
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:9999/blazegraph/namespace/kb/sparql

Virtuoso

git clone https://github.com/openlink/virtuoso-opensource virtuoso
sudo apt install -y autoconf automake libtool flex bison gperf gawk m4 make openssl
libsrc/Wi/sparql_io.sql  ->  change maxrows := 1024*1024; to maxrows := 2*1024*1024 - 2;   [see https://github.com/openlink/virtuoso-opensource/issues/700]
./autogen.sh && ./configure && make && sudo make install
virtuoso.ini  ->  change: ServerPort = 8888, NumberOfBuffers and MaxDirtyBuffers = presets for 64 GB free memory, DefaultHost = <hostname>:8890, DirsAllowed = <directory with the input files>, ResultSetMaxRows = 2000000, MaxQueryCostEstimationTime = 3600, MaxQueryExecutionTime = 3600, MaxQueryMem = 20G
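
For orientation, the edits above should correspond roughly to the following virtuoso.ini entries (a sketch, not the full file; the NumberOfBuffers/MaxDirtyBuffers values are the 64 GB presets that ship commented out in the stock virtuoso.ini, and the section placement follows the stock file):

; sketch of the changed virtuoso.ini entries, <...> are placeholders
[Parameters]
ServerPort                 = 8888
NumberOfBuffers            = 5450000
MaxDirtyBuffers            = 4000000
DirsAllowed                = <directory with the input files>
MaxQueryMem                = 20G
[HTTPServer]
ServerPort                 = 8890
[URIQA]
DefaultHost                = <hostname>:8890
[SPARQL]
ResultSetMaxRows           = 2000000
MaxQueryCostEstimationTime = 3600
MaxQueryExecutionTime      = 3600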

isql-vt 8888
SQL> ld_dir('/local/data/qlever/qlever-indices/virtuoso-playground.ssd', 'dblp.ttl.gz', '');
SQL> DB.DBA.rdf_loader_run();
SQL> checkpoint;
SQL> exit;
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
/usr/bin/virtuoso-t -f &
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:8890/sparql

QLever

pip install qlever

qlever index
sudo bash -c "sync; sleep 5; echo 3 > /proc/sys/vm/drop_caches"
qlever start
qlever example-queries --get-queries-cmd "cat queries.tsv" --download-or-count download --sparql-endpoint localhost:7015

Contents of queries.tsv

All papers published in SIGIR	PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?paper ?title ?year WHERE { ?paper dblp:title ?title . ?paper dblp:publishedIn "SIGIR" . ?paper dblp:yearOfPublication ?year } ORDER BY DESC(?year)
Number of papers by venue	PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?venue (COUNT(?paper) as ?count) WHERE { ?paper dblp:publishedIn ?venue } GROUP BY ?venue ORDER BY DESC(?count)
Author names matching REGEX	PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?author ?author_label ?count WHERE { { SELECT ?author ?author_label (COUNT(?paper) as ?count) WHERE { ?paper dblp:authoredBy ?author . ?paper dblp:publishedIn "SIGIR" . ?author rdfs:label ?author_label } GROUP BY ?author ?author_label } FILTER REGEX(STR(?author_label), "M.*D.*", "i") } ORDER BY DESC(?count)
All papers in DBLP until 1940	PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> PREFIX dblp: <https://dblp.org/rdf/schema#> PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> SELECT ?title ?author ?author_label ?year WHERE { ?paper dblp:title ?title . ?paper dblp:authoredBy ?author . ?paper dblp:yearOfPublication ?year . ?author rdfs:label ?author_label . FILTER (?year <= "1940"^^xsd:gYear) } ORDER BY ASC(?year) ASC(?title)
All papers with their title (large result)	PREFIX dblp: <https://dblp.org/rdf/schema#> SELECT ?paper ?title WHERE { ?paper dblp:title ?title }
All predicates, ordered by number of subjects	SELECT ?predicate (COUNT(?subject) as ?count) WHERE { ?subject ?predicate ?object } GROUP BY ?predicate ORDER BY DESC(?count)