Differences between QLever's Wikidata SPARQL endpoint and the Wikidata Query Service (WDQS) - ad-freiburg/qlever GitHub Wiki

This document describes the differences between the two SPARQL endpoints for Wikidata provided by https://qlever.cs.uni-freiburg.de/wikidata (called QLever in the following) and https://query.wikidata.org (called WDQS in the following).

Dataset

@Extent: Both endpoints query the complete Wikidata, including statement nodes and lexemes. As of this writing, this amounts to around 19 billion triples. QLever also contains abstracts from the English Wikipedia (which can be accessed via the property schema:description, with the URL of the Wikipedia article as subject), as well as all sentences from the English Wikipedia that mention an entity from Wikidata. This additional information is useful when you want to combine text search with SPARQL search.

@Actuality: WDQS always queries the latest data. That is, when someone makes a change on https://wikidata.org that affects a particular query, the change shows up in the query result shortly afterwards (with a latency of around a minute). QLever queries the latest dump from https://dumps.wikimedia.org/wikidatawiki/entities, which the Wikidata team currently updates once per week. There is currently no easily accessible stream of updates to the data. Once one becomes available, QLever will switch to daily updates and then to updates at short intervals, like WDQS.

@Sitelinks: The Wikidata RDF data provides a count of the number of sitelinks for each entity (a sitelink is a Wikimedia page about that entity). This count is a good proxy for the popularity or importance of an entity. In the dataset, each Wikidata entity is connected to the number of its sitelinks via the predicate path ^schema:about/wikibase:sitelinks. For efficiency reasons, WDQS shortens this path to the single predicate wikibase:sitelinks. QLever does not modify the data in any way, so queries against QLever must use the full path.
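
To make the difference concrete, the following Python sketch builds the sitelinks query for each endpoint. It only constructs query strings and does not contact either service. The helper name sitelinks_query, the dictionary SITELINKS_PATH, and the example entity Q42 are assumptions made for illustration; the two predicate paths are the ones given above.

```python
# Prefix declarations for the Wikidata RDF vocabulary used below.
PREFIXES = (
    "PREFIX wd: <http://www.wikidata.org/entity/>\n"
    "PREFIX wikibase: <http://wikiba.se/ontology#>\n"
    "PREFIX schema: <http://schema.org/>\n"
)

# Path from an entity to its sitelink count on each endpoint
# (hypothetical lookup table; the paths themselves come from the text).
SITELINKS_PATH = {
    "qlever": "^schema:about/wikibase:sitelinks",  # unmodified RDF data
    "wdqs": "wikibase:sitelinks",                  # shortened form on WDQS
}


def sitelinks_query(endpoint: str, entity_id: str = "Q42") -> str:
    """Build a query for the sitelink count of one entity (illustrative helper)."""
    path = SITELINKS_PATH[endpoint]  # raises KeyError for an unknown endpoint
    return (
        PREFIXES
        + "SELECT ?sitelinks WHERE {\n"
        + f"  wd:{entity_id} {path} ?sitelinks .\n"
        + "}\n"
    )
```

Calling sitelinks_query("qlever") and sitelinks_query("wdqs") yields two queries that differ only in the predicate path, so a query ported from one endpoint to the other should need only that one substitution.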

@Labels: The Wikidata RDF data provides labels for each entity via the predicates rdfs:label and schema:name. The two predicates carry the same values, and WDQS provides only rdfs:label. Using rdfs:label directly is inefficient with WDQS, where the use of SERVICE wikibase:label is recommended instead. QLever has no efficiency issues with rdfs:label. Make sure to specify the language in which you want a label, via something like FILTER(LANG(?label) = "en"). Otherwise you get labels in all available languages, which is typically not what you want.
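
The two ways of fetching a label described above can be sketched as follows. This is a minimal illustration that only builds query strings. The function names label_query and label_service_query and the example entity Q42 are assumptions, and the SERVICE form follows the usual WDQS convention of binding ?itemLabel for a variable ?item.

```python
WD = "PREFIX wd: <http://www.wikidata.org/entity/>\n"


def label_query(entity_id: str = "Q42", language: str = "en") -> str:
    """Plain rdfs:label with a language filter (the form suggested for QLever)."""
    return (
        WD
        + "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        + "SELECT ?label WHERE {\n"
        + f"  wd:{entity_id} rdfs:label ?label .\n"
        + f'  FILTER(LANG(?label) = "{language}")\n'  # without this: all languages
        + "}\n"
    )


def label_service_query(entity_id: str = "Q42", language: str = "en") -> str:
    """Label service form (the form recommended for WDQS)."""
    return (
        WD
        + "PREFIX wikibase: <http://wikiba.se/ontology#>\n"
        + "PREFIX bd: <http://www.bigdata.com/rdf#>\n"
        + "SELECT ?item ?itemLabel WHERE {\n"
        + f"  VALUES ?item {{ wd:{entity_id} }}\n"
        + "  SERVICE wikibase:label { "
        + f'bd:serviceParam wikibase:language "{language}" . '
        + "}\n"
        + "}\n"
    )
```

The first form is standard SPARQL 1.1. The second relies on the wikibase:label service, so it is specific to WDQS and is shown here only for contrast.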

Realization

@Efficiency: WDQS is based on the Blazegraph SPARQL engine. Blazegraph is mature software, with an almost complete and almost error-free implementation of the SPARQL 1.1 standard. However, Blazegraph has major efficiency issues with datasets as large as Wikidata. As a rule of thumb: whenever a query has to look at a large part of the data (millions of triples), the query will be slow and possibly time out. QLever is much more efficient and has no problem processing queries even when they have to touch billions of triples. TODO: link to separate post for more info.

@UI: The user interface of WDQS also comes from Blazegraph. It is simple and functional. It provides autocompletion for individual tokens, provided that you have already typed the prefix (for example, you can type wd: or wdt: and then press Ctrl+Space for completions). It provides various result views, including a table view and a map view for items that have coordinates. QLever comes with its own user interface. It provides more powerful autocompletion, which depends on the part of the query already typed and does not require you to commit to a prefix before you get completions. It also offers a table view for the result, and a map view when the last column contains items that can be shown on a map. The map view can show not just points but also complex geometries, and it can display millions or even tens of millions of them interactively.