Getting Started with SPARQL - strohne/Facepager GitHub Wiki
[This file is currently work in progress]
This Getting Started guide is a beginner-friendly introduction to SPARQL (pronounced "sparkle"), a semantic query language used to access and retrieve data stored in graph-like RDF (Resource Description Framework) databases such as the Culture Knowledge Graph. What you learn here transfers to any RDF database: once you know the basics of SPARQL and its syntax, you will be able to read queries written for other RDF graph databases and, more crucially, write your own queries to fetch data from all kinds of public knowledge graphs.
Now, you've already learned how to pronounce SPARQL. Great! Yet simply looking at a SPARQL query can be daunting. The structure and the vocabulary seem cryptic at first, not to mention the headache of writing a query from the ground up. That's no surprise! The way the data is stored was designed to give it meaning in a machine-readable form. While SPARQL allows us (humans) to query all kinds of databases, it still follows a syntax that was tailored to machines, not humans. In the following sections, we will break down the logic behind SPARQL step by step so that you will eventually be able to read its syntax just as easily as this introduction.
"What now are RDF Triples? Did you not want to teach me about SPARQL?!", you ask. Well, yes, but to do so, let's first take a step back: Imagine you had an appointment with the authorities. Not only is every door labelled in a different language, which makes it hard to find the correct room, you also have to speak a different language when applying for a passport than when registering a new place of residence. That would not be particularly efficient, would it? The same applies to huge databases where lots of different information is stored. To prevent users from having to learn a new language every time they want to retrieve or store data, a common standard for communicating with databases was needed. The Resource Description Framework, or RDF for short, is just that: a standard model for data interchange on the web.
RDF has a simple syntax that is based on triples consisting of a subject, a predicate, and an object (e.g. "Susan has age 42"). Several such triples can be formalised as networks or knowledge graphs. Note how the syntax does not adhere to natural language grammar ("has age" is the predicate whereas "42" is the object). Don't worry, this peculiarity becomes second nature after a short time of working with data triples.
SPARQL, on the other hand, is the semantic query language that is commonly used to query data from RDF triple stores. Because RDF dictates that data is stored in triples, we can retrieve it using triple statements as well. Thus, all SPARQL statements are made up of the same three elements: a subject, a predicate and an object. A person's age, for instance, can be formulated as follows: Susan (subject) has age (predicate) 42 (object). In principle, all data can be formalised as such triples.
?sub ?pred ?obj .
Susan has age 42 .
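To make the triple idea tangible, here is a minimal Python sketch. It is not SPARQL and does not use any RDF library: it simply stores a few invented triples as tuples and matches them against a pattern in which terms starting with ? act as variables. The data and the helper names (is_variable, match) are assumptions made for illustration.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# The data below is invented for illustration.
TRIPLES = [
    ("Susan", "has age", 42),
    ("Susan", "has name", "Susan Miller"),
    ("Tom", "has age", 35),
]


def is_variable(term):
    """A term counts as a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, triples):
    """Return one dict of variable bindings per triple that fits the pattern."""
    results = []
    for triple in triples:
        bindings = {}
        for want, have in zip(pattern, triple):
            if is_variable(want):
                # The same variable used twice must bind to the same value.
                if bindings.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            # No mismatch found: the triple fits the pattern.
            results.append(bindings)
    return results


# "Who has which age?" corresponds to the pattern: ?sub has age ?obj
ages = match(("?sub", "has age", "?obj"), TRIPLES)
# A pattern made of three variables matches every triple.
everything = match(("?sub", "?pred", "?obj"), TRIPLES)
```

Fixed terms in the pattern narrow the result, while variables collect whatever stands in their position. This is the same idea a SPARQL endpoint applies to far larger graphs.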
RDF was developed to bring different data models down to their lowest common denominator, yet it does not prescribe the format in which data is stored. Formats range from JSON-LD and RDF/XML to text formats such as Turtle. Often several formats are available to choose from, as RDF makes them interchangeable. For a detailed introduction to RDF, see Jünger and Gärtner (2023), Computational Methods for the Social Sciences and Humanities (Chapter 3.7).
Many free online databases let users execute queries via so-called SPARQL endpoints. These endpoints act as an interface between the user and the underlying RDF data, enabling queries to be submitted over the web. Well-known SPARQL endpoints are provided by DBpedia and Wikidata. NFDI4Culture also provides a public SPARQL endpoint. You will get to use them soon!
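As a rough illustration of what "submitted over the web" means, the following Python sketch builds the kind of URL a client could send to an endpoint: the SPARQL protocol allows a query to be passed in a GET parameter named query. The endpoint address below is an assumption (check the provider's documentation), build_query_url is a hypothetical helper, and the query text is only a placeholder whose syntax is explained in the following sections. No request is actually sent, so the sketch runs offline.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint address; verify it in the provider's documentation.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"


def build_query_url(endpoint, query):
    """Build a GET URL that carries the query text in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(DBPEDIA_ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")

# The query text can be recovered unchanged from the encoded URL.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

Web forms such as the DBpedia query editor do much the same in the background when you hit the execute button.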
Let's now turn to the heart of it all. SPARQL queries, as already established, allow you to search and retrieve data from RDF-formatted databases or knowledge graphs. While a SPARQL query consists of triple statements at its core, it can feature all of the following basic elements:
- PREFIX: Defines prefixes for namespaces to, one, improve the readability of queries and, two, tell what data can be queried using which vocabulary. Do not worry about it for now. We will turn to Namespaces and Vocabularies shortly.
- SELECT: Determines what results will be returned from the query.
- WHERE: Defines the patterns that are searched for in the RDF graph. Remember, all data can be found using triples, and this is where to put them. Because we want to request information, we work with placeholders or variables that are embedded in the triple structure. Variables can be recognised by the ? in front of them.
- OPTIONAL: Allows optional patterns that do not necessarily have to be present in the graph. This is especially helpful if you query for data that may not be available for every entry.
- FILTER: Used to filter results based on predefined conditions.
- SORTING PARAMETERS: Use LIMIT, GROUP BY or ORDER BY to limit, group or sort the returned results. Without an explicit instruction, the results are returned in the order in which they are processed by the RDF database or the SPARQL endpoint.
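If you know a little Python, the effect of FILTER, ORDER BY and LIMIT can be pictured with ordinary list operations. The sketch below is only an analogy on invented rows. A real endpoint evaluates these clauses itself, and the row format shown here is an assumption rather than what an endpoint returns.

```python
# Invented result rows, standing in for what a query might return.
rows = [
    {"name": "Susan", "age": 42},
    {"name": "Tom", "age": 35},
    {"name": "Ada", "age": 28},
    {"name": "Lin"},  # no age recorded: the kind of gap OPTIONAL tolerates
]

# FILTER: keep only rows that satisfy a condition (rows without age drop out).
over_30 = [row for row in rows if row.get("age", 0) > 30]

# ORDER BY: sort the remaining rows by a chosen value.
ordered = sorted(over_30, key=lambda row: row["age"])

# LIMIT: cut the result list down to the first n rows.
limited = ordered[:1]
```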
For starters, we can create a simple SPARQL query by asking for an unspecified triple of subject, predicate, and object. Remember to set a low LIMIT. The less specific your query, the higher the chance that an astronomical number of results will bring the query to its knees. Let's try to query the DBpedia SPARQL Endpoint. Simply paste the query below and hit Execute Query.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sub ?pred ?obj
WHERE {
?sub ?pred ?obj .
} LIMIT 10
This pattern matches any triple in the RDF graph, so without the LIMIT it would retrieve every triple at once. Here we only get the first 10 results. Congratulations! You have successfully created your first SPARQL query. Now, let's try something a little more ... insightful that we can actually make sense of. Let's fetch the names of ten persons stored in DBpedia. Again, paste the query below and hit Execute Query.
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
?person foaf:name ?name .
} LIMIT 10
That worked! But notice how the returned names do not seem to belong to humans but to all kinds of entities? That's because ?person is just a variable: its name carries no meaning for the database. To fetch the names of actual persons, we have to add a triple statement that tells the database that we are only looking for entities of the type person. Notice how we use a ; to separate multiple predicates that refer to the same subject:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name
WHERE {
?entity a foaf:Person ;
foaf:name ?name .
} LIMIT 10
Far better! We used the first triple to state that our variable ?entity (subject) is a (predicate) foaf:Person (object). The keyword a is SPARQL shorthand for the predicate rdf:type. This way, the database knows we are only looking for entities that are defined as a person through the FOAF (Friend of a Friend) namespace. FOAF provides terms for describing people (note how we had to define the prefix first!). Using ; we then added another triple that refers to ?entity again. Here we applied the same predicate (foaf:name) as in our previous query. This predicate points to the entity's name. Lastly, we stored the resulting object in another variable (?name). As we only selected ?name to be returned, that is all we see after running the query. Try to select ?entity as well and see what happens.
SELECT ?name ?entity
Literally, the query can be read aloud as follows: ?entity is a person (as defined by FOAF) and has the name ?name.
Now that you know the basic syntax, let's talk about prefixes, namespaces, and vocabularies. You already encountered a common namespace, namely Friend of a Friend (FOAF). A namespace identifies the vocabulary used to clearly label the elements of a triple. Vocabularies contain defined expressions for certain categories such as names or birthdays and all kinds of other information. FOAF, for example, defines predicates for names, addresses or acquaintances in a standardised manner. Other common vocabularies come from schema.org or DBpedia itself. Within a SPARQL query, we call on these vocabularies by setting a prefix. Prefixes are abbreviations of their namespaces' full URIs:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX schema: <http://schema.org/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
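What a prefix does can be sketched in a few lines of Python: the abbreviation before the colon is looked up and replaced by the full namespace URI. The table below mirrors the prefixes listed above. The function expand is a hypothetical helper written for this illustration, not part of any SPARQL tool.

```python
# Prefix table mirroring the PREFIX declarations listed above.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "schema": "http://schema.org/",
    "dbo": "http://dbpedia.org/ontology/",
    "dbp": "http://dbpedia.org/property/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn 'prefix:local' into the full URI wrapped in angle brackets."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undefined prefix: {prefix!r}")
    return "<" + prefixes[prefix] + local + ">"


full_name = expand("foaf:name")
full_person = expand("foaf:Person")
```

Writing foaf:name in a query is therefore equivalent to writing the full URI in angle brackets, which is why a query that uses an undeclared prefix is rejected.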
A collection of related terms, for example that people have a name or a date of birth, or that a book has a title and a publishing year, is called an ontology. Ontologies are formalised with languages such as the Web Ontology Language (OWL).
Facepager currently supports querying graph databases using SPARQL via the Generic module. We are working on a dedicated SPARQL module that will allow users to build and test queries more conveniently without having to leave the software. However, there is one central peculiarity that always needs to be taken into account when issuing a SPARQL query using Facepager. Facepager marks seed node placeholders such as the object ID with angle brackets: <Object ID>. As you may have noticed, the namespace URIs in SPARQL queries are usually enclosed in angle brackets as well. In order for Facepager to resolve a SPARQL query correctly, every angle bracket must be escaped with a backslash, as in \<LINK TO ONTOLOGY\>, except of course the ones marking the <Object ID> (or any other intentional placeholders for that matter). For an illustration, see the same SPARQL query as before, adapted for Facepager.
PREFIX foaf: \<http://xmlns.com/foaf/0.1/\>
SELECT ?name
WHERE {
?entity a foaf:Person ;
foaf:<Object ID> ?name .
} LIMIT 10
Now the query will work in Facepager too. Further, creating a seed node called name will replace the <Object ID> and restore our original query. However, you can also create other seed nodes, for example surname, age, birthday, and so on (check the whole FOAF vocabulary here). Each will return different data. While this is a fairly simple example, it goes to show how placeholders work in Facepager. Another use case could be to filter for other parameters. There are no limits to your creativity (apart from the availability of data ;) ).
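If you prepare longer queries outside of Facepager, escaping every angle bracket by hand gets tedious. The following Python sketch shows one way to automate it. The function escape_for_facepager is a hypothetical helper written for this page, not part of Facepager, and it assumes that placeholders are written exactly as listed in its placeholders argument.

```python
import re


def escape_for_facepager(query, placeholders=("Object ID",)):
    """Backslash-escape angle brackets but leave listed placeholders intact.

    Hypothetical helper for illustration; it is not part of Facepager.
    """
    tokens = ["<" + name + ">" for name in placeholders]
    # Try whole placeholders first, then fall back to single brackets.
    alternatives = [re.escape(token) for token in tokens] + ["[<>]"]
    pattern = re.compile("|".join(alternatives))

    def replace(match):
        text = match.group(0)
        # Intentional placeholders stay as they are; plain brackets get escaped.
        return text if text in tokens else "\\" + text

    return pattern.sub(replace, query)


original = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "?entity foaf:<Object ID> ?name ."
)
escaped = escape_for_facepager(original)
```

After this step, the brackets around the namespace URI carry a backslash while <Object ID> is left untouched, just as in the Facepager query shown above.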
[description of SPARQL module once implemented]
- A great resource to learn more about SPARQL and some of its advanced syntax quirks is Wikidata's introduction to SPARQL.
- Check out our Getting Started with Knowledge Graph to behold the combined might of SPARQL and Facepager in a hands-on tutorial.