Wikidata, Linked Data, SPARQL - bounswe/bounswe2024group10 GitHub Wiki
- Wikidata is a free and open knowledge base that can be read and edited by both humans and machines. Wikidata acts as central storage for the structured data of its Wikimedia sister projects including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others.
- Loosely, Wikidata can be described as Wikipedia's database, with over 46 million data items (as of April 2018). Wikidata also provides support to many other sites and services beyond Wikimedia projects.
Like Wikipedia, Wikidata stores all kinds of data. When you are looking for a specific dataset or want to answer a curious question, Wikidata can be a good place to start.
Example questions:
- What is the capital city of every member of the European Union and how many inhabitants live there?
- Which countries use 112 as an emergency number?
- A free and open knowledge base that can be read and edited by both humans and machines.
- It contains various data types (e.g. text, images, quantities, coordinates, geographic shapes, dates).
- It uses SPARQL.
- SPARQL is a query language for RDF databases. In contrast to relational databases, which are queried with SQL, items are not stored in tables. Instead, items are linked with each other like a graph or network.
To describe these relations, we can use triples. A triple is a statement containing a subject, a predicate, and an object:
- Germany (subject) has the capital (predicate) Berlin (object).
- Berlin (subject) has the population (predicate) 3.5 million (object).
- The European Union (subject) has the member (predicate) Germany (object).
- Germany (subject) is a member of (predicate) the European Union (object).
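The statements above can be sketched in code as plain (subject, predicate, object) tuples. This is only a minimal Python illustration of the graph idea, not how Wikidata stores its data; the names `triples`, `graph`, and `objects_of` are made up for this example, and readable labels stand in for real identifiers.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) tuple.
triples = [
    ("Germany", "capital", "Berlin"),
    ("European Union", "has member", "Germany"),
    ("Germany", "member of", "European Union"),
]

# Index the triples as a graph: node -> list of (edge label, neighbouring node).
graph = defaultdict(list)
for subject, predicate, obj in triples:
    graph[subject].append((predicate, obj))


def objects_of(subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [obj for pred, obj in graph[subject] if pred == predicate]


print(objects_of("Germany", "capital"))    # ['Berlin']
print(objects_of("Germany", "member of"))  # ['European Union']
```

Because every item is just a node with labelled edges, adding a new kind of relationship needs no schema change, only another tuple.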
If you want to dive deeper into the concept of SPARQL, I recommend this YouTube video.
To get data from Wikidata, you use triples (like the ones above) to write a SPARQL query. Let's look at what such a query might look like. Note that we use specific identifiers to define the right relationship and item:
SELECT ?country
WHERE {
  ?country wdt:P463 wd:Q458.  # country, member of, European Union
}
Here, we simply ask for the countries that are part of the European Union.
Do you recognize the subject-predicate-object statement? We just select those countries, for which the condition holds: the country ( ?country ) is a member of (wdt:P463) the European Union (wd:Q458).
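To run a query like this from a program, one option is to send it over HTTP to a SPARQL endpoint. The sketch below only builds the request, using the Python standard library, and does not send it. It assumes the public Wikidata Query Service endpoint at `https://query.wikidata.org/sparql`; the function name `build_request` and the `User-Agent` string are placeholders invented for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed public endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?country
WHERE {
  ?country wdt:P463 wd:Q458.  # country, member of, European Union
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a GET request that carries the SPARQL query."""
    url = endpoint + "?" + urlencode({"query": query, "format": "json"})
    headers = {
        "Accept": "application/sparql-results+json",
        # Placeholder value; replace with a string identifying your own tool.
        "User-Agent": "example-wiki-sketch/0.1",
    }
    return Request(url, headers=headers)


request = build_request(QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually execute the query, which requires network access and is left out here on purpose.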
Linked Data is structured data that is interlinked with other data, which makes it more useful through queries over the internet. It builds upon standard Web technologies such as HTTP, RDF, and URIs. Instead of merely serving web pages for human readers, Linked Data uses these technologies to share information in a way that computers can read automatically.
One of the visions of the linked data concept is for the Internet to become a global database, making interconnected data widely accessible. Note that no single entity owns or controls all Linked Data. It's a collaborative effort of organizations, individuals, and institutions. By using RDF, websites can provide structured and semantically rich data, enabling better interoperability between different systems and applications.
Linked data is represented by a graph data structure, providing more flexibility at the entity level compared to tabular data representation. Google has Knowledge Graph, and Facebook has OpenGraph for this purpose.
RDF, which stands for Resource Description Framework, is a framework for representing information on the web. It is a specification developed by the World Wide Web Consortium (W3C) for describing resources on the web in a machine-readable way. Note that not every website on the internet might adopt RDF standards.
The Linked Data concept is built upon RDF triples. An RDF triple follows the Subject - Predicate - Object pattern:
- A. Painter painted A Nice Day
- A. Painter has the gender female
For relations and results to be meaningful, each node should be unique so that two different clients will not see different results related to that node.
Every authority has a unique Uniform Resource Identifier (URI). For example, in the Virtual International Authority File (VIAF), every authority record has a unique ID.
A triple store is a purpose-built database for storing RDF triples.
SPARQL is the language to query and retrieve RDF triples.
Below is a simplified example of an HTML5 biography page that adopts RDF principles using JSON-LD (JSON for Linked Data):
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Biography Page</title>
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "John Doe",
    "birthDate": "1990-01-15",
    "birthPlace": {
      "@type": "Place",
      "name": "Cityville, USA"
    },
    "alumniOf": {
      "@type": "CollegeOrUniversity",
      "name": "University of Example"
    },
    "jobTitle": "Web Developer",
    "worksFor": {
      "@type": "Organization",
      "name": "Example Company"
    },
    "description": "John Doe is a web developer with a passion for creating interactive and user-friendly websites.",
    "image": "john-doe.jpg"
  }
  </script>
</head>
<body>
  <header>
    <h1>John Doe</h1>
  </header>
  <section>
    <h2>About Me</h2>
    <p>John Doe is a web developer with a passion for creating interactive and user-friendly websites.</p>
  </section>
  <section>
    <h2>Biographical Information</h2>
    <ul>
      <li><strong>Birth Date:</strong> January 15, 1990</li>
      <li><strong>Birth Place:</strong> Cityville, USA</li>
      <li><strong>Alma Mater:</strong> University of Example</li>
    </ul>
  </section>
  <section>
    <h2>Professional Details</h2>
    <ul>
      <li><strong>Job Title:</strong> Web Developer</li>
      <li><strong>Works For:</strong> Example Company</li>
    </ul>
  </section>
  <section>
    <h2>Image</h2>
    <img src="john-doe.jpg" alt="John Doe">
  </section>
</body>
</html>
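To see why the embedded JSON-LD is machine-readable, the following Python sketch pulls it out of a page and flattens it into subject-predicate-object tuples. It uses only the standard library and a shortened copy of the biography page. The flattening is a deliberate simplification, not the official JSON-LD to RDF conversion: the class `JsonLdExtractor`, the function `to_triples`, and the `_:` node labels are all invented for this illustration.

```python
import json
from html.parser import HTMLParser

# Shortened copy of the biography page; only the JSON-LD part matters here.
PAGE = """
<html><head>
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "John Doe",
  "jobTitle": "Web Developer",
  "worksFor": {"@type": "Organization", "name": "Example Company"}
}
</script>
</head><body><h1>John Doe</h1></body></html>
"""


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> element."""

    def __init__(self):
        super().__init__()
        self.in_json_ld = False
        self.buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_json_ld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_json_ld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_json_ld:
            self.documents.append(json.loads("".join(self.buffer)))
            self.in_json_ld = False


def to_triples(node, subject="_:person"):
    """Flatten a JSON-LD object into (subject, predicate, object) tuples.

    Nested objects get a made-up node label derived from their key.
    """
    triples = []
    for key, value in node.items():
        if key == "@context":
            continue
        if isinstance(value, dict):
            child = "_:" + key
            triples.append((subject, key, child))
            triples.extend(to_triples(value, child))
        else:
            triples.append((subject, key, value))
    return triples


extractor = JsonLdExtractor()
extractor.feed(PAGE)
for triple in to_triples(extractor.documents[0]):
    print(triple)
```

A crawler does not need to interpret the visible HTML at all: the `<script>` block alone tells it that the page describes a person, what the person's job title is, and which organization they work for.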
SPARQL (pronounced "sparkle") stands for SPARQL Protocol and RDF Query Language. It is a powerful query language and protocol used to retrieve and manipulate data stored in RDF (Resource Description Framework) format. RDF is a standard model for data interchange on the web, providing a way to describe resources and their relationships in a machine-readable format.
SPARQL provides a query language that allows users to express queries against RDF datasets. These queries can range from simple searches to complex operations involving multiple datasets and conditions.
SPARQL also includes a protocol for communication between clients and servers, enabling applications to interact with remote RDF stores. This protocol defines how queries are sent to the server and how results are returned to the client.
The basic unit of data in RDF is a triple, consisting of a subject, predicate, and object. SPARQL queries are constructed using triple patterns, which specify conditions that must be met by matching triples in the dataset.
SPARQL queries can include variables, which are placeholders for values that will be matched against the data in the dataset. When a query is executed, these variables are bound to specific values found in the dataset, resulting in a set of bindings that satisfy the query conditions.
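The way variables get bound can be sketched with a toy matcher. This is a minimal Python illustration for a single triple pattern over an in-memory list, not how a real SPARQL engine is implemented; the function names and the sample data are assumptions made for this example.

```python
def is_variable(term):
    """Variables are written SPARQL-style, with a leading question mark."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triples):
    """Return one binding dict per triple that matches the triple pattern."""
    solutions = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if is_variable(term):
                # A variable used twice must bind to the same value both times.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No position failed to match, so this triple is a solution.
            solutions.append(binding)
    return solutions


# Illustrative data; the labels stand in for real identifiers.
triples = [
    ("Germany", "member of", "European Union"),
    ("France", "member of", "European Union"),
    ("Germany", "capital", "Berlin"),
]

print(match_pattern(("?country", "member of", "European Union"), triples))
# [{'?country': 'Germany'}, {'?country': 'France'}]
```

Fixed terms in the pattern act as filters, while each variable collects whatever value sits in its position in a matching triple.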
SPARQL queries can return results in various formats, including XML, JSON, and CSV. This flexibility allows developers to integrate SPARQL queries into a wide range of applications and systems.
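As an example of consuming one of these formats, the sketch below reads a result document in the SPARQL JSON results layout, where variable names are listed under `head.vars` and each solution appears under `results.bindings`. The sample is hand-written for illustration rather than captured from a real endpoint, and `rows_from_results` is a hypothetical helper name.

```python
import json

# Hand-written sample in the SPARQL JSON results layout; not real endpoint output.
SAMPLE = """
{
  "head": {"vars": ["country"]},
  "results": {
    "bindings": [
      {"country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q183"}},
      {"country": {"type": "uri", "value": "http://www.wikidata.org/entity/Q142"}}
    ]
  }
}
"""


def rows_from_results(text):
    """Turn a SPARQL JSON result document into a list of plain dicts."""
    document = json.loads(text)
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        # A variable may be unbound in a given solution, hence the None default.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


for row in rows_from_results(SAMPLE):
    print(row["country"])
```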
- SELECT clause: Specifies the variables to be returned in the query results.
- WHERE clause: Defines the conditions that must be met by matching triples in the dataset.
- Modifiers: Optional clauses that allow users to control the behavior of the query, such as limiting the number of results or ordering them based on certain criteria.
SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?object
}
LIMIT 10
This query selects the subject, predicate, and object of all triples in the dataset and limits the results to the first 10 matches.
SPARQL is widely used in the development of semantic web applications, which aim to make web content more accessible and understandable to both humans and machines.
SPARQL can be used to query and integrate data from multiple sources, allowing users to perform complex analyses and gain insights from diverse datasets.
Many knowledge graphs, such as DBpedia and Wikidata, provide SPARQL endpoints that allow users to query the data contained within these graphs and retrieve relevant information.
SPARQL plays a crucial role in the Linked Data initiative, which seeks to interconnect datasets on the web and enable seamless navigation between related resources.