Advantages of Linked Data

This page provides a summary of the main advantages of using linked data approaches, with the aim being to quickly give the reader an idea about whether it is worth digging deeper.

Executive Summary

There are essentially five main advantages to adopting a linked data approach. Not all of them will matter for every use case (for some use cases, none of them will).

  • Portability
  • Interoperability
  • Flexibility
  • Inference
  • Semantic querying and federation

All of these contribute towards the aim of creating FAIR data and treating data as a product - two modern philosophies around data management that are excellently served by RDF and its associated standards.

The remainder of this document goes into each of these advantages in more detail.

Portability

RDF is often described as an exchange format or a model for data interchange, and if you go to the W3C RDF home page, you'll find it described as such. Really though, that's just a fancy way of saying that you can think of RDF in the same way you think of formats like CSV and JSON - it's a way of getting the data out of whatever storage mechanism you used and transmitting it to a consumer. These various formats for information interchange are each suited to representing a different underlying model:

  • CSV/TSV is for Tables
  • JSON/YAML is for Trees
  • RDF is for Graphs

If you can appreciate the value that formats such as CSV and JSON provide, then RDF is just that same value for graphs. One other great advantage of these formats is that they're portable - you can store them in a file and send them to people, which is something you couldn't easily do if we defaulted to exchanging all of our data by passing databases around.

Speaking of databases, another strange conclusion I often see people drawing is that when you are working with a graph data model, a graph database suddenly becomes necessary. This isn't true – so long as you can map back to a graph model (e.g. RDF) at the API layer, you can choose any kind of database you want. Choosing JSON as your exchange format doesn't force you into a particular database, and neither does choosing RDF. As an aside, the average JSON API is normally just a short hop away from being a JSON-LD API, which would make it interoperable with all other linked data – it is not necessary to adopt RDF at the data storage layer in order to take advantage of linked data.
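
To make that "short hop" concrete, here's a rough sketch in Python using the rdflib library (which bundles a JSON-LD parser in recent versions); the API payload and the mapping onto schema.org are made up for the example:

import json
from rdflib import Graph

# A perfectly ordinary JSON API response (hypothetical data).
api_response = [
    {"id": "https://api.example.org/pools/1", "name": "Oasis"},
    {"id": "https://api.example.org/pools/2", "name": "Brockwell"},
]

# The "short hop": add a context that maps the existing keys onto
# well-known IRIs. The payload itself doesn't change.
jsonld_document = {
    "@context": {
        "id": "@id",
        "name": "https://schema.org/name",
    },
    "@graph": api_response,
}

# Any linked data tool can now consume the response as RDF.
g = Graph()
g.parse(data=json.dumps(jsonld_document), format="json-ld")
print(g.serialize(format="turtle"))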

We can’t ignore the fact that some data is just intrinsically suited to being represented by a table or a tree, but we also can’t ignore the fact that people often default to these formats purely because they are popular rather than because they’re what's most suitable – and if you’ve ever tried to represent something with relationships in CSV format, you’ll know exactly what we're talking about.

Interoperability

Interoperability is about bringing together data from multiple disparate and unpredictable sources and being able to combine it easily, in both a syntactic sense and a semantic sense.

Let's deal with syntactic interoperability first because it's easier to talk about but also often overlooked: two APIs that have both adopted linked data will just slot right together. We don't have to worry about the shape of our data or about columns not matching – it will just work. So, as simple as this principle may be, it completely removes the requirement to manipulate data before merging and means that we can automate the process of merging files together without having to involve engineers. Even if people have used different RDF serialization formats (e.g. Turtle and JSON-LD), converting one to the other is trivial to automate because it'll be the same process every time we do it, no matter what the underlying data represents.
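
As a sketch of how mechanical that is, here are two fragments of RDF from unrelated (made-up) sources in different serializations being merged and re-serialized with Python's rdflib library:

from rdflib import Graph

# Two snippets of linked data from unrelated, hypothetical sources,
# one in Turtle and one in JSON-LD.
turtle_data = """
@prefix sdo: <https://schema.org/> .
<https://example.org/pools/1> sdo:name "Oasis" .
"""

jsonld_data = """
{
  "@id": "https://example.org/people/42",
  "https://schema.org/name": "Ada"
}
"""

# Merging is just "parse both into the same graph"...
merged = Graph()
merged.parse(data=turtle_data, format="turtle")
merged.parse(data=jsonld_data, format="json-ld")

# ...and converting between formats is just "serialize in the other format".
print(merged.serialize(format="turtle"))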

Semantic interoperability is slightly trickier to explain, so let’s use an example. Imagine I sent you this data with no explanation of what it represented, and you had designs on merging it together with other data:

[
  {
    "id": "1",
    "name": "Oasis",
    "date": 1990
  },
  {
    "id": "2",
    "name": "Brockwell",
    "date": 1985
  },
  {
    "id": "3",
    "name": "Ironmonger Row",
    "date": 1965
  }
]

You would, rightly, be confused about what this data actually was – it could represent anything! Now that it's just a series of characters on a GitHub Wiki page, this data has become untethered from whatever meaning or context it once had. This is a situation you can expect to encounter a lot if you're traversing distributed graphs or graphs that you've pulled together from different sources, and that's not the only problem: whatever these things are, they all have IDs which aren't universally unique – the likelihood of another dataset also containing something with an ID of "3" is high enough that we can't really accept the risk, so we have that problem to deal with too.

Luckily, RDF requires nodes and edges to be named with IRIs (Internationalised Resource Identifiers) – globally unique identifiers that play a similar role to UUIDs – which means that if two graphs use the same IRI, you have a guarantee that they're talking about the same thing.

If I'd used an RDF format such as JSON-LD to represent the same data as above and added in a bit more metadata, it would look more like this…

{
  "@context": {
    "sdo": "https://schema.org/"
  },
  "@graph": [
    {
      "@id": "http://www.example.com/1",
      "@type": "sdo:PublicSwimmingPool",
      "sdo:name": "Oasis",
      "sdo:foundingDate": 1990
    },
    {
      "@id": "http://www.example.com/2",
      "@type": "sdo:PublicSwimmingPool",
      "sdo:name": "Brockwell",
      "sdo:foundingDate": 1985
    },
    {
      "@id": "http://www.example.com/3",
      "@type": "sdo:PublicSwimmingPool",
      "sdo:name": "Ironmonger Row",
      "sdo:foundingDate": 1965
    }
  ]
}

With very little extra effort on the part of the provider, this data is now self-describing and comes packaged alongside its own meaning – you can now clearly see that we're talking about Swimming Pools, and because we've used web URLs for our IDs, we know they're unique (because web URLs are unique). You can therefore now say that this data is semantically interoperable: you can give it to anyone else and they won't need to worry about any of the problems discussed above.
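
You can see this if you load the document above into an RDF library and look at what it actually contains – every key and type has expanded into a full IRI, so nothing depends on the file it arrived in. A small sketch with Python's rdflib (the filename is hypothetical):

from rdflib import Graph

g = Graph()
g.parse("swimming_pools.jsonld", format="json-ld")  # the JSON-LD document above, saved to a file

# Every subject, predicate and type is now a full IRI (the names and dates
# are literal values), so merging this with any other graph is safe.
for subject, predicate, obj in g:
    print(subject, predicate, obj)

# One of the printed triples, for example:
#   http://www.example.com/1  https://schema.org/name  Oasis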

Flexibility

A lack of flexibility in an information system causes problems. At best, an inflexible system makes it hard to change an established schema without breaking things; at worst, it means we must build entire new applications purely to accommodate differently shaped data.

Consider a blogging application: Normally, there’s a database that’s been built specially to hold blog posts or blog related things (e.g. there might be a “comments” table) and then probably there’ll be an associated API (with endpoints like “GET /posts/1”) and a user interface tailored entirely to the idea of blogging. This setup is not flexible: we could not use this same software to organize our sock collection or create an online shop – to do that, we’d likely need a new database with different tables, a new application, a new API and so on. Wouldn’t it be great though if we could instead build a general purpose application that would work for any type of data and let our users do most of what they wanted to do with it?

People describe RDF as flexible because the core atomic data element within an RDF graph – the “triple” – can be used to represent any type of data and is inherently schema-less. The schema becomes more of an optional “value add” to build meaning around your data rather than being something that we need to provide in full up front before we can do anything. Further to this, any schema details we do provide are specified as data, meaning that they’re more easily changed.

Here’s an example – consider the following three triples – the first two lines represent "some data" and the third line represents “data about our data” (a.k.a. an ontology, or a schema):

:Luke a :Rebel .
:Leia a :Rebel .
:Rebel rdfs:subClassOf :Person .

Now look at these three other triples, which are pretty similar but which talk about a completely different topic:

:Rust a :CompiledProgrammingLanguage .
:CPlusPlus a :CompiledProgrammingLanguage .
:CompiledProgrammingLanguage rdfs:subClassOf :ProgrammingLanguage .

If we were to build an application that allows users to work with and explore data modelled in a predictable way like this, then no matter whether it was the first example or the second example...

  • our storage could stay the same
  • our API could stay the same
  • our user interface could stay the same
  • the way users query data could stay the same
  • ...many other parts of our system could stay the same, maybe even all of it.
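
Here's a small sketch of that generality, using Python's rdflib: one generic "what things are of this type?" function serves both datasets, knowing nothing about rebels or programming languages (the namespace behind the ":" prefix is an assumption made up for the example).

from rdflib import Graph, Namespace, RDF

EX = Namespace("https://example.org/")  # assumed expansion of the ":" prefix

star_wars = """
@prefix : <https://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:Luke a :Rebel .
:Leia a :Rebel .
:Rebel rdfs:subClassOf :Person .
"""

languages = """
@prefix : <https://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
:Rust a :CompiledProgrammingLanguage .
:CPlusPlus a :CompiledProgrammingLanguage .
:CompiledProgrammingLanguage rdfs:subClassOf :ProgrammingLanguage .
"""

def members_of(graph, cls):
    """Generic 'list everything of this type' – knows nothing about the domain."""
    return list(graph.subjects(RDF.type, cls))

for data, cls in [(star_wars, EX.Rebel), (languages, EX.CompiledProgrammingLanguage)]:
    g = Graph()
    g.parse(data=data, format="turtle")
    print(members_of(g, cls))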

It should be noted that this added flexibility doesn’t mean that the complexity of managing a schema goes away – it’s just that we’ve shifted that complexity into the part that’s more easily changed (the data) rather than the part that’s more painful to change (the code and infrastructure). In doing so, we have also potentially put the management of this aspect into the hands of our users and removed the necessity for engineers to be involved every time the shape of the data changes.

Inference

A core part of RDF is the use of IRIs – universally unique IDs – to represent both resources and the properties that connect them. It means, as Google famously put it, that we're talking about "things, not strings", and the side-effect of this is that we can tie particular properties to formal specifications, allowing the development of powerful inference rules that can be used to:

  • Surface hidden knowledge
  • Infer new knowledge
  • Check for the existence of contradictions (where you’ve said something that can’t possibly be true)
  • Check whether something is unsatisfiable (where you’ve defined a set of rules that are inherently pointless)

I always feel like it's worth knocking a bit of the wind out of this whenever I discuss it with people: obviously, all of the above sound like very desirable things to have, but none of them come for free, and it's unlikely that you'll achieve any useful inference without having a good idea of what it is you're trying to infer.

Here's an example using the OWL ontology language, which has an "equivalent class" axiom (owl:equivalentClass). We use this to describe our data, and that lets us perform some inference:

# Given
:Frank a :Developer .
:Bob a :Programmer .
:Developer owl:equivalentClass :Programmer .

# We can infer
:Frank a :Programmer .
:Bob a :Developer .

That owl:equivalentClass property is universally unique and is what's known as "metadata" - that is, data about your data. Because of this universal uniqueness, any time you see owl:equivalentClass you can guarantee that it means the exact same thing and is tied to the same spec, allowing you to perform the same inference regardless of the data it's being applied to. If your data contains no metadata at all then you likely can't perform much inference; to get good stuff out, you need to put good stuff in. The more you describe your data and the more meaning you pack into your graph, the more likely it is that you'll be able to perform inference like this.
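
Here's a sketch of that inference running in code, using Python's rdflib together with the third-party owlrl package (a reasoner implementing the OWL 2 RL rules); the expansion of the ":" prefix is an assumption made up for the example:

from rdflib import Graph
from owlrl import DeductiveClosure, OWLRL_Semantics

data = """
@prefix : <https://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

:Frank a :Developer .
:Bob a :Programmer .
:Developer owl:equivalentClass :Programmer .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Materialise everything the OWL RL rules can infer from the data plus its metadata.
DeductiveClosure(OWLRL_Semantics).expand(g)

# The graph now also contains :Frank a :Programmer and :Bob a :Developer.
print(g.query("ASK { <https://example.org/Frank> a <https://example.org/Programmer> }").askAnswer)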

You could use inference to perform single, simple operations like the one in the example above, and for your purposes that's maybe all you need. Alternatively, you could get really advanced and build an all-singing, all-dancing ontology that describes the whole minutiae of how your data should be structured and how it interacts. Or you could hook into an existing ontology and use all the hard work someone else has already done – in the simplest form this might involve using a bunch of universally unique labels for things like classes and properties, and if you're lucky the ontology author might also have put in a bunch of additional metadata that gives you access to the power of inference using their rules rather than ones you write yourself. Imagine that: you're storing some data about "something", and you can just go and get an established, super-powered, inference-ready schema off the shelf rather than writing your own from scratch.

Semantic Querying and Federation

This seems like an obvious thing to say, but in many information systems the relationship between entities isn't explicit – it's implicit. Ironically, this is particularly prevalent in so-called relational databases: you can create two tables and conceive in your head that the records in those tables are related to each other (e.g. users to user_profiles), but in reality that relationship is only established at the point you query the data – up to that point it's just an abstract idea.

The same is not the case in RDF, because the relationship between things is right there in the data, explicitly defined. What this gets you, from a "user asking questions" perspective, is a more context-aware search that enables users to query based on the precise relationships between things without having to understand any kind of underlying table structure. It also opens you up to using the expressive query language SPARQL, which allows users to look for patterns within graphs as well as ask more general existential questions. Here's some SPARQL that finds anyone who knows :Bob.

SELECT ?person
WHERE { ?person :knows :Bob }

Notice how we don't need to name any tables or do any joins or any of that other horrible stuff we usually shield our users from? And with queries this simple, just think how easy it would be to put this power in the hands of our users, or how easy it would be to abstract it away in a visualisation or behind a general purpose user interface?
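
For illustration, here's that query running over a tiny graph with Python's rdflib (the data and the expansion of the ":" prefix are made up for the example):

from rdflib import Graph

data = """
@prefix : <https://example.org/> .
:Alice :knows :Bob .
:Carol :knows :Dave .
"""

g = Graph()
g.parse(data=data, format="turtle")

query = """
PREFIX : <https://example.org/>
SELECT ?person
WHERE { ?person :knows :Bob }
"""

# No table names, no joins – just the pattern we're looking for.
for row in g.query(query):
    print(row.person)  # -> https://example.org/Alice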

Combine this with the interoperability discussed earlier, and it turns out that from something as simple as "store things as triples, represent resources as IRIs" comes the ability to federate these queries across multiple data sources.
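
SPARQL 1.1 provides the SERVICE keyword for exactly this: part of a query is delegated to a remote endpoint and the results are joined with the rest. As a sketch (in Python, using the SPARQLWrapper library; both endpoint URLs are placeholders, and the queried endpoint is assumed to support federated queries):

from SPARQLWrapper import SPARQLWrapper, JSON

# Ask one (hypothetical) endpoint a question, part of which is answered by
# a second, remote endpoint via the SERVICE clause.
endpoint = SPARQLWrapper("https://query.example.org/sparql")
endpoint.setQuery("""
PREFIX : <https://example.org/>
SELECT ?person ?name
WHERE {
  ?person :knows :Bob .
  SERVICE <https://other.example.org/sparql> {
    ?person <https://schema.org/name> ?name .
  }
}
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["person"]["value"], binding["name"]["value"])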