# SHACL Guide
SHACL (the Shapes Constraint Language) is a W3C recommendation created by the Data Shapes Working Group. The group was chartered to produce a language for expressing constraints for RDF data - something which would allow developers to perform common data tasks such as verifying data integrity (a.k.a. data validation) and publishing the intended shapes within the graph (a.k.a. defining the interface).
In a nutshell - SHACL is a vocabulary that allows you to express validation rules as application-agnostic data. You would then run this data through one of the many available SHACL processors, which would tell you whether or not your data conforms to your validation rules. Being based on RDF, it also carries the standard linked data advantages of being human readable, interoperable, flexible and portable.
The specification itself was developed over a number of years and finally ratified by the W3C in 2017. The Data Shapes Working Group produced four documents in total before concluding the project.
- The SHACL standard itself.
- A note about Advanced SHACL Features such as the definition of user defined functions and rules.
- A note about SHACL JS Extensions, which are mechanisms that allow the extension of SHACL functionality through the use of JavaScript.
- The requirements gathering document they used to build the vocabulary, which also documents some of the intended use cases.
A community group was also assembled alongside the working group. Among the things they produced was a proposal for a more compact SHACL syntax.
Prior to SHACL, there had already been several attempts to bring similar functionality to the RDF ecosystem. Of note are several W3C member submissions which were eventually used as inspiration for the final SHACL design: IBM's Resource Shapes Specification, TopQuadrant's SPIN modelling vocabulary, and ShEx.
## Why would you want to represent validation rules as data?
A core advantage of linked data is that it allows data to be untethered from an application and still make sense - the data essentially becomes a general purpose artefact rather than something that only works with some applications. Validation rules are a classic example of something that often gets tightly coupled directly to applications and duplicated all over a codebase: the user interface, the API and the database layer might all need access to the same validation rules and want to keep them in sync with each other. It therefore makes sense that these validation rules should themselves be considered part of the data, decentralised and abstracted into a nice portable format that can be re-used.
It also follows that because they're recorded as data, these validation rules can accompany the data along the wire to remote locations, meaning the governance rules and API definitions go wherever the data goes.
Having the validation rules and API asserted in an application-agnostic, machine-readable format also allows for more automation at the application layer. RDF is a major enabler for the creation of general purpose applications, where data entry and querying can be seen as general purpose activities rather than bespoke things that every application needs to create from scratch every time.
## Overview of SHACL
The core idea behind `shacl` is that you specify shapes which can be applied to nodes or properties, and then you add constraints to those shapes. Together, these shapes and constraints make up a so-called "shapes graph", which you can then use to validate data by indicating which resources within the data graph should be the targets of the shapes graph.
Here's some example data describing three individuals. This is what SHACL would consider a "data graph".
```turtle
@prefix ex: <http://example.org/ns#> .
@prefix schema: <http://schema.org/> .

ex:Bob
  a schema:Person ;
  schema:givenName "Robert" ;
  schema:birthDate "1975-01-23" .

ex:Alice
  a schema:Person ;
  schema:givenName 12345 .

ex:Ted
  a schema:Person .
```
In the world of RDF, everything is incredibly open and flexible - you could add any new properties to this graph and the data would remain completely valid. This flexibility is fantastic, but becomes problematic from a data governance and integrity perspective (especially if this is data that users are contributing). Let's say we wanted to add some constraints to the above data graph.
- We want to indicate that the given name is a required field.
- We want to ensure that the given name is a string.
In the case of the data above, there are two issues. Firstly, `ex:Alice`'s name is not a string, and secondly `ex:Ted` doesn't have a name at all. We can write validation rules to pick up on both of these with the following shapes graph.
```turtle
@prefix schema: <http://schema.org/> .
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/ns#> .

ex:PersonShape
  a sh:NodeShape ;
  sh:targetClass schema:Person ;
  sh:property [
    # This is a property shape
    sh:path schema:givenName ;
    sh:minCount 1 ;
    sh:message "given name is a required field" ;
  ] ;
  sh:property [
    # This is another property shape
    sh:path schema:givenName ;
    sh:datatype xsd:string ;
    sh:message "given name must be of type string" ;
  ] .
```
This shape targets nodes of type `schema:Person` using the `sh:targetClass` axiom, although this is not the only method of specifying which things the constraint should affect. In the case of `sh:targetClass`, the SHACL spec also indicates that class targeting applies transitively down the class hierarchy, meaning that instances of all sub-classes of `schema:Person` will also be targeted.
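For example, a shape can also target one explicitly named node, or every subject of a given predicate. Here's a brief sketch, assuming the prefixes declared above (the shape names are purely illustrative):

```turtle
ex:BobShape
  a sh:NodeShape ;
  sh:targetNode ex:Bob .                    # applies only to the named resource

ex:NamedThingShape
  a sh:NodeShape ;
  sh:targetSubjectsOf schema:givenName .    # applies to every subject of schema:givenName
```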
Both the data graph and the shapes graph would then be run through a SHACL processor, which should deliver back a validation report informing us whether our data was valid (and if not, why not) - the validation report itself is also just RDF data! The format of this report is also detailed within the SHACL specification and should look something like this.
```turtle
[
  a sh:ValidationResult ;
  sh:resultSeverity sh:Violation ;
  sh:sourceConstraintComponent sh:DatatypeConstraintComponent ;
  sh:sourceShape _:n3889 ;
  sh:focusNode ex:Alice ;
  sh:value 12345 ;
  sh:resultPath schema:givenName ;
  sh:resultMessage "given name must be of type string" ;
] .

[
  a sh:ValidationResult ;
  sh:resultSeverity sh:Violation ;
  sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
  sh:sourceShape _:n3888 ;
  sh:focusNode ex:Ted ;
  sh:resultPath schema:givenName ;
  sh:resultMessage "given name is a required field" ;
] .
```
That's a whole ton of useful information for every item that failed - enough for us to perform lots of lovely automation and perhaps display some useful feedback to users telling them what was wrong with their data. This feedback can all be customised within the definition of the constraint itself (factors like the message and the severity).
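As a sketch of that customisation, re-using the first property shape from earlier, you could add `sh:severity` to downgrade the missing-name check from the default violation to a warning (whether a warning is appropriate here is, of course, an assumption about your use case):

```turtle
ex:PersonShape
  sh:property [
    sh:path schema:givenName ;
    sh:minCount 1 ;
    sh:severity sh:Warning ;   # reported as sh:Warning instead of the default sh:Violation
    sh:message "given name is a required field" ;
  ] .
```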
So SHACL has given us a way to constrain the value of particular properties, but what if we wanted even tighter control over our data and wanted to limit what people are permitted to say about a particular resource entirely? In the world of RDF, we assume everything to be open and flexible (this is in fact one of RDF's main advantages), but there are lots of good reasons why you might want to disable that flexibility in particular scenarios - there may be certain bits of data, for example, that you want to ensure do NOT go into your system for compliance reasons.
SHACL allows you to specify that a particular shape is `closed` - if this is the case, then the processor will only allow properties that are explicitly named within the shape's property shapes. Here's how you'd close down the shape above.
```turtle
ex:PersonShape
  a sh:NodeShape ;
  # ... property shapes as per above ...
  sh:closed true ;
  sh:ignoredProperties ( rdf:type ) .
```
Note that as well as specifying the shape as being closed, we can also specify an ignore list for particular properties (useful, otherwise we'd have to add a bunch of superfluous extra constraints to indicate that `rdf:type` is an allowed property). If we add in the above, we get an extra bit of feedback when we validate...
```turtle
[
  a sh:ValidationResult ;
  sh:resultSeverity sh:Violation ;
  sh:sourceConstraintComponent sh:ClosedConstraintComponent ;
  sh:sourceShape ex:PersonShape ;
  sh:focusNode ex:Bob ;
  sh:resultPath schema:birthDate ;
  sh:value "1975-01-23" ;
  sh:resultMessage "Predicate is not allowed (closed shape)" ;
] .
```
There are many built-in axioms within SHACL for handling most common validation concerns - but SHACL is also extensible, and allows you to add custom validation constraints using either SPARQL or JavaScript. SPARQL constraints are discussed in more detail later in this article.
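To give a flavour of those built-ins, here is a brief sketch of a few more constraint components from SHACL Core (the specific values are purely illustrative, and `schema:gender` is just an example path):

```turtle
ex:PersonShape
  sh:property [
    sh:path schema:givenName ;
    sh:minLength 1 ;                        # string length bounds
    sh:maxLength 100 ;
    sh:pattern "^[A-Z]" ;                   # must match a regular expression
  ] ;
  sh:property [
    sh:path schema:gender ;
    sh:in ( "female" "male" "other" ) ;     # value must come from an enumerated list
  ] .
```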
## Doesn't OpenAPI/Swagger already do all of this?
The OpenAPI specification concerns itself with many of the same things that SHACL does, but is aimed specifically at the domain of JSON based Web APIs, whereas SHACL works with multiple data formats (as long as they're RDF compatible) and is agnostic to what it is you're actually doing with your data.
## Okay, but doesn't JSON Schema also already do all of this?
Yes, but again that only works for JSON documents whereas SHACL is more general purpose and graph-based. It's also much more expressive and extensible - whilst JSON Schema does allow for users to add new keywords into their schema documents, the caveat is that users should not expect any of these extended keywords to actually do anything when documents are processed. Conversely, the SHACL specification defines several mechanisms for extending the vocabulary which compatible processors will understand, regardless of how bespoke the extension was.
## SHACL vs OWL/RDFS
You might be thinking that `owl` and `rdfs` already address the problem of validation and interface specification. Both of those vocabularies concern themselves with defining schemas/ontologies - isn't that basically the same thing as defining an interface? Can't we in fact also use a schema to validate our data? Understandably this is confusing, and further confounded by the fact that the word "valid" when relating to the `owl` vocab doesn't really have anything to do with anyone having broken any validation rules.
The `shacl` vocabulary, on the other hand, is concerned with validation in the sense that software engineers typically mean, and with API specification. OWL and RDFS exist to describe data and to allow for inference by applying reasoning logic; however, both are governed by a couple of assumptions that are problematic when it comes to data integrity checking:
- Open World assumption - The assumption that there might be data in addition to what's already in your dataset.
- Non-Unique Naming assumption - The assumption that things with different names might actually be the same thing (i.e. neither vocabulary makes the Unique Name Assumption).
These assumptions, both fundamental to `owl` and `rdfs`, make those vocabularies ill-suited to data-constraining tasks like interface definition or data validation - in fact, both assumptions work to ensure that data is fundamentally un-constrained, and this is very much by design. Early attempts to shoe-horn data validation functionality into these vocabularies involved basically ignoring the above assumptions and using the axioms from `rdfs`/`owl` in ways they were not designed to be used.
It wasn't until the creation of `shacl` that the linked data community finally had a ratified vocabulary to constrain and validate data. `shacl` operates on a closed world - it makes no assumptions about data that might exist elsewhere. If you assert that someone needs to have a first name, and that first name is not in your data graph, that's the end of the discussion - your dataset fails validation!

Let's take each of these vocabularies (`rdfs` and `owl`) in turn and address why they aren't suitable for data validation.
### SHACL vs RDFS
We can remove `rdfs` from the validation conversation fairly easily - there is no statement you can make using `rdfs` that would make your data invalid. Even if you write something like the following:
```turtle
:livesIn
  rdfs:domain foaf:Person ;
  rdfs:range schema:Country .

:bob :livesIn :united_kingdom .
```
You might believe that what you've done here is define a schema that constrains the data you're allowed to use with the `:livesIn` property - but that is not true. Someone could come along later and add some nonsense like this...
```turtle
:trumpet a :MusicalInstrument .
:sky a :GeographicFeature .

:sky :livesIn :trumpet .
```
Nonsense or not, this new data is completely valid under RDFS! A reasoner would infer that `:sky` must be a `foaf:Person` and `:trumpet` must be a `schema:Country`, because that's exactly what you've asserted in your statements. Without abusing the intended meaning of the axioms provided by `rdfs`, the vocab gives us neither the ability to validate our data nor the ability to define an interface onto it.
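By contrast, a SHACL property shape can state the same intent as an actual constraint. Here's a minimal sketch (the shape name is purely illustrative, and the relevant prefixes are assumed to be declared):

```turtle
ex:LivesInShape
  a sh:NodeShape ;
  sh:targetSubjectsOf :livesIn ;    # applies to anything with a :livesIn value
  sh:class foaf:Person ;            # the subject must actually be a foaf:Person
  sh:property [
    sh:path :livesIn ;
    sh:class schema:Country ;       # and the value must actually be a schema:Country
  ] .
```

With this shape in place, `:sky :livesIn :trumpet` would be reported as a violation rather than quietly accommodated by inference.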
### SHACL vs OWL
When it comes to `owl`, there are things you can assert that can make your data invalid, and there are certain axioms within the `owl` vocabulary that sound like they are concerned with validation (things like `owl:Restriction`, `owl:oneOf` and `owl:maxCardinality`). There is a subtle distinction, though, between why `owl` reasoning might tell you that your data is invalid and what you are actually trying to achieve. Take a look at the following example:
```turtle
:FlowerPotMan
  a owl:Class ;
  owl:oneOf (:Bill :Ben) .

[ a owl:AllDifferent ;
  owl:distinctMembers (:Bill :Ben :Bob)
] .
```
There are several assertions being made here.
- The first set of assertions tell us which individuals make up the class of `:FlowerPotMan`, and also that these are the only members of that class.
- The second set of assertions tell us that `:Bill`, `:Ben` and `:Bob` represent different individuals (i.e. they are not just different IRIs for the same individual).
So does this mean we've defined the interface onto this data? We've certainly put our data into a place where it could become invalid, for example, if we were to make the following additional assertion...
```turtle
:Bob a :FlowerPotMan .
```
Enough information now exists for a reasoner to tell you that your data is invalid or, more specifically, that your data cannot possibly be true (the class can only contain `:Bill` and `:Ben`, yet `:Bob` has been declared distinct from both) - and herein lies the subtle distinction in what `owl` means when it says your data is invalid. You could make the following assertion:
```turtle
:Fred a :FlowerPotMan .
```
A reasoner would not consider this to be invalid, because `owl` does not make the Unique Name Assumption - it might be that `:Fred` is just another IRI representing the same thing as `:Bill` or `:Ben`. Something being invalid according to `owl` has a fundamentally different meaning to something being invalid due to incorrect interface usage or other such data integrity matters.
Think of it like this: if `owl` says something is invalid, it's because there's enough information to infer that what's been asserted cannot possibly be true, rather than because "it doesn't follow the correct interface".
In addition, `owl` also has a very limited vocabulary when it comes to specifying the kinds of things people normally want to specify for data validation, such as the presence of particular fields or the length and type of particular values. Even if you played fast and loose with the `owl` specification, and introduced a bunch of custom processing to twist it to match your use case, you'd still end up with something much less expressive and much more verbose and difficult to read than if you'd just used `shacl` in the first place.
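For comparison, the sorts of checks mentioned above (presence, type and length of a value) are each a single triple in SHACL - a brief sketch, re-using the person example from earlier (the 100-character limit is purely illustrative):

```turtle
ex:PersonShape
  sh:property [
    sh:path schema:givenName ;
    sh:minCount 1 ;            # the field must be present
    sh:datatype xsd:string ;   # it must be a string
    sh:maxLength 100 ;         # and no longer than 100 characters
  ] .
```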
## SHACL vs SPARQL
SPARQL, the query language standard for interrogating RDF graphs, shares a lot of similarities with SHACL, in particular both of them work by defining patterns to find within graphs (SPARQL calls these patterns, SHACL calls these shapes). The SHACL spec even includes example SPARQL implementations for every built-in constraint, meaning that everything you can express with SHACL can also be expressed in SPARQL (these examples are provided for implementors of SHACL processors).
The relationship between SPARQL and SHACL also goes a lot further - in fact, there are mechanisms outside of the SHACL-CORE specification that allow for the creation of SPARQL-based Constraints. The idea here is that you can create brand new, custom constraints to serve novel purposes.
Here's an example of creating a SPARQL-based constraint to validate that particular resources must be asserted to be wearing a hat in order to pass validation.
```turtle
ex:PersonShape sh:sparql [
  sh:message "Person must be wearing a hat." ;
  # Every solution returned by the SELECT query is reported as a violation,
  # so we select the focus nodes ($this) that do NOT have a wearsHat value.
  sh:select """
    SELECT $this
    WHERE {
      FILTER NOT EXISTS { $this <http://example.org/ns#wearsHat> ?hat }
    }
  """ ;
] .
```
The ability to use and ship SPARQL directly in your shapes graph means the sky is the limit in terms of what you can validate against - if you can see the patterns in your graph, you can validate against them!
The only caveat is that, as mentioned above, SPARQL-based constraints are not part of SHACL-CORE, meaning they might not be supported by every processor - so if you're planning to use this feature, it will affect your choice of SHACL processor.
## When to use SHACL
As with all technology choices, when to use `shacl` really comes down to what it is you're actually trying to do. There is an argument to be made that the use of `shacl` cuts you off from many of the advantages of modern knowledge graphs and complex data ecosystems, which must be built with flexibility and uncertainty as core design principles. SHACL could be seen as bringing the brittle, constrained nature of existing data systems into an ecosystem that philosophically opposes the idea of constraining data.
The flip-side of this argument is that when it comes time to actually consume and use real world data for a particular purpose, it's an inescapable reality that you almost always need to make assurances that your data is all present and correct.
It's not really a case of either/or, but in traditional information systems it's more likely you'll find yourself reaching for the features provided by `shacl` over those provided by `owl`.