The Shape of Data - DataScienceCentral.com

Role-playing dice: what shape does your data take?

In the last few years, a very interesting concept has emerged from the realm of semantics: data shapes. This idea has been touched upon elsewhere, in constraint modeling with Schematron in XML (much of which was later absorbed into the XSD 1.1 specification) and even, to a certain extent, in OWL. However, shapes have for the most part emerged in conjunction with ShEx (Shape Expressions), one of the inputs to the work that would eventually become the SHACL specification of 2017.

OWL emerged first in the RDF firmament as a way of performing logical operations on classes and properties. OWL made use of blank nodes following certain patterns to act as variables that could then be used with RDF assertions to create other triples, a process called inference, with the logic handling these relationships known as rules. By 2007, an alternative approach was developed called SPARQL, with a syntax based partially upon another language called Turtle (the Terse RDF Triple Language), which had been around for some years but wasn’t fully standardized until early 2014. The second version of SPARQL (1.1) became a W3C Recommendation shortly before that, in 2013, solving a number of issues with the first version and adding a more formal SPARQL Update language.

Standardizing on Shapes

SHACL (the Shapes Constraint Language) originally started out as a way to align the schema definitions used in XML with the structure of RDF. As such, a lot of the core language focused primarily upon things like defining class structures and property structures, identifying constraints and defaults, and reporting when given instances failed to conform to these shapes. Yet somewhere along the line something fundamental emerged: the notion that maybe, just maybe, classes and properties were not in fact the only way of looking at data.

For instance, consider a description of a cat in Turtle.

PREFIX Pet: <http://www.example.com/ns/Pet/>
PREFIX Cat: <http://www.example.com/ns/Cat/>
PREFIX CatBreed: <http://www.example.com/ns/CatBreed/>
PREFIX Class: <http://www.example.com/ns/Class/>
PREFIX Person: <http://www.example.com/ns/Person/>
PREFIX Gender: <http://www.example.com/ns/Gender/>
PREFIX Units: <http://www.example.com/ns/Units/>

Cat:_BrightEyes a Class:_Cat;
       Cat:hasBreed  CatBreed:_RussianBlue;
       Pet:hasAge  "5"^^Units:_Years;
       Pet:hasOwner  Person:_JaneDoe;
       Pet:hasGender   Gender:_NeuteredFemale;
        .

This is admittedly a very simple example, intended primarily to show off what SHACL would look like given this:

PREFIX Shape: <http://www.example.com/ns/shape#>
PREFIX Class: <http://www.example.com/ns/Class/>
PREFIX Pet: <http://www.example.com/ns/Pet/>
PREFIX Cat: <http://www.example.com/ns/Cat/>
PREFIX Dog: <http://www.example.com/ns/Dog/>
PREFIX Person: <http://www.example.com/ns/Person/>
PREFIX Gender: <http://www.example.com/ns/Gender/>
PREFIX CatBreed: <http://www.example.com/ns/CatBreed/>
PREFIX Units: <http://www.example.com/ns/Units/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sh:   <http://www.w3.org/ns/shacl#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

# Instance data from the previous listing, repeated for reference:
#Cat:_BrightEyes a Class:_Cat;
#       Cat:hasBreed  CatBreed:_RussianBlue;
#       Pet:hasAge  "5"^^Units:_Years;
#       Pet:hasOwner  Person:_JaneDoe;
#       Pet:hasGender   Gender:_NeuteredFemale;
#        .

# A generic pet: optional age, optional owners, exactly one gender.
Shape:_Pet a sh:NodeShape;
    sh:targetClass Class:_Pet;
    sh:property [
        sh:path Pet:hasAge;
        sh:nodeKind sh:Literal;
        sh:datatype Units:_Years;
        sh:minCount "0"^^xsd:integer;
        sh:maxCount "1"^^xsd:integer;
    ];
    sh:property [
        sh:path Pet:hasOwner;
        sh:nodeKind sh:IRI;
        # sh:class (not sh:target) requires values to be instances of the class.
        sh:class Class:_Person;
        sh:minCount "0"^^xsd:integer
    ];
    sh:property [
        sh:path Pet:hasGender;
        sh:nodeKind sh:IRI;
        sh:class Class:_Gender;
        sh:minCount "1"^^xsd:integer;
        sh:maxCount "1"^^xsd:integer
    ].

# A cat: at most one breed, drawn from the cat breeds.
Shape:_Cat a sh:NodeShape;
    sh:targetClass Class:_Cat;
    sh:property [
        sh:path Cat:hasBreed;
        sh:nodeKind sh:IRI;
        sh:class Class:_CatBreed;
        sh:minCount "0"^^xsd:integer;
        sh:maxCount "1"^^xsd:integer;
    ].

As a brief explanation, the SHACL given here is a simple schema spread across two different shapes. The first shape defines a very simple generic Pet: a creature that has an age (measured in years), a gender, and an owner. The second shape states that a Cat has a cat breed, a property specific to cats rather than to pets in general. Each node shape is bound to its respective class through sh:targetClass, but what is not considered here anywhere is how a Cat and a Pet are related.
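To make these constraints concrete, here is a minimal sketch of instance data that would not conform to the Pet shape. The node Dog:_Rex and its values are hypothetical and not part of the original example, and the sketch assumes the shapes and the data use the same namespace IRIs.

```turtle
PREFIX Class: <http://www.example.com/ns/Class/>
PREFIX Pet:   <http://www.example.com/ns/Pet/>
PREFIX Dog:   <http://www.example.com/ns/Dog/>
PREFIX Units: <http://www.example.com/ns/Units/>

# Hypothetical pet, typed directly as Class:_Pet so that Shape:_Pet targets it.
Dog:_Rex a Class:_Pet ;
    # Two ages: breaks the sh:maxCount of 1 on Pet:hasAge.
    Pet:hasAge "3"^^Units:_Years , "4"^^Units:_Years .
# No Pet:hasGender at all: breaks the sh:minCount of 1 on Pet:hasGender.
# No Pet:hasOwner either, which is fine because its sh:minCount is 0.
```

A validator should flag this node twice, once per broken constraint, while the missing owner produces nothing.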

SHACLs constrain and validate, not necessarily organize

SHACL, however, is a little different from a schema language such as RDFS. It does describe at least some RDFS-like relationships (for instance, you can sort of see path relationships that would be roughly analogous to the rdfs:domain and rdfs:range properties). But that’s a little deceptive, as a path here can specify a full property path, not necessarily just a single hop. For instance,

# Equivalent to the SPARQL property path skos:narrower+
sh:path [ sh:oneOrMorePath skos:narrower ];

indicates that the path follows the property one or more times, so the shape’s constraints apply to terms at an arbitrary depth away from the reference node. This is especially useful when dealing with structures like RDF lists that largely traverse blank nodes. Similarly, you can add conditionals such as

Shape:_IsNeuteredCat a sh:NodeShape;
    sh:property [
        sh:path rdf:type;
        sh:nodeKind sh:IRI;
        sh:hasValue Class:_Cat;
    ];
    sh:property [
        sh:path Pet:hasGender;
        sh:nodeKind sh:IRI;
        sh:class Class:_Gender;
        # Turtle lists are whitespace-separated, with no commas.
        sh:in (Gender:_NeuteredMale Gender:_NeuteredFemale);
        sh:minCount "1"^^xsd:integer;
        sh:maxCount "1"^^xsd:integer
    ].

This might come up if the hasGender property is required to be either “neutered female” or “neutered male”. You might see this, for instance, when filling out an application for a house loan. This differs from a schema. A shape need only have enough information to identify a particular pattern (it’s a cat). Once that pattern is identified, a property within it (gender) is then restricted to specific values. You might also have multiple shapes that apply to the same class, as is shown below.
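One caveat is worth a sketch: as written, Shape:_IsNeuteredCat declares no target, so a validator would not apply it to any node on its own. One option, assumed here, is to add an explicit sh:targetNode. The names Cat:_Tom and Gender:_Male are hypothetical, introduced only for illustration, and the shapes and data are shown in one listing for brevity.

```turtle
PREFIX Shape:  <http://www.example.com/ns/shape#>
PREFIX Class:  <http://www.example.com/ns/Class/>
PREFIX Cat:    <http://www.example.com/ns/Cat/>
PREFIX Pet:    <http://www.example.com/ns/Pet/>
PREFIX Gender: <http://www.example.com/ns/Gender/>
PREFIX sh:     <http://www.w3.org/ns/shacl#>

# Shapes graph: point the shape at the nodes to be checked (assumed addition).
Shape:_IsNeuteredCat sh:targetNode Cat:_BrightEyes , Cat:_Tom .

# Data graph: the gender values are typed so a class check on them can pass.
Gender:_NeuteredFemale a Class:_Gender .
Gender:_Male a Class:_Gender .

# Expected to conform: a cat whose single gender is in the allowed list.
Cat:_BrightEyes a Class:_Cat ;
    Pet:hasGender Gender:_NeuteredFemale .

# Expected to fail sh:in: Gender:_Male is not one of the listed values.
Cat:_Tom a Class:_Cat ;
    Pet:hasGender Gender:_Male .
```

Note that the rdf:type check inside the shape is itself a constraint, not a filter: a targeted node that is not a cat would simply fail it.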

Shapes vs. Classes

Notice that this doesn’t really change the definition of classness. That is to say, I can’t assert within SHACL that a cat is a subclass of a pet, i.e.,

Class:_Cat rdfs:subClassOf Class:_Pet.

Instead, shapes generally describe constraints, rather than formal logical assertions.
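The subclass triple still matters to validation, though. Under the SHACL specification, sh:targetClass also selects instances of subclasses, following any rdfs:subClassOf triples found in the data graph. A minimal sketch, assuming the same namespace IRIs as the earlier data listing:

```turtle
PREFIX Class:  <http://www.example.com/ns/Class/>
PREFIX Cat:    <http://www.example.com/ns/Cat/>
PREFIX Pet:    <http://www.example.com/ns/Pet/>
PREFIX Gender: <http://www.example.com/ns/Gender/>
PREFIX rdfs:   <http://www.w3.org/2000/01/rdf-schema#>

# Data graph: the relationship is stated in RDFS, not in SHACL.
Class:_Cat rdfs:subClassOf Class:_Pet .

# Typed only as a cat, yet a shape with sh:targetClass Class:_Pet
# selects this node as well, because of the triple above.
Cat:_BrightEyes a Class:_Cat ;
    Pet:hasGender Gender:_NeuteredFemale .
```

Without the subclass triple, only a shape targeting Class:_Cat would pick up Cat:_BrightEyes.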

The other distinction between a class and a shape is simple. A class is a label for a set of attributes and constraints that have been satisfied. A shape, on the other hand, lets you identify the class of an object if it satisfies these constraints. Just as importantly, the shape will tell you why something fails to be in that class, and knowing why something fails to comply gives you more insight than classification alone.

Again (as a very simple example), let’s say that you wanted to test a node against a SHACL pattern. The following shows how this might look:

#Namespaces here
Shape:_IsNeuteredCat a sh:NodeShape;
    sh:property [
        sh:path rdf:type;
        sh:nodeKind sh:IRI;
        sh:hasValue Class:_Cat;
    ];
    # First check on Pet:hasGender: the value must be one of the listed genders.
    sh:property [
        sh:path Pet:hasGender;
        sh:nodeKind sh:IRI;
        sh:class Class:_Gender;
        sh:in (Gender:_NeuteredMale Gender:_NeuteredFemale);
        sh:message "The referenced pet was not one of neutered male or neutered female.";
    ];
    # Second check on the same property: exactly one value must be present.
    sh:property [
        sh:path Pet:hasGender;
        sh:nodeKind sh:IRI;
        sh:class Class:_Gender;
        sh:minCount "1"^^xsd:integer;
        sh:maxCount "1"^^xsd:integer;
        sh:message "The gender must be specified once and only once.";
    ].

This example actually illustrates how versatile SHACL can be. The same property has two different sections. The first passes a message to the report if the cat is NOT neutered. The second, on the other hand, sends a message to the report if there isn’t one and only one hasGender property. Both apply to the same property, but they capture two very different conditions.
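For a sense of where those messages end up, here is a rough sketch of the validation report a SHACL processor might produce for a hypothetical cat, Cat:_Tom, whose gender is Gender:_Male (both names are illustrative assumptions). The property names come from the SHACL vocabulary; the exact layout and level of detail vary by processor.

```turtle
PREFIX sh:     <http://www.w3.org/ns/shacl#>
PREFIX Cat:    <http://www.example.com/ns/Cat/>
PREFIX Pet:    <http://www.example.com/ns/Pet/>
PREFIX Gender: <http://www.example.com/ns/Gender/>

[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [
        a sh:ValidationResult ;
        sh:resultSeverity sh:Violation ;
        sh:focusNode Cat:_Tom ;              # the node that was checked
        sh:resultPath Pet:hasGender ;        # the property that failed
        sh:value Gender:_Male ;              # the offending value
        sh:sourceConstraintComponent sh:InConstraintComponent ;
        # sh:sourceShape is left out of this sketch; it would point at the
        # blank-node property shape that carries the sh:in constraint.
        sh:resultMessage "The referenced pet was not one of neutered male or neutered female."
    ] .
```

The sh:sourceConstraintComponent value is what distinguishes the two checks on the same property: a cardinality failure would instead cite sh:MinCountConstraintComponent or sh:MaxCountConstraintComponent and carry the second message.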

Summary

In the next article in this series, I hope to explore the reporting side of SHACL in greater depth. Additionally, I’ll show how to actually use SHACL for validation and reporting both in SPARQL and externally.