Why JSON Users Should Learn Turtle - DataScienceCentral.com


The Semantic Web has garnered a reputation for complexity among both JavaScript and Python developers, primarily because, well, it’s not JSON, and JSON has become the data language of the web. Why learn some obscure language when JSON is perfectly capable of describing everything, right?

Well, sort of. The problem that JSON faces is a subtle one, and has to do with the distinction between something occurring by value rather than by reference. Let’s say that you have an education setting involving three courses and two teachers, where the two teachers co-teach one of the courses. This is a classic data modeling problem, and may start out looking something like this:

{"course":[{
       "id":"DM101",
       "title":"Data Modeling Fundamentals",
       "teacher":{
             "id":"JD5182",
             "name":"Jane Doe"
              }
        },
       {
       "id":"DM102",
       "title":"Advanced Data Modeling",
       "teacher":{
             "id":"DM103",
             "name":"David Myers"
              }
        },
       {
       "id":"DM103",
       "title":"Semantic Data Modeling",
       "teacher":[{
             "id":"JD5182",
             "name":"Jane Doe"
              },{
             "id":"DM103",
             "name":"Dvid Meyers"
              }]
        }]
}

The description is straightforward until you get to the very last teacher entry. There are two problems here, though it may look like there’s only one. The first is that you have a record that looks like it has a typo, with the teacher name “David Myers” being replaced with “Dvid Meyers”. Now, this may look like a simple transcription error (and it probably is), but from the standpoint of the JSON, you now have two records that describe the same teacher and disagree with each other, which illustrates why storing the same thing twice is a bad idea. In a normal hierarchical JSON file this problem may be multiplied several times, because data usually tends to be self-referential. In other words, it forms a graph. Throw students into this mix, and a straight tree can get complex very fast.

The second problem is more subtle. By sheer happenstance, the keys used for identifying courses (all in the DM space) collide with the convention used for teacher identifiers. Is DM103 in this case David Myers (or Dvid Meyers), or is it Semantic Data Modeling? In a database, this kind of overlap can happen all the time without causing trouble, because each table keeps its identifiers under its own distinct primary key. Once you serialize this information, however, you lose that context, and ambiguity creeps into the JSON.
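To make both problems concrete, here is a minimal Python sketch that walks a nested document and reports every id whose copies disagree. The data is a trimmed copy of the example above, and the helper names (collect_records, find_conflicts) are hypothetical rather than part of any library.

```python
import json
from collections import defaultdict

# Trimmed copy of the course data above, including the misspelled name.
doc = json.loads("""
{"course": [
  {"id": "DM102", "title": "Advanced Data Modeling",
   "teacher": {"id": "DM103", "name": "David Myers"}},
  {"id": "DM103", "title": "Semantic Data Modeling",
   "teacher": [{"id": "JD5182", "name": "Jane Doe"},
               {"id": "DM103", "name": "Dvid Meyers"}]}
]}
""")

def collect_records(node, found):
    """Walk the tree and group every object that carries an "id" by that id."""
    if isinstance(node, dict):
        if "id" in node:
            # Compare scalar fields only, so nested objects do not interfere.
            scalars = {k: v for k, v in node.items()
                       if not isinstance(v, (dict, list))}
            found[node["id"]].append(scalars)
        for value in node.values():
            collect_records(value, found)
    elif isinstance(node, list):
        for item in node:
            collect_records(item, found)
    return found

def find_conflicts(document):
    """Return the ids whose copies are not all identical."""
    records = collect_records(document, defaultdict(list))
    return {key: copies for key, copies in records.items()
            if any(copy != copies[0] for copy in copies[1:])}

conflicts = find_conflicts(doc)
for key, copies in conflicts.items():
    print(key, copies)
```

Only DM103 is reported, and it is reported with three copies: the two spellings of the teacher and the course that happens to share the key. Nothing in the plain JSON tells the code that one of those three is a different kind of thing.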

From JSON to JSON-LD

JSON-LD is JSON, but has a few additional rules that help to resolve the by-reference problems. It does this in great part by representing data in normal form, which can be thought of as a form in which every distinct object is defined once and only once, but then may be referenced multiple times. Specifically:

A @context property is defined that makes it possible to put data into different namespaces, which can be thought of as discrete buckets of properties based primarily upon functional type.
Globally unique identifiers (URIs) are used to identify concepts and entities.
Datatype associations can be made with atomic data types.

Other items can be referenced from within JSON-LD, so long as they have a global unique identifier. This differs from most modern SQL databases, which assume that referential integrity must be maintained within the scope of the database.
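A minimal Python sketch of the "defined once, referenced many times" idea follows. The base IRI is an assumption borrowed from the university.org example that follows, and global_id and registry are hypothetical names used only for illustration.

```python
# Assumed base IRI, borrowed from the university.org example that follows.
BASE = "https://www.university.org/"

def global_id(kind, local_id):
    """Turn a local key into a globally unique IRI by adding its namespace."""
    return f"{BASE}{kind}/{local_id}"

# Each object is defined exactly once, keyed by its global identifier.
registry = {
    global_id("instructor", "DM103"): {"name": "David Myers"},
    global_id("course", "DM103"): {
        "title": "Semantic Data Modeling",
        # A reference to the instructor record, not a second copy of it.
        "teacher": [global_id("instructor", "DM103")],
    },
}

course = registry[global_id("course", "DM103")]
teacher_names = [registry[ref]["name"] for ref in course["teacher"]]
print(teacher_names)
```

Because the course and the instructor now live under different IRIs, the local key DM103 is no longer ambiguous, and a misspelled name can only exist in one place.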

This example is a bit more complex, assuming that each named course may have one or more course instances (i.e., sessions), with two sessions in DM101 and one session in DM102. This also differs from the previous in that it is making use of the schema.org language for structure.


{
  "@context": "https://schema.org/",
"@graph":[{
  "@type": "Course",
 "@id":"https://www.university.org/course/DM101",
  "name": "Data Modeling Fundamentals",
  "courseCode": "DM101",
  "description": "This course looks at the fundamentals of designing and building data structures.",
  "hasCourseInstance": [
      {
      "@type": "CourseInstance",
      "@id":"https://www.university.org/courseInstance/DM101a",
      "identifier":"DM101a",
      "instructor":{"@id":"https://www.university.org/instructor/JD5182"},
      "courseMode": "part-time",
      "endDate": "2017-06-21",
      "location": "St Brycedale Campus Kirkcaldy",
      "startDate": "2016-08-31"
    },
    {
      "@type": "CourseInstance",
      "@id":"https://www.university.org/courseInstance/DM101b",
      "identifier":"DM101b",
      "instructor":{"@id":"https://www.university.org/instructor/DM103"},
       "courseMode": "full-time",
      "endDate": "2017-06-23",
      "location": "Halbeath Campus Dunfermline",
      "startDate": "2016-08-29"
      }
    ]
  },
{
  "@type": "Course",
 "@id":"https://www.university.org/course/DM102",
  "name": "Advanced Data Modeling",
  "courseCode": "DM102",
  "description": "This course looks at advanced data modeling techniques.",
  "hasCourseInstance": [
      {
      "@type": "CourseInstance",
      "@id":"https://www.university.org/courseInstance/DM102a",
      "identifier":"DM102a",
      "instructor":{"@id":"https://www.university.org/instructor/DM103"},
      "courseMode": "part-time",
      "endDate": "2017-06-21",
      "location": "St Brycedale Campus Kirkcaldy",
      "startDate": "2016-08-31"
    }]},
{
  "@type": "Person",
  "@id": "https://www.university.org/instructor/JD5182",
  "name": "Jane Doe, PhD",
  "jobTitle": "Associate Professor, Computer Science"
},
{
  "@type": "Person",
  "@id": "https://www.university.org/instructor/DM103",
  "name": "David Myers, PhD",
  "jobTitle": "Full Professor, Computer Science"
}
]}

What differentiates this second example is that every course, every instance, and every instructor has an associated identifier (given by @id) and type (given by @type). The type identifies the structure, and from that the valid property names, while the @id is the primary key that universally identifies the resource in question. Note that there is a distinction between an @id, which is a global identifier, and “identifier”, which is a conventional or local identifier that isn’t likely to be unique (as the DM103 confusion illustrates).
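The role of @id as a primary key can be sketched in a few lines of Python using only the standard library. The document below is a trimmed, hypothetical subset of the graph above, and instructor_names is an illustrative helper, not a JSON-LD API.

```python
import json

# A trimmed, hypothetical subset of the @graph shown above.
doc = json.loads("""
{
  "@context": "https://schema.org/",
  "@graph": [
    {"@type": "Course",
     "@id": "https://www.university.org/course/DM101",
     "name": "Data Modeling Fundamentals",
     "hasCourseInstance": [
       {"@type": "CourseInstance",
        "@id": "https://www.university.org/courseInstance/DM101a",
        "identifier": "DM101a",
        "instructor": {"@id": "https://www.university.org/instructor/JD5182"}}
     ]},
    {"@type": "Person",
     "@id": "https://www.university.org/instructor/JD5182",
     "name": "Jane Doe, PhD"}
  ]
}
""")

# Index every top-level node by its global identifier.
index = {node["@id"]: node for node in doc["@graph"]}

def instructor_names(course_iri):
    """Follow each instance's instructor reference to the single Person record."""
    names = []
    for instance in index[course_iri].get("hasCourseInstance", []):
        ref = instance["instructor"]
        # Accept either {"@id": "..."} or a bare IRI string.
        iri = ref["@id"] if isinstance(ref, dict) else ref
        names.append(index[iri]["name"])
    return names

names = instructor_names("https://www.university.org/course/DM101")
print(names)
```

The instructor's name is stored once, on the Person node, and every course instance reaches it through the same identifier.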

There are a number of different ways that you can build JSON-LD statements, though they also have specific design implications. The importance of the example here, however, is that what is being defined is a graph of multiple objects, not just a simple tree. Sometimes the Turtle language is superior for showing the relationships involved because of that.

From JSON-LD to Turtle

JSON-LD is RDF. That is to say, if you write valid JSON-LD, it can be rendered as an RDF graph. However, JSON-LD does still have a fair amount of ambiguity to it, and sometimes, especially when graphs get complex, it can be hard to follow.

There were some inklings of Turtle by 2007 when SPARQL was being formulated. In many respects, Turtle made SPARQL possible, but the Turtle specification itself wasn’t truly finalized until 2014 (and it can be argued that it’s ready for revision at this point given advances in areas such as RDF-Star). As such, Turtle is a relatively young language, younger than JSON, which was first specified in the early 2000s.

The previous example can be rendered as Turtle, which might provide some insights into how useful the language could be.

@prefix schema: <http://schema.org/> .
@prefix course: <https://www.university.org/course/> .
@prefix courseInstance: <https://www.university.org/courseInstance/> .
@prefix person: <https://www.university.org/instructor/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# The default (empty) prefix is also bound to the schema.org namespace.
@prefix : <http://schema.org/> .

course:DM101
        a :Course;
        :courseCode "DM101"^^xsd:string;
        :name "Data Modeling Fundamentals"^^xsd:string;
        :description "This course looks at the fundamentals of designing and building data structures."^^xsd:string;
        :hasCourseInstance courseInstance:DM101a, courseInstance:DM101b;
        .
course:DM102
        a :Course;
        :courseCode "DM102"^^xsd:string;
        :name "Advanced Data Modeling"^^xsd:string;
        :description "This course looks at advanced data modeling techniques."^^xsd:string;
        :hasCourseInstance courseInstance:DM102a;
        .
courseInstance:DM101a
        a :CourseInstance;
        :identifier "DM101a"^^xsd:string;
        :instructor person:JD5182;
        :courseMode "part-time";
        :startDate "2016-08-31"^^xsd:date;
        :endDate "2017-06-21"^^xsd:date;
        :location "St Brycedale Campus Kirkcaldy"^^xsd:string;
        .
courseInstance:DM101b
        a :CourseInstance;
        :identifier "DM101b"^^xsd:string;
        :instructor person:DM103;
        :courseMode "full-time";
        :startDate "2016-08-29"^^xsd:date;
        :endDate "2017-06-23"^^xsd:date;
        :location "Halbeath Campus Dunfermline"^^xsd:string;
        .
courseInstance:DM102a
        a :CourseInstance;
        :identifier "DM102a"^^xsd:string;
        :instructor person:DM103;
        :courseMode "part-time";
        :startDate "2016-08-31"^^xsd:date;
        :endDate "2017-06-21"^^xsd:date;
        :location "St Brycedale Campus Kirkcaldy"^^xsd:string;
        .
person:JD5182
        a :Person;
        :name "Jane Doe, PhD";
        :jobTitle "Associate Professor, Computer Science";
        .
person:DM103
        a :Person;
        :name "David Myers, PhD";
        :jobTitle "Full Professor, Computer Science";
        .

The Turtle structures shown in this example are fully normalized. Put another way, with one (fairly important) exception, you could think of each entry (course, courseInstance, and person) as being entries in their respective Course, CourseInstance, and Person tables. The one big distinction is that, unlike with SQL, you can have more than one outbound link associated with any given record: course:DM101, for instance, points to two course instances through the same property.
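One way to picture this is as a set of (subject, predicate, object) statements. The Python sketch below is an illustration only: the prefixed names are kept as plain strings modeled on the Turtle above, nothing is parsed, and objects is a hypothetical helper.

```python
# Statements written as (subject, predicate, object) tuples. Prefixed names
# are kept as plain strings; this is an illustration, not a Turtle parser.
triples = {
    ("course:DM101", ":courseCode", "DM101"),
    ("course:DM101", ":hasCourseInstance", "courseInstance:DM101a"),
    ("course:DM101", ":hasCourseInstance", "courseInstance:DM101b"),
    ("courseInstance:DM101a", ":instructor", "person:JD5182"),
}

def objects(subject, predicate):
    """Return every object linked from a subject through one predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)

instances = objects("course:DM101", ":hasCourseInstance")
print(instances)
```

A single-valued property and a multi-valued one look exactly the same here; the second course instance is just one more statement, with no extra join table required.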

The example above has also set up a default namespace, in this case the schema.org namespace. Schema.org is a very rich ontology that can describe a surprisingly broad number of categories, including educational elements, making it a useful ontology for data interchange. If you have your own custom namespace, you can use a similar approach to cut down on the amount of coding that’s needed.

A visual image of a subset of the RDF can provide some insight into the structure.

[Figure: graph of a subset of the RDF, showing courses, course instances, and instructors]

In case you are wondering, this particular graphic is generated in WordPress via DOT, rather than being created in an external application. The DOT application file is noteworthy because it is structurally very similar to the RDF.


digraph G {
node [fontname="Arial" fontcolor="white" style="filled" fontsize="11pt"]
node [fillcolor="red"]
"course:DM101" [label="<course>\nDM101"]
"course:DM102" [label="<course>\nDM102"]
node [style="filled" fontcolor="white" fillcolor="purple"]
"courseInstance:DM101a" [label="<courseInstance>\nDM 101a"]
"courseInstance:DM101b" [label="<courseInstance>\nDM 101b"]
"courseInstance:DM102a" [label="<courseInstance>\nDM 102a"]
node [style="filled" fontcolor="white" fillcolor="blue"]
"person:JD5182" [label="<person>\nJane Doe, PhD"]
"person:DM103" [label="<person>\nDavid Myers, PhD"]
edge [fontsize="9pt" fontname="Arial" ]
"course:DM101" -> "courseInstance:DM101a" [label=":hasCourseInstance"]
"course:DM101" -> "courseInstance:DM101b" [label=":hasCourseInstance"]
"course:DM102" -> "courseInstance:DM102a" [label=":hasCourseInstance"]
"courseInstance:DM101a" -> "person:JD5182" [label=":instructor"]
"courseInstance:DM101b" -> "person:DM103" [label=":instructor"]
"courseInstance:DM102a" -> "person:DM103" [label=":instructor"]
}
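The structural similarity is easy to see in code. The following Python sketch, which assumes the same hypothetical tuple representation of statements rather than any real RDF library, emits one DOT edge per statement.

```python
# A few resource-to-resource statements as (subject, predicate, object) tuples.
triples = [
    ("course:DM101", ":hasCourseInstance", "courseInstance:DM101a"),
    ("course:DM101", ":hasCourseInstance", "courseInstance:DM101b"),
    ("courseInstance:DM101a", ":instructor", "person:JD5182"),
]

def to_dot(statements):
    """Render each statement as a labeled edge in a DOT digraph."""
    lines = ["digraph G {"]
    for subject, predicate, obj in statements:
        lines.append(f'"{subject}" -> "{obj}" [label="{predicate}"]')
    lines.append("}")
    return "\n".join(lines)

dot = to_dot(triples)
print(dot)
```

Each statement maps to exactly one edge line, which is why the DOT source reads so much like the Turtle it was derived from. Styling (colors, fonts, node labels) is left out of this sketch.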

Making the Case for Turtle

So why use Turtle? There are in fact a number of very valid reasons.

Turtle is the Language of the Semantic Web. There have been, over the years, a fair number of different representations of semantics, some very minimal (N-Triples), some tied into the RDFS model (RDF/XML), some using different functional representations such as Manchester notation. Yet none of them has achieved the adoption of Turtle within this space by those who have focused on using RDF as a descriptive framework.
Turtle Underlies SPARQL, SHACL and other languages. The Turtle language informed the shape of SPARQL, SHACL and many other RDF stack languages, and as such has become something of a unifying linguistic design.
Turtle is Human Readable. Turtle is about as human readable as a data serialization format can be while still being terse (indeed, it is more compact than JSON while simultaneously being more precise). It should be noted that the terseness of JSON compared with XML was one of the primary reasons that developers walked away from the XML toolset in favor of JSON. However, Turtle is even terser, while still being more expressive than JSON.
Turtle Is A True Streaming Language. JSON ironically suffers much the same issue with regard to streaming that XML does: when you transport JSON, you have to parse it again before you can do anything with it. If the JSON structure is large, you have to wait until you reach a natural breakpoint (the end of an array or object definition). Turtle, on the other hand, is normalized, which means that the functional parsing boundary for Turtle typically is far smaller than it is for JSON. This in turn translates into the ability to load content into the database considerably faster, and in many cases to perform validation against existing data, meaning that Turtle is a natural language for master data management.
RDF automatically deduplicates. JSON, because it is a denormalized structure, implicitly requires that duplication exists. RDF in general, and Turtle in particular, are representations of content that will end up in a triple store index, and the index need only make an assertion about a given statement once. This makes queries very fast, considerably faster than can be achieved with JSON, especially when powered by Graphics Processing Units (GPUs) for parallel processing. JSON gains no real advantage when parallel processing is involved.
Turtle is unambiguous, JSON is not. JSON-LD emerged as a standard to work with the semantic web, but it currently has four different profiles, is surprisingly difficult to write well for larger data structures, and as often as not is rejected primarily because it fails to conform to any of the profiles. Turtle has two profiles that have very clear syntax (Turtle and Turtle-Star), with Turtle-Star being only slightly less clear because of a few edge cases that haven’t been fully resolved. Additionally, when people refer to Turtle, they are usually actually talking about TriG, which extends the Turtle language to incorporate named graphs (TriG is backwards compatible with Turtle).
Semantics Matter. JSON requires local knowledge to be useful. Turtle does not, because there are in general enough core conventions in place (RDF, RDFS and perhaps minimal OWL) that meaning can be extracted very quickly through the application of known rules, even without deep contextual knowledge. Because Turtle is built around namespaces, those namespaces can also provide additional bootstrapped metadata to achieve semantic awareness in the data, while JSON requires manual intervention to provide a pale copy of this capability.
Turtle represents complex structures better than JSON. While the Turtle breakdown in the previous section shows a normalized structure, a Turtle parser could reconstruct a fully denormalized JavaScript or Python object or JSON document in any number of ways, depending upon what kind of structure is desired. Both Python and JavaScript have libraries for parsing Turtle as well (check out FrogCat’s ttl2jsonld at https://github.com/frogcat/ttl2jsonld for just one of many examples). It is this dual nature of RDF (as data documents and as normalized data records) that makes RDF so compelling as services become more declarative and end up with less intervention on the part of programmers.
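Two of the points above, deduplication and the ability to rebuild a nested document from normalized statements, can be sketched in plain Python. This is a toy model under stated assumptions: statements are (subject, predicate, object) tuples, the store is an ordinary set rather than a real triple store, and load and denormalize are hypothetical helpers.

```python
def load(store, batch):
    """Add a batch of statements to the store; return how many were new."""
    before = len(store)
    store.update(batch)
    return len(store) - before

def denormalize(store, subject, seen=frozenset()):
    """Rebuild a nested dict for one subject by following its outbound links."""
    if subject in seen:
        return subject  # cycle guard: fall back to the bare identifier
    node = {"@id": subject}
    for s, p, o in sorted(store):
        if s != subject:
            continue
        # Treat the object as a resource if it is the subject of any statement.
        is_resource = any(s2 == o for s2, _, _ in store)
        value = denormalize(store, o, seen | {subject}) if is_resource else o
        node.setdefault(p, []).append(value)
    return node

store = set()
batch_one = [
    ("course:DM102", ":hasCourseInstance", "courseInstance:DM102a"),
    ("courseInstance:DM102a", ":instructor", "person:DM103"),
]
batch_two = [
    ("courseInstance:DM102a", ":instructor", "person:DM103"),  # repeated
    ("person:DM103", ":name", "David Myers, PhD"),
]
added_first = load(store, batch_one)
added_second = load(store, batch_two)
tree = denormalize(store, "course:DM102")
print(added_first, added_second)
print(tree)
```

Each batch can be loaded as soon as it arrives, the repeated statement is absorbed silently, and the nested view is produced only when somebody asks for it. A different caller could start from person:DM103 instead and get a different tree from the same statements.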

RDF Reluctance

I post this knowing that people who work with JSON are likely to be reluctant to explore alternatives when they have something that works for them. There are also a few aspects of Turtle (or RDF) that add to “RDF Reluctance”, and they are worth reviewing, as these often represent a perceived state of reality that is no longer as applicable.

Real JSON Doesn’t Use Namespaces

When JSON rose to prominence in the mid-2000s, one of the big rallying cries behind its adoption was that JSON dispensed with all of those ugly namespaces that XML had hauled around, making code more difficult to work with while seemingly providing little benefit to the developer. The developer knew what the code was supposed to do, and publishing code samples on GitHub should be enough to tell other programmers what they should expect with a fair amount of introspection. In reality, this usually meant that the author of the component knew exactly what interface was required for working with the code, the users of the component would need to spend time studying it, and anyone further out would have to hope that they could reconstruct how something worked from an undocumented data structure.

In reality, a namespace is essentially a contract that tells a user (or a robot) what constitutes valid data and why. The namespace is the set of valid terms in the language being defined, while the namespace URI is simply a pointer or identifier that indicates which namespace is being used. When you have only one object, you can ignore the namespace. When you have multiple composite objects, each written by a different author, failing to indicate the namespace can make any project much more complex.

JSON-LD provides a context object, which reintroduces namespaces into JSON. Most of the time, the context item is more or less ignored. This often means that applications don’t scale when they have to move to the enterprise level where dozens or even hundreds of competing namespaces are in use through thousands of microservices. A decent data catalog managing data across an enterprise makes extensive use of namespaces, but those are generally beyond the reach of the developer in most cases.

Namespaces and namespace prefixes exist to simplify shared code. In JavaScript, namespaces are generally abstracted away into contextual variables. Some namespaces (such as those of schema.org) are global: everything is in the schema: namespace http://schema.org/. This is handy for predicates (properties) because you can then declare this namespace the default, with tokens such as schema:courseCode being expressed instead as :courseCode. Other approaches, more suitable for modeling, treat each class as being its own namespace, which makes for longer names, but less ambiguity, and makes it possible to show modeling antecedents.
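Prefix handling is simple enough to sketch in a few lines of Python. The prefix table below reuses the namespaces from the earlier examples, and expand is a hypothetical helper, not a function from an RDF library.

```python
# Prefix table reusing the namespaces from the earlier examples.
# The empty string stands for the default prefix.
PREFIXES = {
    "": "http://schema.org/",
    "schema": "http://schema.org/",
    "course": "https://www.university.org/course/",
}

def expand(name):
    """Expand a prefixed name such as "course:DM101" into a full IRI."""
    prefix, _, local = name.partition(":")
    return PREFIXES[prefix] + local

print(expand(":courseCode"))
print(expand("schema:courseCode"))
print(expand("course:DM101"))
```

The default prefix and the schema: prefix expand to the same IRI, so the shorter form costs nothing in precision. An unknown prefix raises a KeyError in this sketch, which is a reasonable stand-in for the parse error a Turtle processor would report.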

As a final thought on this particular topic, it’s worth understanding that a significant role of semantics is involved in mapping between different ontologies, that is, between different ways of representing information. In that particular case, namespaces are not only useful but very necessary, especially when multiple ontologies are in play at the same time. Doing this in JSON is problematic; doing it in Turtle is not.

I Tried the Semantic Web A Decade Ago, and It Sucked

Ten years ago, Turtle was just beginning to emerge from the primordial ooze, as was the Semantic Web. Most work was done using N-Triples (lists of URI triples), RDF/XML (which attempted to mix XML, RDF, RDFS, and the rudiments of OWL into a single melange and mostly failed), or other notations which were attempts at articulating a better solution to assertions. Moreover, the hardware, processing (using primarily inferencing), and tool support were all inadequate to the task at hand.

Even today there are older aspects of the Semantic Web that are subpar, but they are being phased out. SPARQL is undergoing a major facelift. SHACL provides a more accessible language for modeling constraints and validating data, and the introduction of RDF-Star is making it much easier to add metadata around data, something that JSON doesn’t do at all well. Moreover, the future of RDF, and Turtle, is clearly in the GPU space, where the kinds of massively parallel queries that RDF really needs can be handled in much the same way that the GPU has changed other aspects of the cognitive computing space. This means that a lot of organizations that played with basic knowledge graph systems in the 2010s are coming back to semantics in the 2020s as the need for things such as digital twins and augmented reality grows.

The next decade ultimately is about graphs: semantic knowledge graphs, network graphs, scene graphs, graphs of data pods, graph structures embedded within the computational kernels of deep learning. Blockchains, when you get right down to it, are graphs of relationships tied into global cryptographic identifiers, and Bayesian graphs and their variants are replacing statistical modeling as mechanisms to better understand system dynamics, especially as the interrelationships of personal, social, and economic graphs shift analyses from Gaussian distributions to Pareto ones. This means that languages intended to express graphs, such as Turtle, will become more dominant over time.

Turtle Won’t Help Me Get A Job

Today, that may almost be true. Turtle, and the whole semantic space, is definitely heating up, but it’s still a fairly small space. Tomorrow, it almost certainly won’t be the case. If you had said to someone in 2008 that getting a degree in statistics would guarantee a job in IT for the next two decades, that person would probably have thought you misguided, if not necessarily delusional. Certainly, there were pockets of people in bioinformatics who were beginning to understand that something was coming, but it took the advent of large scale data warehousing, and the need to do something with that data, to drive a wholesale shift in thinking about the role of data within organizations, and the role of data scientists in determining the signal in the noise.

Not all graphs are knowledge graphs. Labeled Property Graphs (LPGs) are actually used far more in the analytics space, primarily due to the presence of Neo4j. It should be noted that it is possible, with RDF-Star, to replicate LPGs with knowledge graphs (and vice versa). Put another way, if you’re not learning Turtle, you will likely be learning openCypher as yet another way of describing graphs, and will still need to learn to deal with graphs themselves.

GraphQL is a major arena of interest for both JavaScript and Python developers, yet as knowledge graphs become more pervasive, the GraphQL schemas are being generated not from TypeScript (which is useful for JSON stores and JavaScript in particular but is definitely non-standard for Python developers) but from SHACL, which is written in … Turtle. Put another way, Turtle is quietly becoming the language of specifications, and as graph technologies become more pervasive, it will become an increasingly important part of the toolkit that developers use to access data.

Conclusion

True data literacy comes from the ability to understand data in whatever transport format it is presented in, be that JSON, XML, Turtle, CSV, or similar languages. These are not programming languages: Turtle will not, and is not intended to, replace JavaScript, Python, C#, or whatever your programming language tool of the day is. Turtle is instead a data language, a metalanguage intended to express data so that it can be as expressive as possible with as few assumptions as possible, and it should be in everybody’s toolbox.