A Glossary of Knowledge Graph Terms - DataScienceCentral.com

As with many fields, knowledge graphs boast a wide array of specialized terms. This guide provides a handy reference to these concepts.

Resource Description Framework (RDF)

The Resource Description Framework (or RDF) is a conceptual framework, first published as a World Wide Web Consortium (W3C) Recommendation in 1999 and revised in 2004, for describing sets of interrelated assertions. RDF breaks down such assertions into underlying graph structures in which a subject node is connected to an object node via a predicate edge. The graph is then constructed by connecting the object nodes of one assertion to the subject nodes of another assertion, in a manner analogous to Tinker Toys (or molecular diagrams). Very complex statements can then be reduced to sets of simple assertions. RDF is considered abstract because it can be expressed using many different (and interchangeable) representations.

Triples

A triple is an assertion in RDF, consisting of a subject (the thing being discussed), a predicate (the particular relation in question) and an object (the item being related to the subject). These are often referred to by the letters ‘s’ for subject, ‘p’ for predicate and ‘o’ for object, with an assertion then being stated as s p o. For instance, the statement (or assertion) “Jane Doe is the current leader of Great Britain” would be broken down into a subject (“Jane Doe”), a predicate (“is the current leader of”), and an object (“Great Britain”). By working with these components in a very generalized fashion, one can create a graph of triples without necessarily having to specify a given context of information, making it easier to manipulate them.
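The s p o breakdown can be sketched in a few lines of Python. This is only an illustration of the idea, not an RDF library: the Triple class and its field names are assumptions made for this example, and plain strings stand in for real identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One assertion, split into subject, predicate and object (hypothetical helper)."""
    s: str
    p: str
    o: str


# The example assertion from the text, broken into its three parts.
leader = Triple(s="Jane Doe", p="is the current leader of", o="Great Britain")

# A second, made-up assertion whose subject is the object of the first.
triples = {leader, Triple("Great Britain", "is a", "Country")}

# A triple unpacks in s p o order.
subject, predicate, obj = leader
print(subject, "|", predicate, "|", obj)
```

Because each triple is an immutable, hashable value, a set of them behaves like a small graph in which repeating an assertion has no effect.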

Graph

A graph is a collection of assertions. Assertions are connected when the object of one assertion is the subject of another assertion. Because there is a preferred direction (from subject to object), the graph is called a directed graph, meaning that each edge has a recognized direction. Additionally, because each edge has an associated identifier (what graph theorists call a label) this is also known as a labeled graph.
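As a rough sketch of what directed and labeled mean in practice, the following Python groups a handful of illustrative assertions by subject. The data and the variable names are assumptions chosen for this example.

```python
from collections import defaultdict

# Illustrative assertions as (subject, predicate, object) tuples.
triples = [
    ("Jane Doe", "is a citizen of", "Germany"),
    ("Germany", "is a", "Country"),
    ("Jane Doe", "has given name", "Jane"),
]

# Directed: edges are stored under their subject and point at their object.
# Labeled: each edge carries its predicate as a label.
outgoing = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))

# Two assertions are connected when the object of one is the subject of another.
subjects = {s for s, _, _ in triples}
connecting_nodes = {o for _, _, o in triples if o in subjects}
print(connecting_nodes)
```

Here Germany is the only connecting node, since it is the object of the first assertion and the subject of the second.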

Knowledge Graphs

A knowledge graph is a graph that is specifically intended to hold a broad amount of information about an organization, domain, or interest. It frequently presents that information in a wiki-like format or via a card-based architecture (such as Google’s Card displays), and frequently acts as a specialized content management system. Not all graphs are knowledge graphs, but knowledge graphs in general are more likely to be RDF based.

Directed Cyclic and Acyclic Graphs

A graph is considered cyclic when it contains a loop: a chain of assertions in which each object is the subject of the next, and the final assertion’s object is the subject of the first. An example of a directed cyclic graph (DCG) would be the steps within a repeating process. If no such loops exist, the graph is called a directed acyclic graph (DAG). Trees and hierarchies are the most familiar kinds of DAG, though a DAG may also allow a node to have more than one parent.
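One way to tell the two cases apart is a depth-first search that watches for an edge leading back into the chain currently being explored. The sketch below assumes triples are plain tuples; has_cycle and the sample data are hypothetical names for this illustration, and the recursive form is only suitable for small graphs.

```python
def has_cycle(triples):
    """Return True if following subject -> object edges can lead back to a node."""
    edges = {}
    for s, _p, o in triples:
        edges.setdefault(s, set()).add(o)

    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}

    def visit(node):
        state[node] = IN_PROGRESS
        for nxt in edges.get(node, ()):
            status = state.get(nxt, UNSEEN)
            if status == IN_PROGRESS:
                return True  # edge back into the current chain: a loop
            if status == UNSEEN and visit(nxt):
                return True
        state[node] = DONE
        return False

    return any(state.get(n, UNSEEN) == UNSEEN and visit(n) for n in list(edges))


# A repeating process loops back to its first step.
process = [("Plan", "next", "Build"), ("Build", "next", "Test"), ("Test", "next", "Plan")]
# A simple hierarchy never loops.
hierarchy = [("Animal", "hasSubclass", "Mammal"), ("Mammal", "hasSubclass", "Dog")]
# Two routes to the same node, but still no loop.
diamond = [("A", "to", "B"), ("A", "to", "C"), ("B", "to", "D"), ("C", "to", "D")]

print(has_cycle(process), has_cycle(hierarchy), has_cycle(diamond))
```

The diamond case shows that a graph can be acyclic even when a node can be reached by two different routes.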

A depiction of a graph will often look like a mind-map, where each item in an oval in the mind map is a concept, and each arrow between these ovals is considered a relationship. Ontologists (people who work with conceptual data models) usually refer to the ovals as nodes and the arrows as edges. The term node comes from the Latin nodus, meaning knot, and the collection of nodes and edges forms a network (a net being, literally, knotted cord), or a working of knots. When you traverse a network, you move from node to node along the edges or links that connect the nodes. A path can then be considered as the list of edges that take you from one node to another. SPARQL, the RDF query language, includes a property path syntax for specifying such traversals, with the understanding that multiple edges may have the same predicate label, meaning that there may be multiple paths that connect two nodes in a graph.
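The idea of a path as a list of edges can be illustrated with a small breadth-first search. This is a sketch in plain Python rather than a path language; find_path and the sample assertions are assumptions made for the example.

```python
from collections import deque


def find_path(triples, start, goal):
    """Return the list of edges (triples) leading from start to goal, or None."""
    outgoing = {}
    for s, p, o in triples:
        outgoing.setdefault(s, []).append((s, p, o))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for edge in outgoing.get(node, []):
            nxt = edge[2]  # the object of the edge is the next node
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [edge]))
    return None


triples = [
    ("Jane Doe", "is a citizen of", "Germany"),
    ("Germany", "is a", "Country"),
    ("Jane Doe", "lives in", "Koln"),
    ("Koln", "is located in", "Germany"),
]

path = find_path(triples, "Jane Doe", "Country")
for edge in path:
    print(edge)
```

Two routes lead from Jane Doe to Country in this data. Breadth-first search returns the one with the fewest edges, and because edges only run from subject to object, no path exists in the opposite direction.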

URIs, IRIs, and Namespaces

An IRI (Internationalized Resource Identifier) is a string of characters intended to uniquely identify a given concept. Typically it will be given as a combination of some identifying authority coupled with a name or identifier that the authority uses internally for that concept. For instance, the UN Terminology Database (UNTerm) defines a standard called M49 (named for the United Nations publication “Standard Country or Area Codes for Statistical Use”, Series M, No. 49) for identifying countries. Germany, for instance, has an M49 code of 276. The International Organization for Standardization (ISO) also has a system that uses the alphabetic code “DEU” for the same country. Neither of these, by itself, is globally unique. However, a URI such as http://unterm.org/ns/m49/1999#276 is sufficiently unique to disambiguate the identifier globally. This is called a Uniform Resource Identifier or URI.

IRIs extend URIs so that they can include Unicode characters, such as the Chinese characters used to write Cantonese.

The ISO country code specification ISO 3166 could also be used to identify the same resource (Germany), and may look like http://www.iso.org/3166#DEU. Note that in both cases the same concept, the country of Germany, is being represented, which in turn implies (correctly) that any given concept (thing) can have multiple identifiers that refer to it, based upon context. However, any given IRI can only apply to one thing (resource).

These long URIs are often split into two parts, in the manner of qualified names, with the first (namespace) part giving the authority that defines the term and the second (local-name) part providing the term being declared. In the ISO example, for instance, http://www.iso.org/3166# is the namespace and DEU is the local name. Sometimes, especially in XML or Turtle, the namespace is represented by a shortened term called a prefix that’s mapped to the namespace and separated from the local name by a colon. For example, if countryCode stands for http://www.iso.org/3166#, then Germany is identified by countryCode:DEU. This shortened form is typically referred to as a CURIE (for Compact URI). Turtle makes extensive use of CURIEs.
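The relationship between a prefix, a namespace, and a local name can be sketched as a simple lookup. The countryCode prefix and both namespaces come from the examples above; the m49 prefix and the function names are assumptions made for this illustration.

```python
# Prefix-to-namespace map; "m49" is a hypothetical prefix for the UNTerm namespace.
PREFIXES = {
    "countryCode": "http://www.iso.org/3166#",
    "m49": "http://unterm.org/ns/m49/1999#",
}


def expand_curie(curie, prefixes=PREFIXES):
    """Turn prefix:localName into the full identifier."""
    prefix, sep, local_name = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local_name


def compact_iri(iri, prefixes=PREFIXES):
    """Shorten a full identifier to prefix:localName when a namespace matches."""
    for prefix, namespace in prefixes.items():
        if iri.startswith(namespace):
            return f"{prefix}:{iri[len(namespace):]}"
    return iri  # no known namespace: leave unchanged


full = expand_curie("m49:276")
print(full)
print(compact_iri(full))
```

Expansion is plain string concatenation, which is why the namespace conventionally ends in a separator such as # or /.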

Linked Data

After (Sir) Tim Berners-Lee (TBL) popularized the idea of the Semantic Web in the early 2000s, he described an associated concept called Linked Data in 2006. The idea was that by creating a common format for describing both information and metadata, one could use data from various triple stores without extensive ETL (extract, transform, load) work. Like many efforts of the time, Linked Data was surprisingly powerful conceptually and yet failed to make many inroads in business, primarily because the technological underpinnings necessary to create a Linked Data environment simply weren’t robust enough.

TBL’s most recent project, Solid, can be thought of as Linked Data 2.0, with more focus on infrastructure and a more robust technology stack.

IRIs, Literals, Blank Nodes, and NodeKind

Semantic objects (the third part of the assertion) can take one of two forms. The first form is when you have a specific globally unique identifier, or IRI, representing a concept (such as Germany). If additional assertions are related to that object, then the object may itself be a subject. For instance, consider the two assertions: Jane Doe is a citizen of Germany. Germany is a Country. Jane Doe, Germany, and Country are all concepts, and the object of the first assertion is the subject of the second.

The second kind of object is called a literal, which can be thought of as a string of characters representing data in some form. Conceptually, the simplest type of literal is a string. Thus Person:_JaneDoe (an IRI) may have a property called given name (Person:hasGivenName) whose value is the literal string "Jane"^^xsd:string, where the “cat’s eyes” (^^) introduce the literal’s simple type. If Jane was born on January 5th, 1987, then this might be expressed as

Person:_JaneDoe Person:hasBirthDate "1987-01-05"^^xsd:date.

Literals represent property values, meaning that they are usually leaf nodes with no additional connections. Standard RDF does not allow a literal in the subject position, but a value can still act as a subject if it is given an IRI of its own. For instance, treating a date as a subject may make sense, especially if you wish to use dates as a form of index.

Sometimes, you have information that’s linked together as a unit, but there’s no real benefit to explicitly creating an IRI. For instance, you may want to tie together address information into a single block:

Person:_JaneDoe Person:hasAddress [
      a Class:_Address;
      Address:hasStreetAddress "123 Sesame Strasse"^^xsd:string;
      Address:hasCity City:_Koln;
      Address:hasCountry Country:_Germany
    ].

(The notation used here is called Turtle, and is discussed below.)

The address block doesn’t need its own identifier. Instead, the system supplies a specialized identifier called a blank node. Such blank nodes can be manipulated in queries, but can’t be used to directly retrieve a given address by name without referring first to the user’s context. Most triple stores provide inherent support for blank nodes.

The system makes a distinction between IRIs (identifiers for concepts), literals (containers for properties) and blank nodes (structural scaffolding), referring to these three concepts collectively as node kind. For literals, the associated simple type is known as a datatype. Another distinction can be made within literals concerning the language for expressing strings. For instance, you may have “endeavor”@en-US and “endeavour”@en-GB. Both are language-tagged strings (in RDF 1.1 their datatype is rdf:langString rather than xsd:string), and they are seen as different terms based on the language used (also compare “намагатися”@uk, which is the Ukrainian translation of endeavor). The lang() function in SPARQL can be used to retrieve the language tag for a given string.
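The three node kinds can be modeled as three small Python classes. This is a sketch for illustration only: the class names, field names, and the node_kind function are assumptions, and CURIE-style strings stand in for full identifiers.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class IRI:
    value: str  # identifier for a concept


@dataclass(frozen=True)
class Literal:
    lexical: str                    # the characters of the value
    datatype: Optional[str] = None  # e.g. "xsd:date"
    language: Optional[str] = None  # e.g. "uk" for a Ukrainian string


@dataclass(frozen=True)
class BlankNode:
    label: str  # system-supplied, only meaningful locally


Node = Union[IRI, Literal, BlankNode]


def node_kind(node: Node) -> str:
    """Name the kind of node, mirroring the three-way distinction in the text."""
    if isinstance(node, IRI):
        return "IRI"
    if isinstance(node, Literal):
        return "Literal"
    if isinstance(node, BlankNode):
        return "BlankNode"
    raise TypeError(f"not a node: {node!r}")


germany = IRI("Country:_Germany")
birth_date = Literal("1987-01-05", datatype="xsd:date")
ukrainian = Literal("намагатися", language="uk")
address = BlankNode("b0")

for node in (germany, birth_date, ukrainian, address):
    print(node_kind(node), node)
```

Keeping the language tag separate from the lexical value means that two strings with the same characters but different tags compare as different terms.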

RDFS and OWL

After the initial creation of RDF, a simple language called RDF Schema (RDFS) was created to describe basic class and property relationships. Almost all schematic languages in RDF rely upon RDFS to describe class and property definitions and declarations, including the IMF schema. The Web Ontology Language (abbreviated as OWL) is a language for describing logical relationships in RDF in terms of first-order predicate logic rules. OWL is very powerful as a tool for making inferences and is still used heavily for this purpose, though it is also giving way to SHACL, especially with respect to knowledge graphs and similar constructs.

SKOS

The Simple Knowledge Organization System (or SKOS) is an RDF vocabulary, published as a W3C Recommendation in 2009, that was intended to provide ways of organizing taxonomies, especially Linnaean taxonomies that organize concepts based upon the level of specificity of a given concept. SKOS works reasonably well in cases where most graphs can be expressed as hierarchies but often fails when describing more complex relationships. Still, SKOS remains popular for taxonomies and provides common predicates for low-level vocabulary relationships such as preferred and alternative labels (synonyms), broader and narrower terms, and related concepts.

SKOS has an extension called SKOS-XL that creates an intermediate association between a label and its associated string text.

SPARQL

SPARQL is a query language for RDF, specifically designed to look for patterns within graphs and then either retrieve variables derived from those patterns or create new assertions based upon an existing selected graph. Its most recent version is SPARQL 1.1 (finalized in 2013), though it is likely that a SPARQL 2.0 will be fast-tracked within the next year based on changes in the space. SPARQL is very roughly analogous to SQL for relational databases but is generally more expressive and capable of retrieving and processing graph information from both within and outside of the triple store environment.
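The core idea of looking for patterns and retrieving variables can be approximated in Python with a matcher for a single triple pattern. This is a simplified sketch, not SPARQL: the match function is an assumption made for the example, the sample data is made up (Person:_JohnDoe is hypothetical), and terms beginning with ? are treated as variables.

```python
def match(pattern, triples):
    """Return one {variable: value} binding per triple that fits the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable binds on first use and must agree on any reuse.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break  # a constant that does not match
        else:
            results.append(binding)
    return results


triples = [
    ("Person:_JaneDoe", "Person:hasGivenName", "Jane"),
    ("Person:_JaneDoe", "Person:hasBirthDate", "1987-01-05"),
    ("Person:_JohnDoe", "Person:hasGivenName", "John"),
]

names = match(("?person", "Person:hasGivenName", "?name"), triples)
for row in names:
    print(row)
```

A full query joins several such patterns on their shared variables, which is what lets a query follow edges across the graph.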

SHACL

SHACL is the Shapes Constraint Language, a W3C RDF language intended to express more schema-oriented constraints than the more inference-oriented OWL language. (OWL predates SPARQL, while SHACL has some dependencies on it.) SHACL is used to identify and validate data shapes within RDF and is similar to other languages such as the XML Schema Definition Language (XSD). It can be used to validate that specific graphs satisfy certain shapes and to determine how a graph fails to conform to a shape, which is important in identifying compliance issues. Extensions to SHACL can be used to define functions within SPARQL in a consistent manner, and SHACL can be used to generate interfaces.
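To give a feel for what validating a shape involves, here is a hand-rolled cardinality check in Python. It is neither SHACL syntax nor a SHACL engine: the shape format, the validate function, and the sample address node are all assumptions made for this sketch.

```python
# Hypothetical shape: predicate -> (minimum count, maximum count).
ADDRESS_SHAPE = {
    "Address:hasStreetAddress": (1, 1),
    "Address:hasCity": (1, 1),
    "Address:hasCountry": (1, 1),
}


def validate(node, triples, shape):
    """Return a list of messages describing how the node fails the shape."""
    report = []
    for predicate, (min_count, max_count) in shape.items():
        count = sum(1 for s, p, _o in triples if s == node and p == predicate)
        if count < min_count:
            report.append(f"{node}: {predicate} occurs {count} time(s), expected at least {min_count}")
        elif count > max_count:
            report.append(f"{node}: {predicate} occurs {count} time(s), expected at most {max_count}")
    return report


# An address that is missing its country.
triples = [
    ("_:address1", "rdf:type", "Class:_Address"),
    ("_:address1", "Address:hasStreetAddress", "123 Sesame Strasse"),
    ("_:address1", "Address:hasCity", "City:_Koln"),
]

report = validate("_:address1", triples, ADDRESS_SHAPE)
for message in report:
    print(message)
```

The useful part is the report: rather than a bare pass or fail, each entry says which constraint was violated and by which node.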

Turtle

The name Turtle is a neologism formed from the abbreviation for Terse RDF Triple Language. Turtle evolved out of the necessity of having a terser language than RDF/XML (the original RDF serialization), and it emerged in parallel with the SPARQL language. It has become the primary language for expressing semantic information and data modeling and has even influenced JSON development.

RDFa and GRDDL

RDFa is a way of using RDF via HTML or XHTML attributes to describe the content of a web page or article. It has quietly replaced microformats for annotating specific blocks, though JSON-LD is arguably replacing RDFa. GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a W3C mechanism for extracting RDF from XML and XHTML documents, typically by way of transformations such as XSLT.

JSON-LD

Even as Turtle was making its way onto the scene, there were multiple efforts to create a JSON format that JavaScript and Python developers could universally use to represent linked data, primarily in ways that are more familiar (and more useful) for JavaScript developers in particular. The JSON-LD specification was one such format, though it represents five different, somewhat overlapping profiles.

Perhaps the most significant innovation of the JSON-LD format is the introduction of the context object (@context), which provides a JSON-friendly way of declaring namespaces. Traditionally, namespaces have been problematic for JSON users because they seem to complicate the simple format that JSON promised. However, once you end up with multiple authorities creating terminology, namespaces become unavoidable.
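A rough illustration of how a context maps prefixes to namespaces follows, using only the standard json module. This is a deliberately naive sketch and not the JSON-LD expansion algorithm: it handles only prefix:name keys and the @id value, and the http://example.org/ns/person# namespace is a made-up placeholder.

```python
import json

# A small JSON-LD-style document; the "person" namespace is hypothetical.
doc = json.loads("""
{
  "@context": {
    "countryCode": "http://www.iso.org/3166#",
    "person": "http://example.org/ns/person#"
  },
  "@id": "person:_JaneDoe",
  "person:hasGivenName": "Jane"
}
""")


def expand_term(term, context):
    """Replace a declared prefix with its namespace; leave other terms alone."""
    prefix, sep, local_name = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + local_name
    return term


context = doc["@context"]
expanded = {}
for key, value in doc.items():
    if key == "@context":
        continue  # the context is consumed, not copied
    new_key = key if key.startswith("@") else expand_term(key, context)
    new_value = expand_term(value, context) if key == "@id" else value
    expanded[new_key] = new_value

print(json.dumps(expanded, indent=2))
```

The document itself stays short and readable for JSON users, while the context carries enough information to recover globally unique identifiers.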

Inferencing

An inference uses logical rules to surface (or establish) new information from existing assertions that follow a given pattern. For instance, consider the following statements:

Person:_WendyDarling Person:hasParent Person:_MaryDoe.
Person:_MaryDoe Person:hasSibling Person:_ElizabethDoe;
                Person:hasSibling Person:_SarahDoe;
                Person:hasSibling Person:_ThomasDoe.
Person:_ElizabethDoe Person:hasGender Gender:_Female.
Person:_SarahDoe Person:hasGender Gender:_Female.
Person:_ThomasDoe Person:hasGender Gender:_Male.

Because the female sibling of a parent is an aunt, you can create a definition in SPARQL that looks something like:

construct {
     ?person1 Person:hasAunt ?sibling.
     }
where  {
     ?person1 Person:hasParent ?parent.
     ?parent  Person:hasSibling ?sibling.
     ?sibling Person:hasGender Gender:_Female.
     }

This creates two assertions:

Person:_WendyDarling Person:hasAunt Person:_ElizabethDoe.
Person:_WendyDarling Person:hasAunt Person:_SarahDoe.

These relationships were inferred based upon pre-existing rules (in this case the WHERE clause of the SPARQL statement).
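The same rule can be traced by hand in plain Python, which may help make clear what the query is doing. This is a sketch of the logic only, not of how a triple store evaluates SPARQL; the infer_aunts function is a name assumed for the example, and the data mirrors the statements above.

```python
triples = {
    ("Person:_WendyDarling", "Person:hasParent", "Person:_MaryDoe"),
    ("Person:_MaryDoe", "Person:hasSibling", "Person:_ElizabethDoe"),
    ("Person:_MaryDoe", "Person:hasSibling", "Person:_SarahDoe"),
    ("Person:_MaryDoe", "Person:hasSibling", "Person:_ThomasDoe"),
    ("Person:_ElizabethDoe", "Person:hasGender", "Gender:_Female"),
    ("Person:_SarahDoe", "Person:hasGender", "Gender:_Female"),
    ("Person:_ThomasDoe", "Person:hasGender", "Gender:_Male"),
}


def infer_aunts(triples):
    """Apply the rule: a female sibling of a parent is an aunt."""
    inferred = set()
    for person, p1, parent in triples:
        if p1 != "Person:hasParent":
            continue
        for s2, p2, sibling in triples:
            is_sibling_of_parent = s2 == parent and p2 == "Person:hasSibling"
            is_female = (sibling, "Person:hasGender", "Gender:_Female") in triples
            if is_sibling_of_parent and is_female:
                inferred.add((person, "Person:hasAunt", sibling))
    return inferred


new_triples = infer_aunts(triples)
for triple in sorted(new_triples):
    print(triple)
```

Thomas is a sibling of Wendy's parent but fails the gender condition, so only two new assertions are produced.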

Prior to the advent of SPARQL, RDF systems used internally defined inference rules and blank nodes to identify relationships, and the triples that were “realized” by this process relied on OWL rules for defining those relationships. The difficulty with such inferencing was that the triples produced frequently overwhelmed the memory available at the time. Moreover, because such inference rules were usually baked in (typically involving RDFS or certain OWL primitives), inferred triples seemed to appear by magic, and inconsistently from one platform to the next.

While inferencing is still used in knowledge graphs, its use is declining as the SPARQL/SHACL stack becomes more powerful and widespread.

GraphQL

An alternative approach to knowledge graphs was introduced by Facebook (now Meta), based upon a pattern-matching language called GraphQL that allows developers to write queries against a JSON store. A number of GraphQL interfaces have since been developed for knowledge graphs, based primarily upon SHACL rules that in turn generate GraphQL schema language content. While GraphQL is not, by itself, part of the W3C stack, it is becoming a fixture of graphs in general.

Summary

No glossary is going to be truly comprehensive, but this should provide an overview of the more common terms in the knowledge graph space.