
The Four Principles of Semantic Parsing

  • Kurt Cagle 

Over the last few decades, I have frequently heard vendors and developers talking about structured data, unstructured data, semi-structured data, and so forth. The arguments about what constitutes each of these categories get fairly vocal, particularly since everyone has a rough intuitive idea about what constitutes structure (tables) and what doesn’t (text). However, I had an epiphany the other day that makes the distinction between the two obvious, and it has nothing to do with the amount of text a given “blob” of data contains.

I’d argue that four principles dictate how data is structured:

  • The Parser Principle
  • The Data Uncertainty Principle
  • The Data Entropy Principle, and
  • The Principle of Deferred Semantics.

The Parser Principle

The first of these, the Parser Principle, describes what we mean by Structured Data.

If a parser exists for identifying components within a block of text (a sequence of characters), then that text is structured.

Let me give an example. One of the oldest tabular formats in existence is the comma-separated value (CSV) file. Such a file, by default, uses commas as field separators and line feeds as row delimiters, with column metadata provided in the first row. There are a few idiosyncrasies with the format (strings containing spaces or commas are usually enclosed in quotes, for instance). Still, overall, a second-year programming student should be able to write a program (called a parser) that converts this into the table/row/column structure that exists at the heart of most databases.
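
As a minimal sketch in JavaScript – ignoring the quoting rules and assuming a header row and comma-only separators – such a parser might look like this:

// A deliberately simple CSV parser: the header row supplies column names,
// line feeds delimit rows, and commas delimit fields. Quoting is omitted.
function parseCsv(text) {
  const [headerLine, ...rowLines] = text.trim().split("\n");
  const headers = headerLine.split(",").map((h) => h.trim());
  return rowLines.map((line) => {
    const cells = line.split(",").map((c) => c.trim());
    // Build one object per row, keyed by the header metadata.
    return Object.fromEntries(headers.map((h, i) => [h, cells[i]]));
  });
}

parseCsv("name,age\nAlice,34\nBob,29");
// => [{ name: "Alice", age: "34" }, { name: "Bob", age: "29" }]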

One way of thinking about this is that a parser converts a stream of characters into some form of structure, usually involving a metalanguage of some sort. For instance, one such parser might convert a CSV file into an XML or JSON stream, or perhaps into some internal binary representation.

XML and JSON are both metalanguages – they use a particular set of conventions to represent structure, regardless of the precise semantics of that structure. The specific rules governing what the property or field tags are will typically be specified in a schema or ontology, which is why saying that a given file is in XML tells you nothing about the semantics of that data, only that an XML parser will be able to reconstruct an instance of that schema if the file is properly formed.

HTML, on the other hand, is not a metalanguage but is instead a set of instructions to a particular program (called a browser or user agent) on how to render content. Along the way, the HTML parser creates an intermediate structure in memory called a Document Object Model (or DOM) that makes the data within the HTML stream addressable and that, usually, reflects changes in the rendering when the model is changed. While not all parsers create a DOM, enough do that it’s a worthwhile abstraction to say that a DOM is the internal binary representation of a parsed data stream.
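
In a browser environment, for instance, the built-in DOMParser does exactly this, turning a character stream into a queryable DOM:

// The HTML parser produces a tree of nodes that can be queried and modified,
// rather than an undifferentiated sequence of characters.
const doc = new DOMParser().parseFromString(
  "<p>Hello, <em>world</em>!</p>",
  "text/html"
);
console.log(doc.querySelector("em").textContent); // "world"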

For discussion purposes, however, the exact result of the parsing is irrelevant. What is important is that this output provides a mechanism for accessing data from the transformed content in a way that the initial unparsed content did not. Put another way, a parser increases the semantic content of the string.

The Data Uncertainty Principle

However, this is the tip of the iceberg for data formats. At the simplest end, you may have a string such as “52.78”. This should be immediately recognizable to anyone who has worked with code as a floating point number, right? Well, not necessarily. In the United States, the decimal separator is a period, but in much of Europe it is a comma. Moreover, this could be a floating point number, but it could also be a decimal number in which every digit is treated as a distinct value rather than an approximation. Indeed, most floating point numbers are approximations, albeit with a fairly large number of significant digits; for precisely this reason, you cannot directly represent the value 2/3 as a floating point number. Finally, there’s the possibility that this represents something altogether different, such as chapter 52, section 78 of some regulatory code.
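
A quick JavaScript illustration of that last point about approximation (JavaScript numbers are IEEE 754 doubles):

// 2/3 has no exact binary floating point representation; what gets stored
// is only the nearest approximation.
console.log(2 / 3);               // 0.6666666666666666 (already rounded for display)
console.log((2 / 3).toFixed(20)); // 0.66666666666666662966
// A decimal or rational datatype, by contrast, could preserve the value exactly.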

This points to a corollary of the above rule:

Until a parser is selected for operating on a sequence of text, that text is considered unstructured.

This is the Data Uncertainty Principle.

Put another way, there exists a set of transformations on a sequence of characters that will make it useful for computation, but until a specific transformation is chosen, that sequence is not meaningful data. This is a rule that people unfamiliar with data modeling frequently trip over when looking for magical solutions that will automatically parse data from one format to another – until you know the intent and context of the data, it is meaningless, and there could be an infinite number of ways to interpret it.

In RDF (and in fact in most data modeling scenarios) there is a concept called a datatype. A datatype can be thought of as an identifier for determining which parser is applied to a sequence of characters to make that data meaningful. In the XML Schema Definition Language (XSD), these are called simple types, a name which I suspect has led people to assume they are nothing more than simple. In the previous example, there is a very real distinction between “52.78”^^xsd:float, “52.78”^^xsd:decimal, “52.78”^^xsd:string and “52.78”^^lex:codex. They have different physical representations, and they have very distinct semantic interpretations. The string interpretation (“52.78”^^xsd:string) is simply another way of saying the information is unstructured – it is the identity operation on a sequence of characters.

Look at the last of these ("52.78"^^lex:codex). Chances are you’ve never heard of this particular “simple type”. Why not? I made it up. This expression says that there is a parser we can call lex:codex that, when applied to the string in question, will convert that string into meaningful data. Indeed, we can imagine that a function called lex:codex(), when applied to the string “52.78”, will create a data structure indicating a given chapter and section in a body of law.

<lex:codex>
       <lex:chapter>52</lex:chapter>
       <lex:section>78</lex:section>
</lex:codex>

The exact mechanism used to create this structure isn’t important here. What is important is that by identifying the lexical parser, we have also identified the semantics of the string in question. We have turned randomness into meaningful data.
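
Purely as an illustration – the function name lexCodex and its output shape are assumptions, not part of any standard – such a parser might be nothing more than:

// A hypothetical parser for the made-up lex:codex datatype: the string
// "chapter.section" becomes an explicit chapter/section structure.
function lexCodex(value) {
  const [chapter, section] = value.split(".").map(Number);
  return { chapter, section };
}

lexCodex("52.78"); // => { chapter: 52, section: 78 }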

To a significant extent, people in the semantic community have forgotten this (and for that matter, so have people in the XML community and many other programmers): a datatype or simple type is a mechanism for identifying which parser is used to turn strings into meaning. There are reasons for restricting datatypes to a handful of atomic types in some cases, mostly concerning optimizing sorts in databases. If you know a priori that you are dealing with floating point numbers, you can create more efficient indexes, which mattered given the relatively slow performance of databases even a decade ago.

However, sort order is itself part of what a datatype (and its parser) defines. For instance, under the lex:codex datatype, “4.15” (chapter 4, section 15) comes before “14.15” (chapter 14, section 15), whereas in purely lexical terms it comes after – a concept known as a datatype’s collation. By limiting yourself to only a small number of atomic types, you lose a potentially major source of information about the relative positioning of items in that space.
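
A sketch of what that collation might look like as a comparator (compareCodex is a hypothetical helper, not a standard function):

// Compare lex:codex values by chapter, then by section, rather than
// character by character.
function compareCodex(a, b) {
  const [chapterA, sectionA] = a.split(".").map(Number);
  const [chapterB, sectionB] = b.split(".").map(Number);
  return chapterA - chapterB || sectionA - sectionB;
}

["14.15", "4.15"].sort();              // ["14.15", "4.15"] – lexical order
["14.15", "4.15"].sort(compareCodex);  // ["4.15", "14.15"] – codex order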

This does raise a question: what happens when (as with the lex:codex example) one string maps to a complex structure? As an example, the string “47.6062° N, 122.3321° W” should be fairly recognizable as a coordinate on a globe – specifically, the location of Seattle, Washington – though there’s no guarantee of that (it could also be a location on Mars, for instance). Specifying it as a string tells you nothing; that is essentially the identity operation. However, you can qualify it as a coordinate in a given coordinate system – in this case, the WGS-84 vocabulary specified by the W3C, with the namespace

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

You could then refer to this value as "47.6062° N, 122.3321° W"^^geo:Point. Whether this is considered one value or two is irrelevant: you have informed the data environment about how a string of text can be interpreted as a complex structure – if necessary.
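
As a sketch, a parser registered under that geo:Point datatype might look something like the following (the lexical form with degree signs and hemisphere letters is an assumption made for illustration):

// Hypothetical parser for a "47.6062° N, 122.3321° W" style literal:
// hemisphere letters determine the sign, degrees become signed decimals.
function parseGeoPoint(value) {
  const match = value.match(/^([\d.]+)°\s*([NS]),\s*([\d.]+)°\s*([EW])$/);
  if (!match) throw new Error(`Not a recognizable coordinate: ${value}`);
  const lat = Number(match[1]) * (match[2] === "S" ? -1 : 1);
  const lon = Number(match[3]) * (match[4] === "W" ? -1 : 1);
  return { lat, lon };
}

parseGeoPoint("47.6062° N, 122.3321° W"); // => { lat: 47.6062, lon: -122.3321 }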

The Data Entropy Principle

So why do so? Because most data is unstructured or insufficiently structured for your needs. This is also a principle:

If data is not structured in a way that you can use, then it is no longer data.

Call this the Data Entropy Principle. An email document has the potential to carry a lot of information, and its structure has been well articulated for more than fifty years. There are many email parsers that can separate a stream of text into a body and multiple attachments, identify the various senders and receivers, and so forth. You can even say that there is an email datatype (not just an email address datatype). However, the semantics of the actual content of each email, beyond the envelope and payload, remain relatively undifferentiated.

The same principle applies to blueprints which, by dint of how they are designed, can capture a lot of data, but very little of that data is all that meaningful. This is becoming even more pressing in the age of machine learning because we can use image recognition on video to identify people and things, but determining context and intent semantically is far harder and computationally more expensive.

If, when I scan in an image (or better, convert it to something like scalable vector graphics), I can identify its components, then it becomes possible to say that a given line is a thing of a certain length – and that if one diagram records that line as 300 cm and another as 3.0 meters, they are the same length (and may potentially be the same thing).

The universe does not generally provide us with keys. We have to infer them and go to some length to ascertain that when two things have different keys, they may still be the same thing (or vice versa). To the extent that we can, in the process of encoding our data, identify the parsers for that encoding, we can also go a long way in determining when keys that describe objects describe the same object and not just simply the same type of object.

The Principle of Deferred Semantics

This hints at one last principle:

If the parser applied to a string is known a priori, then you need only parse that string when the data is needed, keeping only the data that is relevant to you.

This is the Principle of Deferred Semantics.

We do not always need to collect and preserve all the information about an event (which is what data is). We do not even usually need to. I’d argue that we capture far more information than is relevant, for fear of losing something critical that we will probably never need.

If, on the other hand, we know the parser – the datatype that best describes the data in question – we can avoid parsing information we don’t need until we do need it. If I have information that says that a given string of data is a news feed expressed in JSON, I can defer the actual processing of that data until it becomes necessary. The advantage is that inflating the structure (parsing the feed) is likely still far cheaper than trying to recreate the feed as an index somewhere. In essence, I can index and store searchable keys while retaining the feed itself as a blob. If I need greater fidelity in that search, I can always reindex with a more refined parser.
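
A minimal sketch of that pattern, assuming a hypothetical app:jsonNewsFeed datatype that names the parser:

// Deferred semantics: retain the raw feed as a blob alongside its datatype
// (the name of its parser), index a few cheap keys up front, and inflate
// the full structure only when it is actually needed.
const record = {
  datatype: "app:jsonNewsFeed", // hypothetical datatype identifier
  blob: '{"title":"Morning briefing","items":[{"headline":"Markets open higher"}]}',
  keys: { title: "Morning briefing" }, // shallow, searchable index
};

function inflate(rec) {
  if (rec.datatype !== "app:jsonNewsFeed") throw new Error("No parser registered");
  return JSON.parse(rec.blob); // parse only on demand
}

inflate(record).items[0].headline; // => "Markets open higher"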

This is important because datatypes themselves are keys that can link to type-characteristic metadata – authority, provenance, governance, reliability, and so forth. If I can store the fact that doctype:AnnualReport is expressed as XML, for instance, then even if I don’t know how to parse the exact structure, I can parse it into a form from which another process may be able to infer the relevant information. If I have a unit expressed in centimeters, I know that 1 cm is 0.01 m and can convert accordingly.

Applying the Four Principles

We can use parsers (datatypes) to infer relationships with other datatypes, making the whole system far more intelligent, with less and less need for human intervention. The following Turtle declarations illustrate the idea:

Units:_Centimeter a Class:_Units;
       Units:hasDimension Dimensions:_Length;
       Units:hasUnitScheme UnitSchemes:_MKS;
       Units:hasBaseUnits Units:_Meter;
       Units:hasParser """(cm)=> 0.01 * xs.double(cm)"""^^javascript:_Function;
       Units:hasSymbol """(cm)=> `${cm}cm`"""^^javascript:_Function;
       Units:hasBaseType xsd:double;
       .

Units:_Meter a Class:_Units;
       Units:hasDimension Dimensions:_Length;
       Units:hasUnitScheme UnitSchemes:_MKS;
       Units:hasParser """(m)=> xs.double(m)"""^^javascript:_Function;
       Units:hasSymbol """(m)=> `${m}m`"""^^javascript:_Function;
       Units:hasBaseType xsd:double;
       .

Units:_Millimeter a Class:_Units;
       Units:hasDimension Dimensions:_Length;
       Units:hasUnitScheme UnitSchemes:_MKS;
       Units:hasParser """(mm)=> 0.001 * xs.double(mm)"""^^javascript:_Function;
       Units:hasSymbol """(mm)=> `${mm}mm`"""^^javascript:_Function;
       Units:hasBaseType xsd:double;
       .

Dimensions:_Length a Class:_Dimensions;
       Dimensions:hasBaseUnit Units:_Meter;
       .

Entities:_MyThing1 Entities:hasLength "20"^^Units:_Centimeter.
Entities:_MyThing2 Entities:hasLength "0.12"^^Units:_Meter.
Entities:_MyThing3 Entities:hasLength "300"^^Units:_Millimeter.

These declarations (in Turtle) illustrate the principles discussed here. Each entity has a length expressed through a different parser: 20 cm via Units:_Centimeter, 0.12 m via Units:_Meter, and 300 mm via Units:_Millimeter. Due to the principle of deferred semantics, at first blush the measures would appear to involve different units altogether. However, a comparatively simple SPARQL query can reveal whether two measures are simply variations in units of the same dimension:

ASK {
       bind(datatype($a) as ?a_datatype)
       bind(datatype($b) as ?b_datatype)
       ?a_datatype Units:hasDimension ?a_dimension.
       ?b_datatype Units:hasDimension ?b_dimension.
       filter(sameTerm(?a_dimension, ?b_dimension))
}

This query will return true if both measurements are given in units of the same dimension – both lengths, both durations, both of some other dimension – regardless of the specific units. This is essentially an apples-and-oranges test: you can compare meters and centimeters if you convert one to the other, but you cannot combine meters and seconds (Einstein’s relativity notwithstanding).

The next part – ordering a list of items based upon the results of a parser – gets into some slightly non-standard functions (you can do this in many, but not all, SPARQL implementations):

select ?item ?formattedValue where {
       ?entity Entities:hasLength ?item.
       bind(datatype(?item) as ?itemDatatype)
       bind(str(?item) as ?itemStr)
       ?itemDatatype Units:hasDimension ?dimension.
       ?dimension Dimensions:hasBaseUnit ?baseUnit.
       ?itemDatatype Units:hasParser ?parser.
       ?baseUnit Units:hasSymbol ?symbol.
       bind(strdt(spex:jsapply(?parser, ?itemStr), ?baseUnit) as ?value)
       bind(spex:jsapply(?symbol, ?value) as ?formattedValue)
       values ?dimension {Dimensions:_Length}
} order by ?value
 

The function spex:jsapply() (a proxy for any JavaScript evaluator) applies the JavaScript parser function to the string representation of each item’s value, converting it to a double in the dimension’s base unit (the dimension’s default atomic type) and ordering the output based upon that calculated value. The Units:hasSymbol property of the base unit, meanwhile, supplies a function that generates a string representing the indicated length in the base unit (here, meters).

The output is a table in which the items are ordered by length (for the data above: _MyThing2 at 0.12 m, then _MyThing1 at 0.2 m, then _MyThing3 at 0.3 m).

[Figure: Ordered output from multiple length units.]

A similar approach could have embedded the units within the string itself, such as “12cm”, and then resolved the mapping internally in the parsing function. The takeaway is that, however the parsing is performed, using a parsing function as a datatype makes it far more feasible to extract meaning from otherwise undifferentiated text.

Note that there is a trade-off here. The ordering is determined by a calculation rather than by an index lookup, as would otherwise be the case, so it may end up being an order of magnitude slower than an indexed query. However, this may also be a case where, once the ordering has been done, new triples of the form

Entities:_MyThing1 Entities:hasResolvedLength "0.20"^^Units:_Meter.

could be generated from the calculated values and indexed using SPARQL Update. Restated, you are using deferred semantics to generate new information from the parsed values, then storing that information back into the triple store.
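
A sketch of that round trip from JavaScript, assuming a SPARQL 1.1 Update endpoint at a hypothetical URL and example namespaces (run in a module so top-level await is available):

// Materialize the resolved length as a new, indexable triple via SPARQL Update.
const update = `
  PREFIX Entities: <http://example.org/entities#>
  PREFIX Units:    <http://example.org/units#>
  INSERT DATA {
    Entities:_MyThing1 Entities:hasResolvedLength "0.20"^^Units:_Meter .
  }`;

await fetch("http://localhost:3030/ds/update", {
  method: "POST",
  headers: { "Content-Type": "application/sparql-update" },
  body: update,
});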

Conclusion

By treating literals with bespoke datatypes as unparsed text paired with the names of parsers that can derive semantics from that text, it becomes possible to significantly expand the semantic information a knowledge graph holds without necessarily having to resolve all of the information in that text at ingest time. This applies to more than dimensional modeling, but dimensional modeling is a good place to explore these concepts in greater depth.