
The Second Coming of XML

  • Kurt Cagle 
Far from being dead, XML is gaining new life as a data mapping language.

When XML was first introduced, the W3C XML Working Group took a very unusual step: it created a language for transformations. That effort is now leading to a re-emergence of XML as the need to map between data representations becomes more and more pressing.

The Birth of XSLT

XML was (arguably) a simplified form of SGML, a document specification language whose roots go back to IBM’s GML in the late 1960s and that was standardized in the 1980s. One of the key needs of the earlier language was, not surprisingly, to provide formatting for the structural elements that SGML defined.

Put another way, an SGML document (and by extension, an XML document) might define an article with a headline, a byline, a body, and so forth, but the document did not indicate things like page size, which fonts were used, how big the margins were, or other bits of markup that were potentially critical for displaying the document. The XML effort recognized that CSS played a major part in this role for the web ecosystem, but it also made sense to create a more structural language for layout called XSL, the Extensible Stylesheet Language. This, in turn, was divided into two components: XSL Formatting Objects (XSL-FO), which defined things like pages, blocks, footers, and so forth, and the XSL Transformation language (XSL-T, though eventually the hyphen was dropped from use).

In the long run, XSLT turned out to be far more significant than XSL-FO because it was a language specifically designed for converting one XML document into another, regardless of what each of those documents was meant to specify. This kind of transformation, from one declarative instance to another, happens any time you have two different representations of a given object. For instance, if company A represented a quarterly report one way (one schema) and company B represented a quarterly report in a different schema, you could, in theory, transform one report into the other format with XSLT.

In practice, XSLT started out somewhat underpowered, and it has gone through two major revisions since it was first published. The latest version, XSLT 3.0, is a considerably more powerful transformation language that supports streaming, deep iterators, embedded text evaluation, maps, functions, schema awareness, and I/O capabilities, among many other features. Because of that, there are few XML structures that it can’t transform. XSLT 3.0 is also recursive and template-based, which can make for powerful transformers, but that also tends to garner resistance from developers who are more comfortable working with iteration.
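
To give a flavor of that recursive, template-based style, here is a minimal sketch (the template and parameter names are invented purely for illustration): a named template that sums a sequence of numbers by processing the head of the sequence and recursing on the tail.

<!-- Minimal sketch of XSLT's recursive, template-based style.
     The template and parameter names are illustrative only. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <xsl:call-template name="sum-list">
      <xsl:with-param name="items" select="(322, 265, 173)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Classic head/tail recursion: fold the first item into the total,
       then recurse on the remainder of the sequence. -->
  <xsl:template name="sum-list">
    <xsl:param name="items" as="xs:integer*"/>
    <xsl:param name="total" as="xs:integer" select="0"/>
    <xsl:choose>
      <xsl:when test="empty($items)">
        <xsl:value-of select="$total"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:call-template name="sum-list">
          <xsl:with-param name="items" select="tail($items)"/>
          <xsl:with-param name="total" select="$total + head($items)"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>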

The Exile of XML and Schadenfreude

In the mid-2000s, developers revolted against the XML stack, feeling that XML was overly verbose, complex, and, most importantly, a poor fit for the object paradigm that dominated thinking at the time. XSLT drew the ire of web developers in particular because, well, most programs weren’t built around recursion as a central principle. Oh, and angle brackets. Seriously, who uses angle brackets? This culminated in the emergence (and rapid spread) of JSON, and, outside of enterprise- and government-level publishing efforts, XML was largely written off as a has-been technology.

In the interim, the JSON community realized that it needed a way of specifying structure and so essentially invented a stripped-down (and anemic) schema language, JSON Schema. It also discovered that when you get complex structures, paths become very important, effectively reinventing XPath over the next fifteen years. Somewhere along the line, there was a realization that namespaces were not as useless as the early JSON advocates had thought, so JSON-LD was created to “simplify” namespaces by building a @context object into JSON. In other words, since its inception, JSON has slowly been turning into XML without the angle brackets, minus the really weird transformation component that becomes very important when data moves outside of a controlled environment.

Several attempts at building transformations for JSON have been made. Indeed, there is a good case that the plethora of frameworks, from React to Angular and so forth, have essentially been attempts at building transformations into the JSON ecosystem, and they have met with only limited success.

Dueling Data Models

As JSON was beginning to gain traction, several people in the XML community began asking what it would take to make XML and JSON compatible with one another. This meant setting aside the semantic differences between the two specs and focusing on the XML Data Model (XDM) vs. the JSON Data Model (JDM). The result was a surprisingly small list of differences (simplifying somewhat):

  • Adding an XML sequence into another XML sequence results in a single, larger XML sequence. Adding a JSON array into another JSON array results in an array embedded in another array.
  • An XML element can have multiple child elements with the same name in any order. A JSON object can have only one property with a given key, though the value of that property can be an array.
  • XML has a considerably more fine-grained breakdown of atomic types.
  • JSON has a “streaming” solution, which involves decoupling an array into a set of lines, one object per line. XML has a single-root containment restriction, but sequences of XML “documents” have been around since the early 2000s.

It turns out that if you incorporate the notion of an array node into an XML document with specific limitations, then XML can represent any JSON out there. However, the inverse is not true: you cannot represent all XML structures in JSON. From a data model standpoint, JSON is a subset of XML, something that was proved in 2013 by David Lee.

Once these discrepancies are taken into account, it is possible to represent JSON as an XML instance by using the XML to represent the underlying data model of JSON, not just the JSON instance. XSLT 3.0 contains a translation function, json-to-xml(), that builds this map. For instance, consider the following JSON content:

{
  "desc"    : "Distances between several cities, in kilometers.",
  "updated" : "2014-02-04T18:50:45",
  "uptodate": true,
  "author"  : null,
  "cities"  : {
    "Brussels": [
      {"to": "London",    "distance": 322},
      {"to": "Paris",     "distance": 265},
      {"to": "Amsterdam", "distance": 173}
    ],
    "London": [
      {"to": "Brussels",  "distance": 322},
      {"to": "Paris",     "distance": 344},
      {"to": "Amsterdam", "distance": 358}
    ],
    "Paris": [
      {"to": "Brussels",  "distance": 265},
      {"to": "London",    "distance": 344},
      {"to": "Amsterdam", "distance": 431}
    ],
    "Amsterdam": [
      {"to": "Brussels",  "distance": 173},
      {"to": "London",    "distance": 358},
      {"to": "Paris",     "distance": 431}
    ]
  }
}
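
Before looking at the result, it is worth sketching how such a conversion might be invoked. The following minimal XSLT 3.0 stylesheet reads the JSON as raw text and hands it to the standard json-to-xml() function (the parameter name and default file name here are hypothetical, not part of the example):

<!-- Sketch: read a JSON document and emit the map/array XML produced by
     the standard json-to-xml() function. The parameter name and default
     file name are hypothetical. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- URI of the JSON source (hypothetical default) -->
  <xsl:param name="json-uri" select="'cities.json'"/>

  <xsl:template name="xsl:initial-template">
    <!-- unparsed-text() pulls in the raw JSON; json-to-xml() builds the map -->
    <xsl:sequence select="json-to-xml(unparsed-text($json-uri))"/>
  </xsl:template>

</xsl:stylesheet>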

Run through json-to-xml(), the JSON above gets translated to the following XML:

<map xmlns="http://www.w3.org/2005/xpath-functions">
    <string key="desc">Distances between several cities, in kilometers.</string>
    <string key="updated">2014-02-04T18:50:45</string>
    <boolean key="uptodate">true</boolean>
    <null key="author"/>
    <map key="cities">
      <array key="Brussels">
        <map>
            <string key="to">London</string>
            <number key="distance">322</number>
        </map> 
        <map>
            <string key="to">Paris</string>
            <number key="distance">265</number>
        </map> 
        <map>
            <string key="to">Amsterdam</string>
            <number key="distance">173</number>
        </map> 
      </array>
      <array key="London">
        <map>
            <string key="to">Brussels</string>
            <number key="distance">322</number>
        </map> 
        <map>
            <string key="to">Paris</string>
            <number key="distance">344</number>
        </map> 
        <map>
            <string key="to">Amsterdam</string>
            <number key="distance">358</number>
        </map> 
      </array>
      <array key="Paris">
        <map>
            <string key="to">Brussels</string>
            <number key="distance">265</number>
        </map> 
        <map>
            <string key="to">London</string>
            <number key="distance">344</number>
        </map> 
        <map>
            <string key="to">Amsterdam</string>
            <number key="distance">431</number>
        </map>  
      </array>
      <array key="Amsterdam">
        <map>
            <string key="to">Brussels</string>
            <number key="distance">173</number>
        </map> 
        <map>
            <string key="to">London</string>
            <number key="distance">358</number>
        </map> 
        <map>
            <string key="to">Paris</string>
            <number key="distance">431</number>
        </map>
      </array>
    </map>
</map>

In this case, what’s getting translated is not so much the semantics of the message but the semantics of the JSON syntax. Once in this format, a second transformation can then be applied either to convert this into HTML or to convert it into some other, more semantically meaningful XML output, such as:

<distanceMap>
   <summary updated="2014-02-04T18:50:45">Distances between several cities, in kilometers.</summary>
   <distances>
      <from city="Brussels">
         <to city="London">322</to>
         <to city="Paris">265</to>
         <to city="Amsterdam">173</to>
      </from>
      <from>...</from>
      <from>...</from>
   </distances>
</distanceMap>
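
A second-stage stylesheet along the following lines could produce that structure. This is a sketch only: the match patterns show one possible way of walking the map/array markup, and the output element names simply follow the fragment above.

<!-- Sketch of a second-stage transform: from the map/array XML produced by
     json-to-xml() to the more semantically meaningful <distanceMap> form. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:j="http://www.w3.org/2005/xpath-functions">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/j:map">
    <distanceMap>
      <summary updated="{j:string[@key='updated']}">
        <xsl:value-of select="j:string[@key='desc']"/>
      </summary>
      <distances>
        <!-- Each array under "cities" becomes a <from> element -->
        <xsl:for-each select="j:map[@key='cities']/j:array">
          <from city="{@key}">
            <xsl:for-each select="j:map">
              <to city="{j:string[@key='to']}">
                <xsl:value-of select="j:number[@key='distance']"/>
              </to>
            </xsl:for-each>
          </from>
        </xsl:for-each>
      </distances>
    </distanceMap>
  </xsl:template>

</xsl:stylesheet>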

It is even possible, using different XSLT transformations (without getting into the details here), to output to CSV, to different JSON structures, or to RDF formats (such as Turtle):

DistanceMap:_123
    a Class:_DistanceMap ;
    rdfs:label "Distance Map" ;
    Entity:hasLastUpdated "2014-02-04T18:50:45"^^xsd:dateTime .

graph DistanceMap:_123 {
    City:_Brussels City:hasConnection
        [Connection:toCity City:London ; Connection:hasDistance "322"^^Units:_Kilometers] ,
        [Connection:toCity City:Paris ; Connection:hasDistance "265"^^Units:_Kilometers] ,
        [Connection:toCity City:Amsterdam ; Connection:hasDistance "173"^^Units:_Kilometers] .
    City:_London City:hasConnection .... .
}
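
Without reproducing the full transformation, a sketch of how such a Turtle output might be driven is simply a text-mode stylesheet walking the same map/array markup (the prefixes and property names below just echo the fragment above and are illustrative):

<!-- Sketch only: emit Turtle-like text from the map/array XML shown earlier.
     The prefixes (City:, Connection:, Units:) echo the fragment above and
     are illustrative rather than standard vocabularies. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:j="http://www.w3.org/2005/xpath-functions">

  <xsl:output method="text"/>

  <xsl:template match="/j:map">
    <xsl:for-each select="j:map[@key='cities']/j:array">
      <xsl:text>City:_</xsl:text>
      <xsl:value-of select="@key"/>
      <xsl:text> City:hasConnection&#10;</xsl:text>
      <xsl:for-each select="j:map">
        <xsl:text>    [Connection:toCity City:</xsl:text>
        <xsl:value-of select="j:string[@key='to']"/>
        <xsl:text>; Connection:hasDistance "</xsl:text>
        <xsl:value-of select="j:number[@key='distance']"/>
        <xsl:text>"^^Units:_Kilometers]</xsl:text>
        <xsl:value-of select="if (position() = last()) then ' .' else ','"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>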

In effect, the XSLT3 approach is context-free: the semantics of the source become immaterial. For RDF this is beneficial for several reasons. Not only does it make ingestion simpler, but because RDF output formats are standardized, you can transform from those formats to different JSON or XML outputs. You can also combine these outputs with zip-style compression to create more sophisticated documents, as many document formats are just zipped XML content.

Moreover, this approach can be taken in reverse. By working with the same micro-language, it is possible to build deep JSON by constructing the various <map>, <array>, <string>, <number>, <boolean>, and <null> elements and then serializing the result back out as JSON.
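
A minimal sketch of that reverse direction uses the standard xml-to-json() function; the fragment being serialized here is invented purely for illustration:

<!-- Sketch: build a small fragment in the xpath-functions vocabulary and
     serialize it back to JSON with the standard xml-to-json() function.
     The fragment itself is invented for illustration. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:fn="http://www.w3.org/2005/xpath-functions">

  <xsl:output method="text"/>

  <xsl:template name="xsl:initial-template">
    <xsl:variable name="doc" as="element()">
      <fn:map>
        <fn:string key="to">London</fn:string>
        <fn:number key="distance">322</fn:number>
      </fn:map>
    </xsl:variable>
    <!-- Emits {"to":"London","distance":322} -->
    <xsl:value-of select="fn:xml-to-json($doc)"/>
  </xsl:template>

</xsl:stylesheet>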

Invisible XML as Building Blocks

This atomic decomposition (and utilization of XSLT 3.0) has been attracting a number of different people. In June 2022, Steven Pemberton published the Invisible XML specification, which again uses context-free grammars to describe non-XML structures in an XML-transformable way.

A great deal of information tends to be caught up in condensed text formats. For instance, consider an address block. Often, rather than trying to capture an address block as distinct input fields, it may just be more efficient to capture an address directly as a string and then use micro-parsers to express this information as an established XML structure. Once encoded in this manner, the details of the address can then be extracted as JSON, CSV, XML, or RDF. In the case of RDF, this microparsing can be used to make inferences on data embedded in long strings or text documents, such as lat-long strings, email content, CSS, etc.

This process is facilitated by a specialized iXML grammar language (not itself expressed in XML) that performs the relevant parsing. For instance, a typical URL might have an iXML grammar that looks like the following:

url: scheme, ":", authority, path.

scheme: letter+.

authority: "//", host.
host: sub++".".
sub: letter+.

path: ("/", seg)+.
seg: fletter*.
-letter: ["a"-"z"]; ["A"-"Z"]; ["0"-"9"].
-fletter: letter; ".".

An iXML processor could read this grammar and apply it to a URL (here, http://www.w3.org/TR/1999/xhtml.html) to generate an XML structure:

<url>
   <scheme>http</scheme>:
   <authority>//
      <host>
         <sub>www</sub>.
         <sub>w3</sub>.
         <sub>org</sub>
      </host>
   </authority>
   <path>
      /<seg>TR</seg>
      /<seg>1999</seg>
      /<seg>xhtml.html</seg>
   </path>
</url>

Once segmented, the results can be further mapped via XSLT3 to RDF, JSON, CSV or similar formats.
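
As one small, hypothetical illustration of that last step, an XSLT 3.0 template can pick apart the parsed <url> structure and emit a bit of JSON using XPath 3.1 maps and the serialize() function (the JSON keys chosen here are purely illustrative):

<!-- Sketch: map the parsed <url> structure to a small JSON object.
     Element names follow the iXML output above; the JSON keys are
     illustrative. -->
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/url">
    <!-- Build an XDM map and serialize it with the JSON output method -->
    <xsl:value-of select="serialize(
      map {
        'scheme' : string(scheme),
        'host'   : string-join(authority/host/sub, '.'),
        'path'   : '/' || string-join(path/seg, '/')
      },
      map { 'method' : 'json', 'indent' : true() })"/>
  </xsl:template>

</xsl:stylesheet>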

What’s the advantage of the Invisible XML approach? Among other things, it decouples parsing from function invocations in any particular language (JavaScript, Java, Perl, Python, R, etc.), provides a traceable trail for those invocations, and can even be used for semantically oriented search and navigation.

Why Invisible Is Good

XML has a few key advantages that make it especially powerful as a micro-language. It is far more readily transformable (via XSLT) than JSON is, at a time when the transformation of data is critical. Fifteen years after its inception, JSON still does not have a very good transformation language (GraphQL is the closest, and it has many significant limitations), and even SPARQL’s update facility tends to be too linear in comparison to the deeply recursive orientation of XSLT. I liken this emerging XML to proteins generated by messenger RNA (self-modifying XSLT). Finally, you can express any JSON as XML using the XML-encoded JDM. This makes such invisible XML ideal for acting as a bridge between data formats and data representations, providing transformations that are easier to write, easier to debug, and more robust than using imperative code on JSON or Turtle.