This is the first in a series about cross format data modeling principles.
In the data modeling realm, there is perhaps no example as ubiquitous as modeling personal names. After all, things don’t get much simpler than a name:
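As a concrete starting point, a name might be captured as nothing more than a pair of fields (this JSON sketch is illustrative; the field names follow the discussion below):

```json
{
    "firstName": "Jane",
    "lastName": "Doe"
}
```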
Simple, right? Well, not so fast. This isn’t really a model, but rather an instance of a model - an actual example that proves out the model. There are, however, a number of ways that such a model can be described. If, for instance, we use a language such as RDF, this would be modeled as follows:
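A sketch of that model in Turtle notation might look like the following (the relative IRIs and literal values are illustrative, matching the angle-bracket convention used in the text):

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# The class itself
<person> a rdfs:Class .

# Its two properties, each anchored by a domain and a range
<firstName> a rdf:Property ;
    rdfs:domain <person> ;
    rdfs:range  xsd:string .

<lastName> a rdf:Property ;
    rdfs:domain <person> ;
    rdfs:range  xsd:string .

# An instance of the class
<person1> a <person> ;
    <firstName> "Jane" ;
    <lastName>  "Doe" .
```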
What you see is a set of assertions that identify that there exists a class named “person”, and that this class has two properties. The domain and range assertions on each property are important, because they indicate what class the property is used on (the domain) and what is expected as the value of the property (the range). Note also that <person> is a class, while <person1> is an instance of that class. It’s important to note that this description is agnostic with regard to physical representation. For instance, the model above could just as readily be rendered in JSON or XML, as is shown in Listing 3.
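Renderings of the <person1> instance in those two formats might look something like this (sketches; the exact serializations will depend on the conversion conventions used):

```json
{
    "person": {
        "firstName": "Jane",
        "lastName": "Doe"
    }
}
```

```xml
<person>
    <firstName>Jane</firstName>
    <lastName>Doe</lastName>
</person>
```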
So far, so good. Now, suppose that Jane decides to get married and changes her name to her husband’s lastName (he’s James Dean). This is where modelers face a conundrum. You could of course simply change the name, but what if you wanted to track all of the names that a person has had? This occurs all the time, by the way - people exist over time, and things change.
Most relational database designers have to make a conscious decision about whether to create a new table or not, but this is again a physical model issue, not a logical one. You also can’t just say you have a new firstName1, firstName2, etc., because that makes a limiting assumption about how often people change their name. Instead, the usual solution is to bite the bullet and create a table (or, in modeling terms, set up a first normal form).
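In SQL terms, that first-normal-form move looks something like this (a sketch; the table and column names are illustrative, echoing the key names used later in the text):

```sql
-- The base entity
CREATE TABLE Person (
    personKey INTEGER PRIMARY KEY
);

-- One row per name, however many a person accumulates
CREATE TABLE PersonName (
    personNameKey INTEGER PRIMARY KEY,
    personKeyRef  INTEGER REFERENCES Person (personKey),
    firstName     VARCHAR(64),
    lastName      VARCHAR(64)
);

-- Both of Jane's names survive:
INSERT INTO Person VALUES (1);
INSERT INTO PersonName VALUES (1, 1, 'Jane', 'Doe');
INSERT INTO PersonName VALUES (2, 1, 'Jane', 'Dean');
```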
This may solve one problem - acknowledging that some items can have cardinality (the range of potential numbers of items) beyond 1 - but there’s a much more subtle issue that can cause a great degree of grief if not properly handled.
The model given above makes no assumption about the order of items - if you have four items, the system will not preferentially return them in the order that they were entered (e.g., (“a”, “b”, “c”, “d”) could just as readily be returned as (“c”, “b”, “d”, “a”)). To support ordering, it becomes necessary to dig a little deeper into the data model.
An array is an abstract data structure, though one used so often that people tend to forget that it is in fact an abstraction. I can describe that array at a data model level by defining a couple of classes and associated properties:
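A sketch of that abstraction in Turtle might look like this (the <_item_> property name is an illustrative assumption; <_index_> follows the notation discussed below):

```turtle
# Abstract container and member classes
<array>     a rdfs:Class .
<arrayItem> a rdfs:Class .

# Links an array to its items
<_item_> a rdf:Property ;
    rdfs:domain <array> ;
    rdfs:range  <arrayItem> .

# The position of an item within its array
<_index_> a rdf:Property ;
    rdfs:domain <arrayItem> ;
    rdfs:range  xsd:integer .
```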
This may seem like overkill, but in practice what it does is provide an abstract layer that says that anything that is a subclass of an array can hold anything that is identified as a subclass of an array item. I use the notation <_index_> to indicate that these would likely be “consumed” by the conversion process to a given format such as JSON or XML (in a production model these would likely live in a different namespace, rather than relying on a potentially breakable syntactic convention). Now, with the above defined, it becomes possible to change the model from above:
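The revised name model might then look roughly like this (a sketch; the <hasNames> and <_item_> property names are illustrative assumptions):

```turtle
# Name containers and entries specialize the array abstraction
<personNames> rdfs:subClassOf <array> .
<personName>  rdfs:subClassOf <arrayItem> .

# An illustrative property linking a person to their name array
<hasNames> a rdf:Property ;
    rdfs:domain <person> ;
    rdfs:range  <personNames> .

<person1> a <person> ;
    <hasNames> <personNames1> .

<personNames1> a <personNames> ;
    <_item_> <personName1>, <personName2> .

<personName1> a <personName> ;
    <_index_> 1 ;
    <firstName> "Jane" ;
    <lastName>  "Doe" .

<personName2> a <personName> ;
    <_index_> 2 ;
    <firstName> "Jane" ;
    <lastName>  "Dean" .
```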
That may seem like a lot of work, but there are some major benefits to be gained by going into the deeper model. This creates a consistent data model for SQL, XML and JSON, making each of these interchangeable. Listing 7 illustrates how each of these gets rendered through a semantically neutral process:
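Those renderings might look roughly like the following sketches - first a JSON form keyed by class name, then a JSON form carrying the class as a _type_ attribute, then XML (exact shapes are illustrative):

```json
{
    "person": {
        "_id_": "person1",
        "personNames": [
            { "firstName": "Jane", "lastName": "Doe" },
            { "firstName": "Jane", "lastName": "Dean" }
        ]
    }
}
```

```json
{
    "_type_": "person",
    "_id_": "person1",
    "personNames": [
        { "_type_": "personName", "firstName": "Jane", "lastName": "Doe" },
        { "_type_": "personName", "firstName": "Jane", "lastName": "Dean" }
    ]
}
```

```xml
<person _id_="person1">
    <personName>
        <firstName>Jane</firstName>
        <lastName>Doe</lastName>
    </personName>
    <personName>
        <firstName>Jane</firstName>
        <lastName>Dean</lastName>
    </personName>
</person>
```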
The two JSON formats represent two different approaches: the first is a class-key approach, where the class name acts as the key, while the second takes an identifier approach, which treats the class as just another attribute (given by _type_). In the XML format, the _type_ information is given by the XML tag, while _id_ is treated as an attribute. So what happened to the indexes?
Remember that both XML and JSON respect ordering within a component - the first as an implicit characteristic of the language (albeit a historical artifact), the second through the explicit array structure. The index values are used to determine how the lists are reassembled, but once that is done they are dropped: if they were retained, the implicit ordering and the explicit attribute could get out of sync very easily.
SQL and CSV-like outputs, on the other hand, would retain the indexes, because they have no such guarantee of ordering. The SQL approach produces an extra table with a one-to-one relationship between (Person) and (PersonNames); this redundancy helps to preserve round-tripping. The rather awkward personKeyRef and personNameKeyRef are ways of making it clear that these are the pointers to the array and base elements respectively, since SQL can only incorporate inbound links when dealing with references to objects (RDF is not limited by this restriction, which is one of the more compelling arguments for using an RDF triple store).
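The SQL rendering, with the indexes kept explicit, might look something like this (a sketch; the column layout is an illustrative flattening of the tables named in the text):

```sql
CREATE TABLE Person (
    personKey INTEGER PRIMARY KEY
);

-- Array table: one row per name entry, index retained
CREATE TABLE PersonNames (
    personNameKey INTEGER PRIMARY KEY,
    personKeyRef  INTEGER REFERENCES Person (personKey),
    _index_       INTEGER,
    firstName     VARCHAR(64),
    lastName      VARCHAR(64)
);

-- Reassembling the list requires the explicit index:
SELECT firstName, lastName
FROM   PersonNames
WHERE  personKeyRef = 1
ORDER  BY _index_;
```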
The above points to one deficiency in the way most data is modeled: we often fail to recognize that a “table” is usually the representation of an entity, and that the entity changes over time, necessitating changes in cardinality. As often as not it is cardinality changes, rather than the need to add new properties to a model, that cause the biggest integration headaches, because such changes necessitate the creation of new tables.
In effect, there is a tug of war occurring here between normalization and denormalization. NoSQL formats such as XML and JSON are intrinsically denormalized: it is trivially easy to create deeply nested structures, and there is no real penalty for going from a named field that holds one value to one that holds two or more. SQL, on the other hand, incurs a fairly significant penalty. You have to construct a new table any time you change cardinality from "1" to "many", a "0" (optional) value requires that you store an arbitrary NULL, and if you do go from one to many, all of your pointers change from outbound pointers (pointers that originate in the source (parent) table and end in the target (child) table) to inbound pointers (pointers from target to source).
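That pointer flip can be seen directly in SQL (a sketch; the two schemas are alternatives, not meant to coexist):

```sql
-- Cardinality 1 ("a person has one name"): the pointer is outbound,
-- stored in the source (Person) row.
CREATE TABLE Person (
    personKey     INTEGER PRIMARY KEY,
    personNameRef INTEGER  -- points out to the single name row
);

-- Cardinality many ("a person has one or more names"): the pointer
-- flips inbound, stored in each target (PersonName) row instead.
CREATE TABLE PersonName (
    personNameKey INTEGER PRIMARY KEY,
    personKeyRef  INTEGER  -- points back in to the owning Person row
);
```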
This last point is significant, because key representations no longer represent semantic relationships. A person has one or more person-names is an outbound relationship (and represents what modelers call a composition). However, SQL can only say "a person-name belongs to a person", which is a weaker relationship (called an association).
In a graph language, this distinction is useful because it clearly identifies when a given data structure is explicitly bound to an entity - if I remove that person from the data structure, I know that I can also remove its compositional components, because they have no real existence except relative to their associated person. Because SQL forces a single direction for a relationship, determining whether something is a composition or an association becomes a human activity, rather than one that can be determined programmatically. This in turn makes automated integration FAR harder.
One of the major advantages of RDF is that you can model things more naturally, and because of that it’s easier to set up relationships that can be cleanly automated - and that provide the ability to go from a hierarchical structured format (NoSQL) to a table-based flat structured format (SQL). I like to call this process "zipping". The unzipped form is the normalized one, where you have flat tables and relational keys (SQL or CSV), while the zipped form is the denormalized one, an XML or JSON structure. I hope to discuss the process of zipping in part two of this series.
As you may have noticed, the discussion here uses names because they hide a lot of unstated assumptions, but the reality is that this applies just as much to most data structures you will find in enterprise-level ontologies. The bigger take-away is that we're moving into a world where data conversion and data integration dominate, and the kind of thinking that just adds a couple of fields to an object to represent a name or other singleton entries (addresses, contact information, emails, companies, stores - the list is pretty much endless) is likely to get you into trouble, especially when data worlds collide.
In my next article in this series (check back here for the new link) I’ll look at the temporal aspect of change, as well as exploring controlled vocabularies and how they fit into contemporary data modeling.