Data Science and the Open World Assumption

A funny thing happened in the last few years. We began to lose the Closed World Assumption.

Now, I understand that this is not exactly earth-shattering news; most people do not in fact realize that they have been using the Closed World Assumption to begin with. However, I'd contend that this shift is having a transformative effect upon the way that we interact with data, one that may very well change our perspective on information in ways perhaps as profound as Ted Codd's introduction of the relational model in the 1970s.

Open the Closed World Doors

In basic terms, the closed world assumption can be stated as "When we model something, our model is complete." Most people who have had to define a data model recognize that this statement is at best a convenient fiction: any effort to define an object completely ultimately comes down to identifying which attributes of that object are relevant to the particular business domain. Yet even with this observation, the necessity of restricting attributes is so fundamental to the way that models are designed and built that it is seldom challenged.

Part of the reason for this lies in the programming and analysis tools that derive from the model. It is the rare programmer who designs an application with the assumption that the attributes of the data that he or she is working with may very well grow and mutate over time. With the closed world assumption comes surety: it is not necessary to perform discovery upon a model to find out all of the attributes that are available. Instead, semantics are clearly articulated for every attribute, relationships are well established, and there is neither ambiguity nor uncertainty about the interpretation of the data.

Take those assumptions away, and programming becomes much more ... exciting. During the first wave of web services, in the early 2000s, two of the central issues were discovering the structure of data and discovering its provenance. The solution to the former problem was to describe the data via WSDL, an XML-based contract that promised any requester that what a service returned, given a set of parametric end-points, would validate against a declared structure.

Typically, most programmers would then immediately run this structure through a tool such as JAXB that would map the service to a particular method and map the data to a corresponding data structure in Java or similar languages. In other words, rather than deal with the structure in ambiguous terms (using XML navigational tools such as XPath or XSLT, which can extract relationships under an open world assumption), these programmers would, with great alacrity, convert the data back into closed world objects.
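The contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a fixed dataclass stands in for a JAXB-style binding, ElementTree's path support stands in for XPath, and the element names and the Customer class are hypothetical, invented for the example.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical payload: the service has started returning an extra element.
PAYLOAD = """
<customer>
  <name>Ada</name>
  <email>ada@example.com</email>
  <loyaltyTier>gold</loyaltyTier>
</customer>
"""


@dataclass
class Customer:
    """Closed world binding: only the attributes known at design time."""
    name: str
    email: str


def bind_closed(root: ET.Element) -> Customer:
    # Anything not declared on Customer is silently discarded.
    return Customer(name=root.findtext("name"), email=root.findtext("email"))


def discover_open(root: ET.Element) -> dict:
    # Open world navigation: ask the document which children it actually has.
    return {child.tag: (child.text or "").strip() for child in root.findall("./*")}


root = ET.fromstring(PAYLOAD)
bound = bind_closed(root)
found = discover_open(root)
# Elements present in the data but invisible to the bound object.
unbound = sorted(set(found) - {"name", "email"})
print(bound)
print(unbound)
```

The bound object is convenient, but it can never tell you about loyaltyTier; the navigational version can, at the cost of having to decide what to do with it.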

JSON and XML and RDF, Oh My! 

Now, XSLT (currently up to XSLT 3.0) is a curious beast. It is a templating language, and works upon the notion that the incoming XML source will have certain patterns that, if caught, can be transformed. In other words, XSLT does not in fact require that an incoming XML document satisfy any constraints. This is an especially useful capability to have when dealing with documents, because such documents will, almost by definition, have a great deal of unpredictability about which elements appear in them and in which order. That it does this through recursive processing, typically (though not always) walking down the various branches of a hierarchical structure, points out just how radically different this form of programming is from that of the Java-esque world (and may help to explain why few Java programmers feel comfortable working with XSLT in the first place).
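To make the template idea concrete, here is a rough Python analogue of that processing style. It is not XSLT and does not use an XSLT engine; it is a minimal sketch in which a dictionary of handlers plays the role of templates, a fallback handler imitates XSLT's built-in rule of descending into unmatched elements, and the document and tag names are made up.

```python
import xml.etree.ElementTree as ET


def render(node, templates):
    """Apply the handler registered for node.tag, or the fallback rule."""
    handler = templates.get(node.tag, default_template)
    return handler(node, templates)


def apply_children(node, templates):
    """Recurse into the node's content, keeping text in document order."""
    parts = [node.text or ""]
    for child in node:
        parts.append(render(child, templates))
        parts.append(child.tail or "")
    return "".join(parts)


def default_template(node, templates):
    # Unknown elements are not an error; recursion simply continues
    # into their content.
    return apply_children(node, templates)


# Only two patterns are "caught"; everything else falls through.
TEMPLATES = {
    "title": lambda n, t: apply_children(n, t).upper() + "\n",
    "em": lambda n, t: "*" + apply_children(n, t) + "*",
}

DOC = (
    "<article><title>Open worlds</title>"
    "<p>Models are <em>never</em> complete.<aside>unplanned</aside></p>"
    "</article>"
)

output = render(ET.fromstring(DOC), TEMPLATES)
print(output)
```

Note that nothing here declares that an aside may appear inside a p. The element was never anticipated, and the processing still succeeds.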

The evolution of JavaScript from a simple client "scripting" language into an increasingly first class enterprise language (via Node.js and the growing number of NoSQL databases such as Couchbase or MongoDB) has followed a similar trajectory. A JavaScript object (or JSON object, though the two are not quite the same thing) is a composite of linear arrays and hash maps. There is no explicit requirement for schematic type definitions (though the latest ECMAScript proposals incorporate the ability to specify type characteristics as advisory, rather than required, metadata).

This means in practice that a JSON object has no requirement to be a closed world entity; discovery comes from querying, and actions take place when specific patterns within the query are matched in the source objects. Again, there is an instinctive tendency among programmers to assert semantics onto these structures and to work upon the assumption that certain properties are de facto required, but the reality is that applications built this way are usually very fragile and highly sensitive to changes in the data model.
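A small Python sketch shows the difference between the two habits. The records and property names below are invented for illustration; the point is only the shape of the code, in which one function assumes a property exists and the others query for what is actually there.

```python
import json

# Hypothetical records: no two share exactly the same properties.
RECORDS = json.loads("""
[
  {"id": 1, "name": "Ada", "email": "ada@example.com"},
  {"id": 2, "name": "Grace", "phone": "555-0100", "tags": ["vip"]},
  {"id": 3}
]
""")


def contact_fragile(record):
    # Closed world habit: assume "email" is always present.
    return record["email"]


def contact_tolerant(record):
    # Open world habit: act on whichever known pattern is present.
    for key in ("email", "phone"):
        if key in record:
            return f"{key}:{record[key]}"
    return None


def discover_keys(records):
    # Discovery by querying: which properties occur, and how often?
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return counts


contacts = [contact_tolerant(r) for r in RECORDS]
key_counts = discover_keys(RECORDS)
print(contacts)
print(key_counts)
```

Calling contact_fragile on the second or third record raises a KeyError, which is precisely the fragility described above.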

Indeed, mutability has become a key characteristic of modern JSON objects. Objects add and shed properties throughout their processing life-cycle. They take on interim semantics that provide more information about the object within the context of the operation, adding properties and methods with seeming abandon. There are potential problems with such approaches, namespace collisions being the most significant, but the reality today of most JavaScript programming is that it routinely jumps the divide to the Open World Assumption, especially once higher order functions are incorporated into the mix.
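One way to picture this life-cycle, and the collision problem that comes with it, is the following Python sketch. The prefixing convention, the helper names enrich and shed, and the order record are all assumptions made for the example, not an established practice this article is describing.

```python
def enrich(record, namespace, **extra):
    """Return a copy of record with interim properties added under a prefix."""
    enriched = dict(record)
    for key, value in extra.items():
        enriched[f"{namespace}:{key}"] = value
    return enriched


def shed(record, namespace):
    """Drop every interim property that belongs to one processing stage."""
    prefix = f"{namespace}:"
    return {k: v for k, v in record.items() if not k.startswith(prefix)}


order = {"id": 42, "status": "paid"}

# Two stages both want a "status" property; the prefixes keep them apart
# and leave the original "status" untouched.
staged = enrich(order, "shipping", status="queued")
staged = enrich(staged, "fraud", status="cleared", score=0.02)

# Once the fraud check is done, its interim properties are shed again.
final = shed(staged, "fraud")
print(final)
```

Without the prefixes, each stage would have overwritten the status set by the one before it, which is the namespace collision in miniature.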

RDF is, of course, completely characterised by its abandonment of any Closed World Assumption requirement whatsoever. There are schema languages (RDFS/RDFS+/multiple flavors of OWL, SPIN and other such tools) that can provide constraints on existing RDF data, but these constraints work upon the assumption that all things are possible unless they are explicitly limited. The XSD family of XML schemas assumes the opposite - that nothing is possible unless explicitly declared - and this is actually a major limitation, though there are "outs", such as the <xsd:any> tag, that can violate that assumption. Still, it's noteworthy that for many data modelers who come from an XML background, <xsd:any> is considered a failure of effective modeling, rather than an acknowledgement that no model can be completely described.
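The practical difference shows up when you ask about something the data does not mention. The sketch below uses plain Python tuples rather than an RDF library, and the ex: names are hypothetical. The explicit DENIED set is a simplification standing in for the negative statements a richer language such as OWL can express.

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical facts as (subject, predicate, object) triples.
TRIPLES = {
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
}

# Explicit negative statements (a simplification for this sketch).
DENIED = {("ex:bob", "ex:worksFor", "ex:globex")}


def ask_closed(triple):
    # Closed world: whatever is not stated is taken to be false.
    return Answer.TRUE if triple in TRIPLES else Answer.FALSE


def ask_open(triple):
    # Open world: absence of a statement is not evidence of its negation.
    if triple in TRIPLES:
        return Answer.TRUE
    if triple in DENIED:
        return Answer.FALSE
    return Answer.UNKNOWN


question = ("ex:alice", "ex:worksFor", "ex:globex")
closed_answer = ask_closed(question)
open_answer = ask_open(question)
print(closed_answer, open_answer)
```

The closed world reading concludes that Alice does not work for Globex; the open world reading only concludes that nobody has said so yet.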

Impacts Upon Data Science

The open world assumption has a profound impact upon data science (and provides a very good reason why data analysts should understand the deep vagaries of modeling). One of the key goals of the analyst or statistician lies in understanding which independent and dependent variables most accurately describe a given model. Put another way, most stochastic methods work upon the open world assumption, even if it's not stated as such.

At the same time, with the rise of "big data", there is a growing amount of data where the models have been explicitly defined. These models, however, reflect business processes that in many cases fail to account for critical variables (attributes) that explain why specific phenomena occur. The role of the data scientist in that case is to re-establish context for this data, as this context is critical for any type of predictive analysis. The statistician can look at this information to detect trends, patterns and anti-patterns, but the analyst then needs to be able to ascribe causal agents - weather variability, real-world events, regional or cultural variations, even psychological motivations - to these patterns to move them out of "random" events and into anticipated ones.
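As a toy illustration of re-establishing context, consider the Python sketch below. The sales figures, dates and the rain attribute are all invented, and the helper names add_context and mean_by are hypothetical. It shows only the mechanics of attaching an outside attribute to records whose original model never included it.

```python
from statistics import mean

# Hypothetical sales records as the business process defined them.
SALES = [
    {"date": "2024-06-01", "units": 120},
    {"date": "2024-06-02", "units": 60},
    {"date": "2024-06-03", "units": 110},
    {"date": "2024-06-04", "units": 70},
]

# Hypothetical outside context; note that one day has no observation.
WEATHER = {
    "2024-06-01": {"rain": False},
    "2024-06-02": {"rain": True},
    "2024-06-03": {"rain": False},
}


def add_context(rows, context):
    # Merge whatever context exists; rows without any are left as they were.
    return [{**row, **context.get(row["date"], {})} for row in rows]


def mean_by(rows, key, value):
    # Group only the rows that actually carry the attribute.
    groups = {}
    for row in rows:
        if key in row:
            groups.setdefault(row[key], []).append(row[value])
    return {k: mean(v) for k, v in groups.items()}


enriched = add_context(SALES, WEATHER)
units_by_rain = mean_by(enriched, "rain", "units")
print(units_by_rain)
```

A difference between the two groups would suggest an association worth investigating, not a demonstrated cause, and the record with no weather observation stays in the data rather than being forced into either group.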

What this means in practice is that the models that such analysts build themselves become open-ended, and account for what amounts to the mutability of the data models over time. This doesn't necessarily mean that the models suddenly change their reflection of what is "real". Rather, what emerges is that the data lifecycle itself will likely grow in terms of what is known when, and as this model becomes more complete (and complex), this will change the nature of the analytics performed upon it.

In a similar vein, these more complex models in general are not easily represented via SQL, and many analytics tools that exist today are still very much tied into the SQL paradigm. That is changing, but as the change happens it also provides new ways of looking at information that simply was not easily done in the past. Best practices for analysing data are changing as open models replace closed, and this in turn is changing the very shape of data analytics.

For instance, hierarchical data forms provide a way of more efficiently encoding textual information. The structural metadata is augmented by semantic metadata (data enrichment, semantic tagging and mapping, natural language processing and conceptual inferencing), each of which makes it possible to get more context from what has traditionally been a fairly limited data source for analysts and statisticians (beyond fairly simple lexical analysis). This is undeniably open world data, where pattern templates and advisory models replace clearly defined attributes.


Overall, this open model (or open world) assumption makes the data scientist's job more challenging, but also provides a much richer and more dynamic view of what are, in essence, ways of looking at the real world. Semantics, natural language processing and textual analytics, long a backwater of analytics, are becoming more critical to the overall modeling, statistical processing and analysis of all data, and in the process are reshaping the field of data science dramatically.

Kurt Cagle is a data scientist, information architect and writer, working for Avalon Consulting, LLC.
