
Why FAIR data assets are essential to AI data management

By Alan Morrison

One of the efforts our Dataworthy Collective will be ramping up in 2024 involves standardizing the building of logical knowledge graphs at the level of the document object. The goal is to make spreadsheets trustworthy, sharable and reusable on a standalone basis at web scale. 

Charles Hoffman, who leads the effort, and others on his team believe that data assets need to be findable, accessible, interoperable, and reusable (FAIR) for data products to realize their true potential as an asset category. The explicit, straightforward formal logic in knowledge graphs makes it possible to scale both trustworthiness and the possibilities for reuse. Datalog is a useful language for that purpose. 

Datalog is a logic programming language and a syntactic subset of Prolog. It is a simple language that lets users express facts, rules, and queries over relations in a database, so that logically consistent knowledge bases can be built. A number of semantic graph database management platforms currently support Datalog, including Oxford Semantic Technologies and TerminusDB. 
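To give a feel for how Datalog works, here is a minimal sketch in Python of what a Datalog engine does under the hood: store facts, apply rules such as ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z). until no new facts can be derived, then answer queries against the derived set. The predicates and names are illustrative only and not tied to any particular platform.

```python
# Minimal sketch of bottom-up (naive) Datalog evaluation in plain Python.
# Facts are tuples of (predicate, subject, object).
facts = {
    ("parent", "ada", "byron"),
    ("parent", "byron", "clara"),
}

def derive(facts):
    """Apply the ancestor rules repeatedly until a fixpoint is reached."""
    derived = set(facts)
    while True:
        new = set()
        # Rule 1: ancestor(X, Y) :- parent(X, Y).
        for (p, x, y) in derived:
            if p == "parent":
                new.add(("ancestor", x, y))
        # Rule 2: ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z).
        for (p1, x, y) in derived:
            if p1 == "parent":
                for (p2, y2, z) in derived:
                    if p2 == "ancestor" and y2 == y:
                        new.add(("ancestor", x, z))
        if new <= derived:        # fixpoint: nothing new was derived
            return derived
        derived |= new

kb = derive(facts)
# Query: ancestor("ada", Z)?
print(sorted(z for (p, x, z) in kb if p == "ancestor" and x == "ada"))
# -> ['byron', 'clara']
```

Real Datalog engines add indexing, stratified negation, and incremental maintenance, but the logical core is the same: rules plus facts yield a consistent, queryable set of conclusions.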

Logical knowledge graphs

Logical knowledge graphs commingle what developers usually treat as separate: the data and the logic. That way, logic that would otherwise be trapped inside applications lives alongside the data and can grow organically with it. The logic becomes both machine and human readable, as well as callable from the graph. 
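As a toy illustration (my own sketch, not the data model of any particular graph platform), consider a graph in which a profit rule is stored as a node right next to the figures it operates on; an application can look the rule up, read it, and call it:

```python
# A graph as a list of triples, with logic stored as data alongside the data.
graph = [
    ("acme", "hasRevenue", 120),
    ("acme", "hasCost", 90),
    # The rule itself is a node in the graph, not code buried in an app.
    ("profitRule", "appliesTo", "acme"),
    ("profitRule", "hasExpression", "hasRevenue - hasCost"),
]

def lookup(subject, predicate):
    """Return the first object found for (subject, predicate), or None."""
    return next((o for s, p, o in graph if s == subject and p == predicate), None)

def call_rule(rule, subject):
    """'Call' a rule stored in the graph against a subject node."""
    left, op, right = lookup(rule, "hasExpression").split()
    assert op == "-"                     # only subtraction in this toy sketch
    return lookup(subject, left) - lookup(subject, right)

print(call_rule("profitRule", "acme"))   # -> 30
```

Because the rule lives in the graph, it can be inspected by people, evaluated by machines, and versioned and reused along with the data it describes.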

This logic doesn’t have to be Datalog-oriented. W3C standards such as the Web Ontology Language (OWL) and the Shapes Constraint Language (SHACL), though they refer to reasoning capabilities, don’t refer specifically to programmatic logic. Maybe that’s just a preference of those who have worked on the standards. 

Datalog advocates assert that Datalog is more expressive than OWL, for example. Academics I’ve known who work on reasoning applications of knowledge graphs have chosen Datalog as their default; they seem to have the most confidence in its reasoning capabilities.

I know that coders versed in traditional application development often take exception to the “logic” being mixed in with the “data.” I’d counter that having the logic fully accessible, evolving as the data evolves, makes the logic itself more broadly reusable. The more reusable it is, the less need there is to duplicate logic, and the more likely it is to stay consistent and contribute to sensemaking. The result? Less complexity and less code sprawl. 

Why effective AI management starts with knowledge graph-based FAIR data management

In a previous post, I pointed out that the Organisation for Economic Co-operation and Development’s (OECD’s) Council updated its definition of artificial intelligence. The European Parliament then adopted the OECD’s definition, which is as follows:

An AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments. Different AI systems vary in their levels of autonomy and adaptiveness after deployment.

Note the phrase “infers, from the input it receives, how to generate outputs.” AI systems infer how to generate outputs from inputs. In other words, AI systems are entirely dependent on the quality of their data inputs. 

We can talk all we want about trustworthy models, but when it comes to statistical models being trustworthy, inputs rule. High data quality is a prerequisite. When the input is garbage, the output will be garbage as well. Understanding the implications of this fact is critical, not only to data management, but also to AI management. 

To manage AI effectively, managers and their staffs will have to get their hands dirty and work directly with the data and logic commingled in a logical knowledge graph. Why? Because AI quality depends on data quality. 

The way we know how to scale data quality is a knowledge graph approach in which logic and data evolve together as a unit in organic graphs. Machines can assist in many ways, but the process has to be designed and led by skilled humans, and the goal of the spreadsheet object work is to empower everyday businesspeople to be hands-on with the data and logic in knowledge graphs.

Contextual computing as a necessary precondition for AI

High data quality at scale implies logical reusability and consistency from context to context. At scale, statistical models can take advantage of a web of reusable, interoperable contexts, a.k.a. interactive digital twins. 

For AI to operate continually in the background of our everyday lives, we’ll need to scale contextual computing so that we can associate the AI with more everyday processes and tasks. We can only accomplish this at scale if the logic models in the knowledge graphs that make sensemaking possible are themselves consistent.

I’ve found it helpful to think of semantic spreadsheets as spreadsheets with an explicit context. Once you’ve expressed that context explicitly, trusting that spreadsheet becomes more feasible, and others can reuse the spreadsheet, perhaps by adapting it to another explicit context.
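To make that notion concrete, here is a small sketch, again in Python, of a spreadsheet object whose context (entity, period, unit, currency) is explicit, structured data rather than an implicit assumption. The field names and figures are purely illustrative:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Context:
    entity: str      # who the figures describe
    period: str      # reporting period
    unit: str        # unit of measure
    currency: str    # ISO 4217 currency code

@dataclass
class SemanticSheet:
    context: Context
    cells: dict      # e.g. {"Revenue": 120, "Cost": 90}

q1 = SemanticSheet(
    context=Context(entity="Acme Corp", period="2024-Q1",
                    unit="thousands", currency="USD"),
    cells={"Revenue": 120, "Cost": 90},
)

# Reuse means adapting the context explicitly, rather than silently copying
# numbers into a new setting and hoping the old assumptions still hold.
q2 = SemanticSheet(context=replace(q1.context, period="2024-Q2"),
                   cells={"Revenue": 135, "Cost": 95})

print(q1.context.period, q1.cells["Revenue"])   # 2024-Q1 120
print(q2.context.period, q2.cells["Revenue"])   # 2024-Q2 135
```

Once the context travels with the cells, anyone downstream can see exactly what the numbers mean, and can trust or adapt the sheet accordingly.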