Citizen data scientists are a good thing, but are they the only thing?

Data rules the world and, as a result, we want data to drive all the big decisions about how the world works because we believe in rationality, evidence, and science. But most organizations struggle to leverage their data to competitive advantage. It’s probably even accurate to say that, statistically speaking, almost all organizations fail to get this right.

So, what do we do about this problem? One way of trying to fix it is to decide that the problem is really one of skills and know-how. This is exactly the approach taken by the citizen data science movement, which encourages, among other things, a widespread program of education in applied statistics. Consider Salesforce Einstein, which wants to make everyone a data scientist, not just through education, but also through software platforms and UX that amount to low code or no code, self-serve analytics. With the right UX and tools, their approach is to teach everyone how to perform statistical inference so they can build a more data-driven world…one in which organizations pursue more productive or efficient business strategies. 

Creating widespread numeracy and statistical literacy is not only good for organizations and for business, but it’s also good for democratic societies and for science. It will also have good downstream effects when it comes to meeting global crises like the next pandemic and climate change.

Meaning matters more than data

But what if statistical literacy isn’t the only impediment to the goal of transforming how organizations use data to make decisions? After all, that’s what we really want to accomplish. If we could have a world where data drove policy and business outcomes, but most people were really skilled and interested in, say, painting, sculpture, and music, that would be just fine, too. Widespread statistical literacy may not be the most important end in itself.

What if the problem isn’t with people but with the data itself? Making people smarter is great, but what if it’s not enough? Statistical techniques are somewhat immune to noise in data, but there are limits and, of course, dealing with bad data is more work even if we have many techniques to compensate. So, what if, in addition to citizen data science, there are other things we can do that will increase our odds of using data to make better business decisions? It’s smart to make people intelligent, but it’s also necessary to bring business meaning to the data. In short, it’s a better idea to fix the data and educate the people than it is to only educate them.

Data is great but, in some respect, it’s only a mechanism for deriving meaning that is of human significance. Raw uninterpreted data residing in a software system somewhere isn’t very helpful. In fact, it may not be meaningful at all. For data to have value and to actually contribute to strategy, it must first be related to the business context of the enterprise that owns it. In other words, what matters more than data is meaning. “Meaning” in this case is an interpretation of some data that’s relevant or significant to some human project or goal. What matters more than data is what we can use the data to do. So, again, what data means is what matters ultimately, not just that we have the data itself. That’s what we mean when we say knowledge: the primary goal of information technology is to derive actionable insights from mere data, in order to make better decisions.

Let’s consider an example. When scientists at the large pharmas are working to find ways to create new commercial opportunities by repurposing old drugs, they encounter a data meaning problem when searching across dozens of different Laboratory Information Management Systems (LIMS). This is because each one consists of a different (sometimes overlapping, sometimes inconsistent) set of data, schema, and context. Now these are trained professional scientists who are numerate, statistically literate, well-versed in the domain, and so on. The impediment here is not a need for more people who understand mathematics and stats. Rather, the impediment is that the data itself is incompatible as to format, schema, representational meaning, and context. Is “Compound A” in system B the same as “Compound 52” in system C? If the answer is yes, then one set of inferences holds and the experiments are successful. If it’s not, then they aren’t.

But how do these scientists quickly compare what terms, results, compounds, and data mean across these disparate systems? Data management techniques and tools that focus on meaning are an invaluable complement to citizen data scientists because they focus attention on data meaning in context, data quality and prep, and data connectedness. Humans are very good classifiers, particularly when they can consume data in context and in connection with other relevant data. Seeing the “whole picture” turns many otherwise quite nasty problems into relatively simple exercises of human-powered classification. That means that data management and integration techniques matter at least as much as widespread statistical literacy.
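To make the “Compound A” versus “Compound 52” question concrete, here is a minimal Python sketch of one way two systems could be connected at the data layer. It assumes, purely for illustration, that each LIMS stores some shared structure identifier for every compound (an InChIKey would be one real-world candidate). The system names, field names, compound labels, and key values are all invented; real reconciliation across LIMS is usually far messier than an exact key match.

```python
from collections import defaultdict

# Hypothetical exports from two LIMS with different schemas.
# Every name, field, and key below is invented for illustration only.
SYSTEM_B = [
    {"name": "Compound A", "structure_key": "KEY-0001"},
    {"name": "Compound D", "structure_key": "KEY-0002"},
]
SYSTEM_C = [
    {"compound_label": "Compound 52", "struct_id": "key-0001 "},
    {"compound_label": "Compound 77", "struct_id": "KEY-0003"},
]


def normalize(key):
    """Remove formatting differences (case, stray whitespace) between systems."""
    return key.strip().upper()


def link_by_key(sources):
    """Group (system, label) pairs that share a normalized structure key.

    `sources` maps a system name to (records, label_field, key_field),
    so each system keeps its own schema.
    """
    linked = defaultdict(list)
    for system, (records, label_field, key_field) in sources.items():
        for record in records:
            key = normalize(record[key_field])
            linked[key].append((system, record[label_field]))
    return dict(linked)


def same_entity(linked, first, second):
    """True if both (system, label) pairs were grouped under one key."""
    return any(first in group and second in group for group in linked.values())


linked = link_by_key({
    "B": (SYSTEM_B, "name", "structure_key"),
    "C": (SYSTEM_C, "compound_label", "struct_id"),
})

# Is "Compound A" in system B the same as "Compound 52" in system C?
print(same_entity(linked, ("B", "Compound A"), ("C", "Compound 52")))
```

Once records are connected this way, a scientist sees every label for the same compound side by side, which is the “whole picture” that makes the classification step easy. Cases where no shared key exists still need human judgment or richer context.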

Of course, we can and should do both

Most large enterprises are disconnected at the data layer, and most enterprise software systems lack the crucial context that leads to real business meaning. Full value isn’t being realized from data, not because we lack enough data scientists, but because we lack enough connected, unified data for data scientists and others to interpret. What people can do naturally is astounding; often we need ML and AI systems not so much to match their insights as to productionize, routinize, and accelerate them.

Enterprises will run more efficiently and achieve more resilience in the face of threats and challenges when there are more data scientists, but we should not fall into the trap of thinking that more statistical literacy, in and of itself, will solve all of our data problems. To become connected, every enterprise needs to invest not only in more data science literacy but also in modern data management platforms and techniques, particularly the kinds that excel at representing business meaning, like data fabrics and knowledge graphs. In short, enterprises increase the inherent value of their data by understanding what it really means and acting accordingly. Happily, we can and should do both.

About The Author:

Kendall Clark is founder and CEO of Stardog, the leading Enterprise Knowledge Graph (EKG) platform provider. For more information visit www.stardog.com or follow them @StardogHQ.