I’ve been thinking a lot about data, where it comes from, and what it looks like. I can’t help it. I’ve been a data geek for almost 15 years. And I find data beautiful. Not necessarily in its raw form, mind you. Then it’s just messy and more often than not a pain to deal with, especially when it gets really, really big. But when smart, creative people start to clean it up and use it in different ways to find the hidden stories that make sense, it can help us learn things in ways that we never expected. And that can be an exceptional thing.
Take for instance Aaron Siegel’s work visualizing traffic flows in Singapore or identifying demographic trends of individuals who sign White House petitions. I heard Siegel speak at last week’s World Bank “Big Data in Action for Development” event in Washington, DC. Clearly he’s a guy using both sides of his brain to tell very compelling, stunning stories with data. Or for that matter look at what Joshua Blumenstock has been doing to turn mobile data into economic insight on usage behaviors. Perhaps it’s not as visually pretty as Siegel’s stuff, but there’s no doubt his novel collection and use of mobile data can help inform decision making in international markets for the better.
As wonderful as these examples are, they only hint at the potential of world data and what we can do with it. Like many of the examples popping up in the open market, they are the result of structured data that is either already formalized into tables of rows and columns in a database somewhere, manually created by legions of patient souls who plug numbers into a spreadsheet, or the output of well-structured metadata. Structured data is likely to remain the backbone of, and the source of exponential growth in, most data-driven storytelling, especially in the current context of the booming market in the Internet of Things. But there is significantly more to explore in unstructured data, especially as it relates to understanding the meaning behind words. Because words are exactly the form in which a large portion of the world’s data comes.
My colleagues and I recently completed a study on foreign data in which we compared the availability and detail of data in several key countries. During the course of the study we were pleasantly surprised to find an incredible richness of content in multiple data types. But the type and usage of data varied considerably by country and culture. For example, in Nigeria, where fixed internet is notoriously unreliable and mobile usage is growing exponentially, it makes sense to consider approaches that use mobile-based applications to actively or passively acquire data for analysis of consumer behaviors. Likewise, Indonesian citizens use cell phones considerably more than fixed internet services, which are dominated by Indonesian businesses. As the Indonesian cell phone market continues to grow, mobile-based data collection approaches could be very effective across consumer sectors.
But those same approaches will not be sufficient for appropriate analysis of Indonesian business markets, where data is largely text based, whether in the form of articles or other unstructured reports. Nor would mobile-based approaches likely work currently in Azerbaijan, where internet penetration and usage are low, data available for analysis is almost exclusively in text form, and many Azerbaijanis remain suspicious of technology. Even in Nigeria, approaches that focus solely on structured data would miss the rich and free exchanges, and the resulting informative data, that are available on local websites such as Nairaland. What this richness in text argues for is an integration of approaches that tailors insights to the available data and uses appropriate techniques to collect and analyze both structured and unstructured data within the local context.
The fact that data science in the open market currently focuses on exploiting structured data is no big surprise. It is much easier to ingest structured data into tools, and to figure out algorithms that will work, when the contents of rows and columns are well-defined numbers. And in contrast to other data science skill sets, there is a much larger population of talent in statistics and computational techniques to exploit structured data and find interesting patterns. Those who have the combined skill set of computational techniques and the specific business or public sector experience needed to master and apply multiple data science techniques are a significantly rarer breed.
What truly makes this new wave of big data exciting is the potential of bringing together existing computational talent with the disparate, relevant skills that make up the world of expertise in areas that do not easily lend themselves to standardization in columns and rows. For example, how will we extract content, meaning, and context from the written or spoken word in the comments sections of surveys? Can we warn of impending events in different parts of the world by analyzing not just the number of reports but the actual content of local conversations? How does foreign language usage change meaning? Can we get better answers by asking better, context-laden questions?
These are just some of the areas of research into unstructured data that have been the subject of intense scrutiny in data science projects and labs for the last several years. And while some have made it to the open market (techniques related to IBM Watson being perhaps the most well-known), the research and potential are still largely untapped. Why? There are many reasons. Unstructured data is really messy. Words mean different things in relation to each other, even when we speak to each other. So imagine a computer trying to translate that into meaning.
Despite current advances, dealing with text is a really hard problem. And it’s made even more so when you consider that the current machine learning, natural language processing, and other text-related techniques for extracting and analyzing text data have almost solely been researched using the English language. Clearly, while some techniques are ready for the open market, many others require additional and ongoing funding to explore fully.
Bottom line: as a community of data scientists, we have just begun to exploit the multiple types of data that are available, let alone the methods we can use to capture and analyze global data. And the potential of data science does not necessarily mean we must choose one method in lieu of another. What it does mean is that we should be conscious that we are just at the beginning of understanding how we can tell stories with data in the global context. To that end, we should ensure that whatever data science solution is proposed properly recognizes the reality of how users interact with technology and produce data in different parts of the world, as well as the specific needs and business processes that are driving the solution. We should also remain aware that not every approach will work everywhere, and that there is as much to be explored and accomplished with current approaches as there is with emerging ones.