At this point, I suspect a lot of us have heard of the three, four, or even seven V’s of big data. The original three V’s – Volume, Velocity, and Variety – appeared in 2001 when analyst Doug Laney (then at META Group, which Gartner later acquired) used them to help identify key dimensions of big data. IBM and others added Veracity. Then Viability, Value, Variability, and even Visualization got included. They definitely all matter, particularly as we consider designing and implementing processes to turn raw data into “ready to use” information streams.
But there seems to be something missing. It’s as if these V’s were developed to explain things before we fully explored what data is available, where it can come from, why it is so darned big, and why it is going to get a heck of a lot bigger and impact our lives in ways we can’t even imagine yet. Or maybe it’s just that I need more clarity on the Volume or Variety part. Or maybe all seven V’s.
Bottom line: I think we’re underestimating the impact of big data. It’s inevitable at this juncture. I know some enterprising data geeks have tried to quantify it. Take this graphic for instance. Or these projections. And there are plenty of others, all pointing to the inevitable truth that big data is really, really big.
How big? No one really knows. Or can. At least right now. Why? Because it’s not just the number of potential sources that is growing exponentially. The methods for exploiting those sources and enabling insight are growing too. And then there’s the way that the same data source might be used differently based on the insight you’re trying to glean. It’s daunting.
Perhaps the closest we can get to any realistic quantification or definition right now is to first consider the vastness of each of the five data characteristics below:
- Data Type – It’s not just about structured or unstructured data. It’s about understanding what the data really looks like and how you’re going to use it. For instance, think about how much economic data already gets formatted into columns and rows in databases around the world, and what more could come online with new techniques. Then there’s the metadata associated with news reports, blog posts, social media entries, or SMS text messages. And then there’s the meaning, content, and context from each of those. But don’t forget all of that manufacturing or survey data we’ve been collecting all of these years that we can now collect with new tools. Or weather, audio, or visual data coming from satellites, drones, or other sensors. Or geo-located data from multiple sources. How about images, streaming videos, points of interest on a map? And we haven’t even begun to discuss all of the data that we’re going to start getting from individual interactions with mobile devices as they become the predominant method of digital interaction. Or all of the data we’re starting to produce from all of those Internet of Things (IoT) devices in our households, or in our cars, or from our heart rate monitors and other medical devices. Or the potential of digital badges. And these are just a few.
- Data Detail – Even if we’re working with just one or two data sources, think about how different system designs will process different levels of detail for each stream and in turn provide different levels of insight. Take Twitter data for example. Some processes will create streams out of counts of Twitter posts about different topics in a given geographical area and time frame to discover emerging trends. Others will take that same Twitter feed and try to analyze other aspects of the metadata, like who is posting, how many times a post gets retweeted, etc. Yet others will work with the feed content and see what meaning can be inferred or sentiment captured. And still others will work to capture and analyze the links to content that users forward in their Twitter feeds. So from one data source, Twitter, there are at least four different ways to process the data and create new data streams, each of which is going to produce a different outcome and different level of insight. Take that and multiply it by the number of all the potential data sources, each of which can likely be broken down to different levels of relevant detail and each of which would use different tools or combinations thereof to enable insight.
- Data Periodicity and Timeliness – This really goes to how we’re going to use data and over what period of time, because ultimately any process will need to fit the end goal or expectations for informational outputs. For example, will the approach only collect data for historical or longitudinal analysis? Or will data be ingested in real time, opening up new avenues for aggregation within increasingly granular chunks of time and space? Will data be used to analyze social networks or current behaviors? Will it produce forecasts or warnings on world events? Each of these processes, and others, will produce its own informational streams. Each might in turn produce data points that capture historical patterns or current trends or be used to forecast future behaviors or events. For instance, if we produce a forecast today for an event that is supposed to occur three months from now, how could we use that data a week or three months from now to develop new information streams on current vs. future event behaviors? How will our forecasts change if we use data at rest vs. data in motion? And how will these streams change in the context of data decay or data drift, or other changes we have yet to capture or understand?
- Topic / Sector Focus – Think about how many topics or sectors you know of that haven’t even begun to be tapped for their data-driven potential. Or that have barely entered the fray. Manufacturing and financial markets have been big into it for years. So have a lot of technical and other sectors where sensor data dominates. But there are plenty of others that are still trying to figure it all out, especially those that depend on semi-structured or unstructured data for insight. But as the number of tools and methods coming online to capture and analyze data grows, the number of sectors seeking to use different techniques combining structured and unstructured data will also grow. For instance, businesses in the agricultural sector that currently use climate and crop data from satellites or drones might find additional ways to leverage social media, price data from financial markets, or the capacity to monitor emerging news reports on changes in supply and demand factors. Health-related industries may increasingly leverage stand-off, mobile-based sensors to collect or analyze bio-medical data on different population segments, augment demographic data, or develop methods to monitor and evaluate program impacts on patients. Those in the energy sector might augment utility usage metrics, system performance data, and other data related to how utilities run with additional, emerging IoT technologies and local news reports. And capturing and analyzing economic data might focus on exploring the burgeoning digital economy’s online and mobile transactions to provide insight. Each combination could provide a new source of data or in turn be combined with data coming from other sectors to inform us of things we didn’t even know could matter.
- Geographic/Linguistic Focus – The world is getting more digitally connected. And as it does, so grows the potential to capture and analyze data in regions and languages that we couldn’t have imagined before mobile devices became so commonplace. So far, we’ve been quantifying most of our data on the basis of what is available in the Western world, and somewhat in Asian markets. Now, even in the remotest villages in the remotest parts of the world, people increasingly have access to and produce data of their own. And as they come increasingly online, we will be able to collect and analyze more and more data not just at the national level, but at the regional and local village level, even from afar. How will that impact what we understand about the world and how we can help? How will our access to data change what we know or can do about regional or hyperlocal issues around the world, and what new opportunities will it create?
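To make the Data Detail point a bit more concrete, here is a minimal Python sketch of how one feed can be turned into the four kinds of streams described in that bullet: topic counts by place and time, metadata on who posts and how often posts get retweeted, a rough read on sentiment, and the links people share. Everything in it is an assumption for illustration only. The `Post` record, its field names, the sample posts, and the tiny word lists are all made up, and none of it reflects the real Twitter data format or any particular sentiment method.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Post:
    """A hypothetical, simplified post record (not the real Twitter schema)."""
    user: str
    region: str
    hour: int
    topic: str
    text: str
    retweets: int = 0
    links: list = field(default_factory=list)


# Tiny made-up sample, for illustration only.
POSTS = [
    Post("ana", "Nairobi", 9, "maize", "prices look good today", 3,
         ["http://example.com/a"]),
    Post("ben", "Nairobi", 9, "maize", "bad harvest expected", 10),
    Post("ana", "Lagos", 10, "fuel", "good news on supply", 1,
         ["http://example.com/a"]),
]

# Hypothetical one-word lexicons; a real system would use something far richer.
POSITIVE = {"good"}
NEGATIVE = {"bad"}


def topic_counts(posts):
    """Stream 1: number of posts per (region, hour, topic)."""
    return Counter((p.region, p.hour, p.topic) for p in posts)


def metadata_summary(posts):
    """Stream 2: per-user post count and total retweets."""
    summary = {}
    for p in posts:
        entry = summary.setdefault(p.user, {"posts": 0, "retweets": 0})
        entry["posts"] += 1
        entry["retweets"] += p.retweets
    return summary


def naive_sentiment(posts):
    """Stream 3: crude score per post (positive words minus negative words)."""
    scores = []
    for p in posts:
        words = p.text.lower().split()
        positive = sum(w in POSITIVE for w in words)
        negative = sum(w in NEGATIVE for w in words)
        scores.append(positive - negative)
    return scores


def shared_links(posts):
    """Stream 4: how many times each link was shared."""
    return Counter(link for p in posts for link in p.links)


if __name__ == "__main__":
    print(topic_counts(POSTS))
    print(metadata_summary(POSTS))
    print(naive_sentiment(POSTS))
    print(shared_links(POSTS))
```

Same three posts, four different streams, and each one answers a different question. Swap in a different source or a different level of detail and the set of possible streams changes again.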
Thinking about the vast amounts of data and the potential solutions in any one of these categories individually is heady stuff. But now think about them in combination with each other. And then think about the exponential growth that is likely to occur from there.
(BOOM! Mind blown. At least mine is.)
So now what? It’s about putting this data to work and finding the streams and tools that can extract insight relevant to different business processes and use cases. I contend that every industry and every sector, whether public, private, or non-profit, can and indeed should take advantage of data-driven solutions now. But designing and implementing any data-driven system has to start with an understanding of what you want to do and what is possible with current and future tools. There is no one-size-fits-all approach to data-driven solutions, and every strategy should take into consideration the resources, costs, and benefits of different approaches.
Because ultimately, if all of this data can’t help in the decision-making process, it’s just more noise.
Originally posted on www.worlddatainsights.com.