If you’re planning on a data-driven approach, you’re going to need data, most likely from a mix of sources that together give you enough insight to make the effort worthwhile.
But if you’re not careful, you can quickly spend a lot of money getting it, whether it’s “free” or not. Because as much as we’d like it to be, most data is definitely not free, even if it sits on an open and public site.
Where can these costs come from? How can they escalate? Here are nine of the most common ways that data can quickly overwhelm your budget if you're not careful.
1. How many geographic regions do you want to cover? How local do you need to go?
Every dataset has a specific geographic range that it is supposed to cover. In practice, no dataset is truly global in coverage, and the level of available coverage for different geographic zones can be highly uneven depending on the type of data you want. For instance, if you’re planning to use yearly data on standard economic indicators for different countries around the world, you’ll have no problem finding free public data. Just go to the World Bank Data website. They’ve got plenty. But if you need more localized or timely data from far-off places to make your project work – say, daily data on how many plum tomatoes people are buying or selling, and at what prices, in the local markets of Naples (Italy), Timbuktu (Mali), Yerevan (Armenia), or Austin (Texas) – you should be prepared to deploy more resources.
2. What kind of level of detail or content granularity do you want the data to have?
Do you want a cursory, macro-level view of your data world? Or does your project need really detailed data to get insight into your customers? Different levels of granularity call for different techniques. For example, statistics or metadata on aggregate mobile phone usage in a particular area might be readily available at relatively low cost. But there are resource implications for most mobile-based, data-driven solutions, not the least of which can involve developing agreements with local mobile carriers on how data will be used, resolving privacy concerns of individual phone users, and considering how per-carrier phone or text message costs will affect your ability to collect data. Similar issues exist with text-based approaches. It’s easy to find data that shows how many times customers have mentioned being “angry” near where the name of your product shows up in social media. But getting more nuanced and detailed data on what the authors of those posts were actually angry about when they mentioned your organization will take more resources to implement, at least in the short term.
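To make the granularity gap concrete, the cheap end of the text example above can be sketched as a crude keyword-proximity count. This is an illustrative sketch only – the product name, posts, and window size are all made up – and it deliberately shows what the cheap approach *doesn’t* tell you: why the author was angry.

```python
import re

def angry_mentions(posts, product, sentiment="angry", window=5):
    """Count posts where `sentiment` appears within `window` words of
    `product`. A crude proximity heuristic -- it says nothing about
    *what* the author was actually angry about."""
    count = 0
    for post in posts:
        words = [w.lower() for w in re.findall(r"\w+", post)]
        prod_idx = [i for i, w in enumerate(words) if w == product.lower()]
        sent_idx = [i for i, w in enumerate(words) if w == sentiment.lower()]
        if any(abs(p - s) <= window for p in prod_idx for s in sent_idx):
            count += 1
    return count

# Hypothetical posts about a hypothetical "WidgetPro" product
posts = [
    "So angry that my WidgetPro broke again",
    "Loving the new WidgetPro update",
]
print(angry_mentions(posts, "WidgetPro"))  # 1
```

Going beyond a count like this – to the meaning behind the anger – is where the extra resources come in.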
3. How many sources or types of data do you want to use?
Even when dealing with one source, the volume of the data and the velocity at which it travels can quickly overwhelm your collection process. When you start adding data sources, even if the type of data looks the same, processing issues can escalate even faster. Common tasks such as merging records across different databases, removing duplicate records, and reconciling content become far more complicated when data gets really big and really fast. And when bringing in different data types – for example, matching data from your database with specific tweets in a Twitter stream or with the specific reporting of an outbreak event in the news – you have to make sure that your processing systems can handle that additional complexity. (Dan Hirpara has a great blog post about this on our website.)
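Even the “simple” version of merging and de-duplicating across sources has sharp edges. Here’s a minimal sketch, with made-up record fields, that merges two sources on a key field; a real pipeline would also need fuzzy matching, conflict-resolution rules, and the scale to do this fast.

```python
def merge_records(*sources, key="email"):
    """Merge record lists from several sources, de-duplicating on `key`.
    Later sources fill in fields missing from earlier ones. Illustrative
    only: real merges need fuzzy matching and explicit conflict rules."""
    merged = {}
    for source in sources:
        for rec in source:
            k = rec[key].strip().lower()          # naive key normalization
            existing = merged.setdefault(k, {})
            for field, value in rec.items():
                existing.setdefault(field, value)  # first value wins
    return list(merged.values())

# Two hypothetical sources describing the same person slightly differently
crm   = [{"email": "Ana@x.com", "name": "Ana"}]
sales = [{"email": "ana@x.com", "name": "Ana Diaz", "region": "EU"}]
print(merge_records(crm, sales))
# [{'email': 'Ana@x.com', 'name': 'Ana', 'region': 'EU'}]
```

Note that even in this toy case a policy decision was baked in (“first value wins” for conflicting names); at scale, those decisions multiply.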
4. What kinds of collection processes are you going to have to use to get your data into your system?
There is data, and then there is data. Depending on the level of detail, the cleanliness of the data, and the depth of insight you want, costs will vary. Purchasing data is one option that may have lower processing costs. (Maybe.) But given how much is increasingly out there as data, there are not as many data vendors as you might think, especially for unstructured data. (Yet.) Regardless, the cost of collection will depend on how deep into the data you want to go, how much detail you want to pull out, and having the right team for the job. And no matter where you get it from, there will likely be licensing costs, intellectual property issues, and privacy and security issues to consider. Bottom line: whether you decide to buy access to a database or a simple feed, set up an API, scrape data, or build your own data collection network, you should be prepared to budget for capturing the data you need as a recurring cost.
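Treating capture as a recurring cost is easier if you model it, even roughly. Here’s a back-of-the-envelope sketch; every rate and figure below is a placeholder, not real vendor pricing.

```python
def monthly_capture_cost(api_calls, price_per_1k_calls,
                         gb_stored, price_per_gb,
                         license_fee=0.0):
    """Rough recurring cost of one data feed per month.
    All rates are placeholders -- plug in your vendor's actual pricing,
    and remember this omits staff time, legal review, and egress fees."""
    return (api_calls / 1000) * price_per_1k_calls \
        + gb_stored * price_per_gb \
        + license_fee

# Hypothetical feed: 2M calls/month at $0.50 per 1k calls,
# 300 GB stored at $0.02/GB, plus a $500 monthly license
print(monthly_capture_cost(2_000_000, 0.50, 300, 0.02, 500.0))  # 1506.0
```

Run a model like this per source, per month, and “free” data often stops looking free.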
5. How far back do you want the data to go? How much data do you want or need to keep over time?
Interestingly, this is one that many overlook. It gets to the question of whether keeping any, part, or all of your data over time is necessary to your business processes. Given storage and processing limitations, most technical processes will be designed to minimize the amount of data that is stored. For a lot of machine sensor or Internet of Things (IoT) data, where the ability to recall data beyond the last 24 hours may not matter to the overall process, that’s fine. But for many others, data provenance, record keeping, and the ability to reference past data points are critically important. For example, think of how police departments around the world might need to store evidence from video data from police cars or vests, or audio and video data captured as part of different Smart City initiatives. Or how industries involved in legal proceedings, forensic or historical analysis, or medical research and longitudinal studies might need to keep revisiting different sections of data over time. Or even Twitter or news feeds used as part of analytical research. Because if it’s important to know what happened over significant chunks of time, you’re going to have to create a process to capture and store that data and make it accessible to those users who need it.
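One common way to control this cost is a tiered retention policy: recent data stays online, older data moves to cheap archive storage, and the rest is deleted. A minimal sketch, assuming made-up 30-day and 365-day tiers – and note that evidence, medical, or legal-hold data may forbid deletion entirely:

```python
from datetime import datetime, timedelta

def apply_retention(records, now, hot_days=30, cold_days=365):
    """Split records into keep-online / archive / delete buckets by age.
    The 30/365-day tiers are illustrative only; data under legal hold
    must never land in the `expired` bucket."""
    hot, cold, expired = [], [], []
    for rec in records:
        age = now - rec["timestamp"]
        if age <= timedelta(days=hot_days):
            hot.append(rec)        # keep online, queryable
        elif age <= timedelta(days=cold_days):
            cold.append(rec)       # move to cheap archive storage
        else:
            expired.append(rec)    # candidate for deletion
    return hot, cold, expired

now = datetime(2024, 6, 1)
records = [
    {"id": 1, "timestamp": datetime(2024, 5, 20)},  # 12 days old
    {"id": 2, "timestamp": datetime(2023, 12, 1)},  # ~6 months old
    {"id": 3, "timestamp": datetime(2021, 1, 1)},   # years old
]
hot, cold, expired = apply_retention(records, now)
print(len(hot), len(cold), len(expired))  # 1 1 1
```

The hard part isn’t the code; it’s deciding, per dataset and per user group, which bucket is acceptable.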
6. How much effort is it going to take to process the data?
Say you have collected a really cool data set, one that has tons of content that you think will give you great insight. Once you get that raw data into your system, you have to prepare it so it will work with your analytic tools. And that could be a lot of work. If you want to include it in your process on an ongoing basis, you will need to consider how well structured the data is and how much effort it will take to get the data ready for use. Because even if it’s clear to a human analyst that a data source can provide incredible content, there may not be an easy or efficient way to capture and structure that data to make it usable. The data may be so messy that cleaning it would basically be akin to building the dataset from scratch. Or the data that you need may be so nuanced that the techniques to get at the content are still very new – extremely promising, but only tested in laboratory or academic settings. All to say, some datasets are going to be more expensive to get ready for use, and sometimes they will be the ones that give you the deepest insight.
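Even the mundane end of this work adds up. A minimal sketch of per-row cleaning, with field names and formats invented for illustration – notice that even a toy cleaner has to decide when a row is simply unrecoverable:

```python
def clean_row(raw):
    """Normalize one messy raw record into an analysis-ready dict.
    Returns None when the row is unrecoverable. Field names and
    formats here are made up for illustration."""
    price = raw.get("price", "").replace("$", "").replace(",", "").strip()
    name = raw.get("name", "").strip().title()
    if not name or not price:
        return None                      # required field missing
    try:
        return {"name": name, "price": float(price)}
    except ValueError:
        return None                      # e.g. "n/a" -- unparseable

raw_rows = [
    {"name": "  plum tomatoes ", "price": "$1,250.00"},
    {"name": "", "price": "3.50"},        # missing name -> dropped
    {"name": "basil", "price": "n/a"},    # unparseable -> dropped
]
cleaned = [r for r in (clean_row(row) for row in raw_rows) if r]
print(cleaned)  # [{'name': 'Plum Tomatoes', 'price': 1250.0}]
```

Multiply those per-field decisions across dozens of fields and millions of rows, and “cleaning” starts to look like rebuilding the dataset.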
7. How frequently will you need to process the data?
It used to be that we could handle most of our data as data at rest – stuff we stored in databases or spreadsheets. Now data is coming at us faster than ever before, and more frequently as part of a stream. And the more we see the value of tapping into data in motion, the more we should prepare for it to tax our resources. When you’re weighing what data to make usable, stream temporarily, or store, you’re going to have to decide which data sets at rest need updating more or less frequently, and what in the stream of data in motion is important to capture or make accessible, and when. (I mean, do you REALLY need to keep every tweet, every weather sensor reading, or every shot from that camera in the alley for the next three years?) Because capturing every piece of data in the stream just doesn’t make sense.
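That capture-or-discard decision can be made explicit as a per-event triage rule. A sketch under invented assumptions (the keywords, the 24-hour window, and the event format are all illustrative):

```python
def triage(event, keywords=("flood", "outage"), max_age_hours=24):
    """Decide what to do with one event from a stream: 'store' the raw
    event, 'aggregate' it into summary counters, or 'drop' it.
    The keywords and age cutoff are placeholders for real business rules."""
    if any(k in event["text"].lower() for k in keywords):
        return "store"                    # high-value: keep the raw event
    if event["age_hours"] <= max_age_hours:
        return "aggregate"                # recent: keep a summary only
    return "drop"                         # stale and low-value: discard

events = [
    {"text": "Power outage downtown", "age_hours": 2},
    {"text": "Nice weather today", "age_hours": 3},
    {"text": "Nice weather today", "age_hours": 80},
]
print([triage(e) for e in events])  # ['store', 'aggregate', 'drop']
```

The point isn’t the three-line rule; it’s that *someone* has to write the rule, and every stream you add needs one.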
8. How are the quantity and magnitude of the data going to affect your bandwidth and the flow through your pipelines?
Big data is big, and it takes up a lot of space – including the space your system is going to need to process data and keep it accessible to users. It’s probably not going to be an issue if you’ve got one or two data sources (assuming they’re not massive on their own, of course). But if you’re really looking at some seriously big data (most of what’s out there) and multiple sources, you’re going to need to think about how bringing that data in is going to affect the ebbs and flows of your current data ingest pipelines and loads. In other words, you’re going to need to make sure that you have the capacity to manage the data you bring in without slowing down every other process you need to keep running on a regular basis. And if your plan is to bring in new and bigger data on a rolling basis, better to plan in advance how you will scale your capacity as you go.
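A rough headroom check before onboarding new sources can flag the problem early. All figures below are placeholders; real capacity planning would also account for peak (not just average) load:

```python
def ingest_headroom(current_gb_per_day, new_sources_gb_per_day,
                    pipeline_capacity_gb_per_day):
    """Check whether planned new feeds fit within daily pipeline capacity.
    Returns (fits, utilization). Averages only -- a real check would
    model peak load and burst behavior, not just daily totals."""
    total = current_gb_per_day + sum(new_sources_gb_per_day)
    return (total <= pipeline_capacity_gb_per_day,
            total / pipeline_capacity_gb_per_day)

# Hypothetical: 400 GB/day today, two new feeds at 150 and 300 GB/day,
# against a 1000 GB/day pipeline
fits, util = ingest_headroom(400, [150, 300], 1000)
print(fits, util)  # True 0.85
```

An 85% average utilization “fits”, but leaves little room for bursts – which is exactly the rolling-capacity planning point above.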
9. How frequently will your users want to access the data, and how far back will they want to go? And are your users really going to want this data, or the insight you assume you can get from it?
It really helps to know what data your users will want to see and interact with before you spend the money to transform it. But with most data driven approaches, deciding on which data is really going to provide insight is a bit of a trial and error process. Part of that process can be mitigated by involving users early on, whether by asking for data recommendations, beta testing samples of potential outputs, or collecting user interaction data on the back end to determine which data sets are being used over time and why. Ultimately though, for most processes, it is the user who will determine whether the data you’re choosing gives them the insight they need. And knowing what data matters is a powerful tool that can help you determine what data to focus your resources on.
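Collecting that back-end interaction data doesn’t have to be elaborate. A minimal sketch, assuming a made-up access-log format, that ranks datasets by how often users actually query them:

```python
from collections import Counter

def dataset_usage(access_log):
    """Rank datasets by how often users actually query them, from a
    back-end access log. The log format here is assumed for
    illustration; a real log would carry timestamps, queries, etc."""
    return Counter(entry["dataset"] for entry in access_log).most_common()

# Hypothetical access log entries
log = [
    {"user": "a", "dataset": "sales"},
    {"user": "b", "dataset": "sales"},
    {"user": "a", "dataset": "weather"},
]
print(dataset_usage(log))  # [('sales', 2), ('weather', 1)]
```

Even a tally this simple, kept over months, tells you which datasets are earning their storage and processing costs.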
So there you have it. Key things that can rapidly escalate costs in any data-driven approach. Hopefully knowing this in advance will put you in a better position to figure out whether any new data you’re considering is worth the overall investment in your data-driven approach, hidden costs and all.
Because avoiding costly mistakes is always a good thing.