In Big Data, Preparing the Data is Most of the Work

A common misconception about Big Data is that it is a black box: you load data and magically gain insight. This is not the case. As the New York Times article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” describes, loading a big data platform with quality data that has enough structure to deliver value is a lot of work. Data scientists spend a comparatively large amount of time in the data preparation phase of a project. Whether you call it data wrangling, data munging, or data janitor work, the Times article estimates that 50% to 80% of a data scientist's time is spent on data preparation. We agree.

Data Selection

Before you start your project, define what data you need. This seems obvious, but in the world of big data, we hear a lot of people say, “just throw it all in.” If you ingest low-quality data that is not salient to your business objectives, it will add noise to your results.

The noisier the data, the more difficult it will be to see the important trends. You must have a defined strategy for the data sources you need and for the particular subset of that data that is relevant to the questions you want to ask.
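One way to make such a strategy concrete is to write the selection down as data before ingesting anything. The sketch below is only an illustration: the source names, field names, and the `select_fields` helper are all hypothetical, not part of any particular platform.

```python
# Hypothetical selection plan: source name -> fields worth keeping.
# Every name here is an assumption made for illustration.
SELECTION_PLAN = {
    "orders": {"order_id", "customer_id", "order_date", "total"},
    "support_tickets": {"ticket_id", "customer_id", "opened_at", "category"},
}


def select_fields(source, record):
    """Return only the planned fields, or None if the source is out of scope."""
    wanted = SELECTION_PLAN.get(source)
    if wanted is None:
        return None  # source was never selected, so skip it entirely
    return {name: value for name, value in record.items() if name in wanted}


# A planned source keeps only its planned fields.
kept = select_fields(
    "orders",
    {"order_id": 1, "total": 9.99, "browser_user_agent": "Mozilla/5.0"},
)

# An unplanned source is dropped rather than "thrown in".
skipped = select_fields("web_clickstream", {"page": "/home"})
```

Keeping the plan in one place also gives you something to review with stakeholders before any extraction work starts.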

Define the Relationships

In most corporate big data projects, the business challenges demand a data store that combines structured, semi-structured, and unstructured data. You often need to organize a set of unstructured or semi-structured documents from SharePoint or a shared drive against master data contained in a set of structured systems. When importing structured data from multiple systems, the data relationships must be defined. Your big data platform will not magically know that “customer no” in one set of data is the same as “cust_id” in another. You must define the relationships between the data sources.

This is a common challenge for many organizations. As a result, some interesting products are coming to market to help data scientists identify possible common data elements in large data sets.
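As a minimal sketch of what defining such a relationship can look like, the example below renames “customer no” and “cust_id” to one shared name and then joins the two record sets on it. The canonical name `customer_id`, the helper functions, and the sample rows are assumptions made for illustration.

```python
# Hypothetical mapping from each system's column name to one canonical name.
KEY_ALIASES = {
    "customer no": "customer_id",
    "cust_id": "customer_id",
}


def normalize_keys(record):
    """Return a copy of the record with known aliases renamed."""
    return {KEY_ALIASES.get(name, name): value for name, value in record.items()}


def join_on(left_rows, right_rows, key):
    """Inner-join two lists of dicts on a shared key.

    Assumes at most one row per key on the right-hand side.
    """
    right_by_key = {row[key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**match, **row})
    return joined


# Illustrative rows only, not real data.
crm_rows = [{"customer no": "1001", "name": "Acme Corp"}]
billing_rows = [
    {"cust_id": "1001", "balance": 250.0},
    {"cust_id": "1002", "balance": 75.0},
]

crm = [normalize_keys(r) for r in crm_rows]
billing = [normalize_keys(r) for r in billing_rows]
combined = join_on(billing, crm, "customer_id")
print(combined)
```

The billing row for customer 1002 has no match in the CRM rows, so the inner join drops it. In a real project you would want to report such unmatched keys rather than lose them silently.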

Extract and Organize

This is where you will spend the most time, just as we have when working for our clients. Acquiring the data can be a major challenge. If it is public data, is there an API, or do we have to scrape it from the web? If it is corporate data, who can provide extracts and documentation on the data structure? What are the security considerations? Organizing the data involves many steps: translating system-specific codes into meaningful, usable data; mapping common fields consistently so that they can be related; handling incomplete or erroneous data; and replicating application logic to make the data self-describing. The list seems endless. You have to spend a lot of time inspecting the data, querying it, and processing it.
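Two of the organizing steps above, translating system-specific codes and handling incomplete or erroneous values, can be sketched as follows. The status codes, field names, and the `clean_record` helper are hypothetical. Real systems will need their own lookup tables and rules.

```python
# Hypothetical lookup table for one source system's status codes.
STATUS_CODES = {"A": "active", "I": "inactive", "P": "pending"}


def clean_record(raw):
    """Translate codes and parse fields; return (record, problems)."""
    problems = []
    record = dict(raw)

    # Translate the system-specific code into a meaningful value.
    code = (raw.get("status") or "").strip().upper()
    if code in STATUS_CODES:
        record["status"] = STATUS_CODES[code]
    else:
        record["status"] = None
        problems.append(f"unknown status code: {raw.get('status')!r}")

    # Parse the amount; record a problem instead of failing the whole load.
    amount = raw.get("amount")
    try:
        record["amount"] = float(amount)
    except (TypeError, ValueError):
        record["amount"] = None
        problems.append(f"unparseable amount: {amount!r}")

    return record, problems


# Illustrative rows only, not real data.
raw_rows = [
    {"id": "1", "status": "a", "amount": "12.50"},
    {"id": "2", "status": "X", "amount": ""},
]
cleaned = [clean_record(r) for r in raw_rows]
for record, problems in cleaned:
    print(record, problems)
```

Returning the problems alongside each record, rather than raising on the first bad value, makes it easier to see how much of an extract is affected before deciding how to handle it.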

A further difficulty we have experienced during this very long phase is that you have nothing to show your stakeholders. They expect slick demos with glossy visualizations, and they expect them quickly. And you are stuck in the data.

Load the Data

You’ve done it. Finally, the data is ready to load into your big data platform, and the exciting work of analytics and visualization can begin. With clean, organized, structured data, the analytics and visualization phase will progress quickly and deliver real value.


Preparing data for ingestion into a big data platform is a lot of work. There are no shortcuts. If you want to achieve valuable insights via analytics and visualizations, you've got to invest the time to build a high-quality data store. Set expectations carefully with your stakeholders to prepare them for the investment in data preparation. You will be glad you did.


Tags: Big, Data, Preparation, Wrangling



Comment by Sione Palu on January 14, 2015 at 12:59pm

I agree that preparing data is a lot of work, but once a data preparation framework is in place, it stays almost the same across many data sources, current and future, so future projects can just reuse the same process (without many modifications).

IMO (at least for my team), running the model is the longest process. Even a small dataset can take a whole week to run in Matlab using the parallel toolbox deployed on Amazon Web Services. By small, I mean something like 1 MB: say, 100 time series covering 1 year (4 inputs & 1 output - MISO). Running a linear ARMAX model on this small dataset could take up to 7 days (with the Matlab parallel toolbox), because it does optimal parameter searches first, followed by prediction/validation.
