A common misconception about Big Data is that it is a black box: you load data and magically gain insight. This is not the case. As the New York Times article “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights” describes, loading a big data platform with quality data that has enough structure to deliver value is a lot of work. Data scientists spend a comparatively large amount of time in the data preparation phase of a project. Whether you call it data wrangling, data munging, or data janitor work, the Times article estimates that 50% to 80% of a data scientist's time is spent on data preparation. We agree.
Data Selection
The noisier the data, the harder it is to see the important trends. You need a defined strategy for which data sources you will use and which subset of that data is relevant to the questions you want to ask.
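As a minimal illustration, here is what that kind of selection can look like in pandas; the file name, column names, and filters below are all hypothetical:

```python
import pandas as pd

# Load the raw extract (file and column names are hypothetical).
raw = pd.read_csv("support_tickets.csv", parse_dates=["created_at"])

# Keep only the sources, time window, and columns relevant to the question,
# e.g. "How has ticket volume by product changed over the past year?"
relevant = raw.loc[
    (raw["created_at"] >= "2014-01-01")
    & raw["channel"].isin(["email", "web"])   # drop noisy phone transcripts
    & raw["product"].notna(),                 # rows with no product tell us nothing
    ["created_at", "channel", "product", "status"],
]
```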
Define the Relationships
This is a common challenge that many organizations face. As a result, some interesting products are coming to market to help data scientists identify possible common data elements across large data sets, as described here.
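The basic idea behind such tools can be sketched with a crude heuristic: compare the distinct values in every pair of columns across two data sets and flag pairs with heavy overlap as candidate join keys. This is only a toy illustration of the concept, not how any particular product works:

```python
import pandas as pd

def candidate_join_keys(left: pd.DataFrame, right: pd.DataFrame,
                        min_overlap: float = 0.5):
    """Suggest column pairs whose values overlap enough to be join keys.

    A crude heuristic: report pairs where at least `min_overlap` of the
    left column's distinct values also appear in the right column.
    """
    candidates = []
    for lcol in left.columns:
        lvals = set(left[lcol].dropna().astype(str))
        if not lvals:
            continue
        for rcol in right.columns:
            rvals = set(right[rcol].dropna().astype(str))
            overlap = len(lvals & rvals) / len(lvals)
            if overlap >= min_overlap:
                candidates.append((lcol, rcol, round(overlap, 2)))
    # Strongest matches first.
    return sorted(candidates, key=lambda c: -c[2])
```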
Extract and Organize
This is where you will spend the most time, and it is where we have spent the most time working for our clients. Acquiring the data can be a major challenge. If it is public data, is there an API, or do we have to scrape it from the web? If it is corporate data, who can provide extracts and documentation on the data structure? What are the security considerations? Organizing the data involves many steps: translating system-specific codes into meaningful, usable data; mapping common fields consistently so they can be related; handling incomplete or erroneous data; and replicating application logic to make the data self-describing. The list seems endless. You have to spend a lot of time inspecting the data, querying the data, and processing it.
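To make a few of those steps concrete, here is a minimal pandas sketch of a typical organize pass; the code table and column names are hypothetical:

```python
import pandas as pd

# Hypothetical code table from the source system's documentation.
STATUS_CODES = {"01": "open", "02": "pending", "03": "closed"}

def organize(extract: pd.DataFrame) -> pd.DataFrame:
    df = extract.copy()
    # Translate system-specific codes into meaningful values.
    df["status"] = df["status_cd"].map(STATUS_CODES)
    # Map source-specific field names onto the common schema so this
    # source can be related to the others.
    df = df.rename(columns={"cust_no": "customer_id", "crt_dt": "created_at"})
    # Handle incomplete or erroneous data rather than letting it
    # silently skew the analytics downstream.
    df["created_at"] = pd.to_datetime(df["created_at"], errors="coerce")
    df = df.dropna(subset=["customer_id", "created_at"])
    return df
```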
A further difficulty we have experienced during this very long phase is that you have nothing to show your stakeholders. They expect slick demos with glossy visualizations, and they expect them quickly, while you are still stuck in the data.
Load the Data
You’ve done it. Finally, the data is ready to load into your big data platform, and the exciting work of analytics and visualization can begin. With clean, organized, structured data, the analytics and visualization phase will progress quickly and deliver real value.
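The load itself is usually the easy part by comparison. A minimal sketch, assuming a SQL-backed platform; the connection string, file, and table names are hypothetical:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; substitute your platform's.
engine = create_engine("postgresql://analyst@warehouse:5432/dsc")

# Output of the preparation phase: clean, organized, structured data.
clean = pd.read_parquet("support_tickets_clean.parquet")

# Bulk-load the table so the analytics and visualization work can begin.
clean.to_sql("support_tickets", engine, if_exists="replace", index=False)
```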
Conclusion
Preparing data for ingestion into a big data platform is a lot of work, and there are no shortcuts. If you want to achieve valuable insights via analytics and visualizations, you have to invest the time to build a high-quality data store. Set expectations carefully with your stakeholders to prepare them for the investment in data preparation. You will be glad you did.
Comment
I agree that preparing data is a lot of work, but once a data preparation framework is in place, it stays largely the same across current and future data sources, so you can reuse the same process in future projects with few modifications.
IMO (at least for my team), running the model is the longest process. Even a small dataset can take a whole week to run in Matlab with the parallel toolbox deployed on Amazon Web Services. By small I mean something like 1 MB: say 100 time series covering one year, with 4 inputs and 1 output (MISO). Running a linear ARMAX model on this dataset could take up to 7 days, because it performs an optimal parameter search first, followed by prediction and validation.
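For reference, an equivalent ARMAX fit can be sketched in Python with statsmodels; the data below is synthetic and the (2, 0, 1) order is an arbitrary example, not the commenter's setup:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Synthetic MISO example: 4 exogenous inputs, 1 output, one year of daily data.
rng = np.random.default_rng(0)
idx = pd.date_range("2014-01-01", periods=365, freq="D")
exog = pd.DataFrame(rng.normal(size=(len(idx), 4)),
                    index=idx, columns=["u1", "u2", "u3", "u4"])
endog = exog.sum(axis=1) + rng.normal(scale=0.5, size=len(idx))

# Hold out the last 30 days for validation.
model = SARIMAX(endog.iloc[:-30], exog=exog.iloc[:-30], order=(2, 0, 1))
result = model.fit(disp=False)
forecast = result.forecast(steps=30, exog=exog.iloc[-30:])

# The week-long runtimes described above come from repeating fits like this
# across many series and a grid of candidate (p, q) orders.
```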