
Summary:  If you’re still writing code to clean and prep your data, you’re missing big opportunities for efficiency and consistency offered by modern data prep platforms.

 

Two things are true. 

  • Data prep still occupies about 80% of our model-building time – the least enjoyable part.
  • If you’re still writing this code from the ground up, you’re overlooking big chances for efficiency and consistency.

Aside from de novo code, there are at least four categories of data prep platforms.

Fully Integrated:  Many advanced analytic platforms have sophisticated data prep built in.  Alteryx, for example, started as a data prep platform and gradually added analytic tools to become a full-service offering.

Stand Alone:  There is still a wide variety of standalone data prep platforms.  This is a marketing choice that seems to be aimed at classic data analysts and citizen data scientists as well as data scientists, with an emphasis on self-service.  It also reaches users who run several analytic platforms but want to standardize data prep, and it can be an IT-driven choice to offload some ETL work to self-service.  Trifacta and Datameer are examples of standalones.

Building Blocks:  Some of the larger and more mature analytic providers like SAS and Oracle provide a variety of modules that can be used independently but are more frequently linked together into a fully integrated ensemble.

Automated Machine Learning (AML):  By recent count there are now more than 20 providers claiming fully automated data-to-model machine learning.  Tazi.ai is one example of a fully featured AML platform that automates even feature engineering and feature selection.  Industries like insurance that produce and manage hundreds or even thousands of models have begun to embrace AML.

 

Data Prep Platforms Are Important Beyond Advanced Analytics

While data prep is central to the modeling activities of data scientists, many of these data prep tools seek to reach a broader self-service user. 

Although our focus here is on advanced predictive analytics, there is still a lot of value and activity in the broader analyst job.  Whether the purpose is to build a sophisticated predictive model, to feed historical data into a data viz platform to eyeball trends, or to support any other data-driven initiative, all these uses depend on fast, flexible access to current data.  Less and less does that mean going to IT for an ETL extract; increasingly it means using a self-service tool like these.

 

How Do They Rank?

I was motivated to write about data prep platforms now because of a particular opportunity.  It happens that the reviews from three major research organizations are available at the same time.  These are the Ovum Research 2018 Self Service Data Prep report, The Forrester Wave Q1 2017 Data Prep Tools report, and the Gartner Peer Insights Review for Data Prep Tools.  All three can be accessed from the Trifacta home page.

Although there are some differences in who was included and the methodologies used for ranking, here are the three main charts.

[Charts: the Ovum, Forrester, and Gartner rankings appear here in the original post.]
A note on the Gartner chart:  Gartner peer rankings are based on surveys sent to users, not on independent Gartner evaluations.  I’ve arbitrarily cut the chart off at platforms that received at least 10 reviews.  As all you data-smart readers will recognize, comparing the numerical scores of platforms with 300 reviews against those with only 11 or 12 is problematic at best.

 

What’s Included – What’s Different

What you would expect to find as core capabilities turns out to be pretty equal among the alternatives.  That includes the ability to blend data sources, clean missing or miscoded data, and perform basic transforms.

It also includes the code-free ability to handle structured, semi-structured, and unstructured data.
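For readers still doing this by hand, a minimal sketch of the blending and cleaning work these platforms automate might look like the following pandas snippet.  The data sources, column names, and the -1 sentinel value are all hypothetical, chosen purely for illustration:

```python
import numpy as np
import pandas as pd

# Two toy "sources" standing in for systems to be blended.
orders = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "amount": [100.0, np.nan, 250.0, -1.0],  # NaN = missing, -1 = miscoded
})
regions = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "region": ["east", "west", "west", "east"],
})

# Blend the two sources on a shared key.
df = orders.merge(regions, on="customer_id", how="left")

# Recode the miscoded sentinel value as missing, then impute
# missing amounts with the median of the valid values.
df["amount"] = df["amount"].replace(-1.0, np.nan)
df["amount"] = df["amount"].fillna(df["amount"].median())
```

Even this trivial example shows why hand-coding every source and every cleaning rule adds up quickly across projects.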

Where this gets a little gray is in the border between data prep and modeling.  The ability to perform transforms (e.g. normalize badly skewed distributions) or to create new features (e.g. the difference between dates or ratios between features) is sometimes excluded or manual, and sometimes augmented by ML suggestions to the user about the next steps they might take.
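To make the gray area concrete, here is a hedged sketch of those borderline transforms in pandas: a log transform to normalize a skewed distribution, a date-difference feature, and a ratio feature.  The column names are hypothetical, and this is the sort of step some platforms leave manual while others suggest via ML:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "signup": pd.to_datetime(["2018-01-01", "2018-03-15"]),
    "first_purchase": pd.to_datetime(["2018-01-10", "2018-04-01"]),
    "revenue": [120.0, 80.0],
    "visits": [30, 16],
})

# Normalize a right-skewed distribution with a log transform
# (log1p handles zero values gracefully).
df["log_revenue"] = np.log1p(df["revenue"])

# New feature: the difference between two dates, in days.
df["days_to_purchase"] = (df["first_purchase"] - df["signup"]).dt.days

# New feature: the ratio between two existing features.
df["revenue_per_visit"] = df["revenue"] / df["visits"]
```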

The real differentiators are a little more subtle.

 

Perils of Self Service and Other Differentiators

Without the single source of truth historically provided by EDWs in BI, users are left somewhat to their own devices and can easily go astray.  One differentiator is the inclusion of best-in-class data catalogues and data dictionaries.

A second issue is governance and permissions:  how do you control access to sensitive information, whether PII or confidential internal company data?  An additional differentiator is the robustness of these governance, administration, and control features.

Ovum identifies three ‘battleground’ characteristics among competitors.  The first, above, is data governance.

The second is the ability of users to collaborate with one another in building the database.  This varies fairly widely.

The third and perhaps most interesting is the manner in which the platform uses ML to suggest actions to the user.  These might be sources for enrichment or more granular guidance about missing values, transforms, or other data features that the ML has determined would likely enhance the data for analysis.

From a purely data science perspective, though, data prep platforms offer teams of data scientists a variety of benefits: speed to insight (compared to code), standardization and repeatability (everyone hitting all the obvious cleaning and transform steps), and support for multiple analytic techniques and platforms.  As we all know, some will still want to write their own code for the models themselves, and some will prefer different platforms depending on the use case.  These platforms, especially those that can be used standalone, allow complete flexibility in modeling once the less pleasant tasks of data prep are complete.

 

 

Other articles by Bill Vorhies.

 

About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.  He can be reached at:

[email protected] or [email protected]

 


Comment by John L. Ries on October 1, 2018 at 10:23am

If you don't like data prep, you can do what my then-boss did many years ago and train a programmer (me, in that case) to do it for you.  Personally, I think data prep is fun, and the process ended up teaching me a great deal about the modeling process you probably enjoy.

Comment by William Vorhies on September 26, 2018 at 3:10pm

Paul:

Thanks for your insightful comments.  I completely agree with you about platforms, especially those with advanced drag-and-drop.  There is a perception that you lose some control with these but in my experience that's not true and the benefits of efficiency and consistency compared to code are substantial.

When I started in data science in about 2001 SAS and SPSS were the dominant players and were already moving away from their proprietary code toward drag-and-drop.  The transition in academia 7 or 8 years later to teaching in R seems to have been driven financially by the fact that although SAS and SPSS gave essentially free access to students, they still charged instructors, albeit at a large academic discount.  R however was free.

In my mind this was always an unnecessary digression back to the bad old days of coding-only, one which tended to take the new practitioner's eye off the ball of the fundamentals and make data science look like just another programming language to master.  I'm glad to see platforms reemerging that allow us to practice mostly without code, and I think a good decade of new data scientists have been misled by an education suggesting that coding is superior.

Comment by Paul Bremner on September 25, 2018 at 10:48am

Bill, thanks for posting this.  Great comments and I hadn't realized there were some interesting reports at Trifacta.  I'll have to download them and take a look.

I have to say that this issue of being able to use "platforms" as opposed to programming for many tasks is one of the most misunderstood and mischaracterized things I see in reading various posts in the Data Science Community (primarily by people in the open-source world).  I love programming (I'm concentrating right now on SAS) but much of what you need to get done can be accomplished with a good drag-and-drop interface.  In fact, from what I can see, SAS has essentially eliminated the distinction between programming and drag-and-drop.  You're typically looking at a split screen where you can use either.  The quickest way to "write code" is often to use the point-and-click interface to create the program, and then you can choose to modify what you've got.  That's typically not necessary because so much effort has been put into creating the drag-and-drop interface in the first place (and, as you note, the companies are doing this both to enhance the productivity of data scientists and also to enable citizen data scientists to put their toes in the water).

I continue to see polls out there talking about whether people favor SAS or R/Python, or something else.  Of course, it's interesting to see the results, but I always wonder whether the open source community actually knows/appreciates what's going on with the various applications, whether that's SAS, Tableau, or any of the other multitude of apps that are largely relegating programming to a situation where it's necessary in only a small number of instances.  Of course, as someone who likes programming and continues to invest time learning more of it, I think the best scenario is to be able to use a drag-and-drop interface and also have the capability to use programming on the occasions that might be necessary.

One thing you might want to look at again is the Gartner reports.  It's strange that the one you reference doesn't mention SAS.  The July 2018 Gartner report titled "Magic Quadrant for Data Integration Tools" (on the SAS site below) shows SAS in one of the top positions.  Although perhaps the definition they're using doesn't match up somehow with what you're describing here.

Magic Quadrant for Data Integration Tools (Gartner)

https://www.sas.com/en_us/news/analyst-viewpoints/gartner-magic-qua...

 
