Summary: We are entering a new phase in the practice of data science, the ‘Code-Free’ era. Like all major changes, this one has not sprung up fully grown, but the movement is now large enough that its momentum is clear. Here’s what you need to know.
Barely a week goes by that we don’t learn about some new automated / no-code capability being introduced. Sometimes these are new startups with integrated offerings. More frequently they’re features or modules being added by existing analytic platform vendors.
I’ve been following these automated machine learning (AML) platforms since they emerged. I first wrote about them in the spring of 2016 under the somewhat scary title “Data Scientists Automated and Unemployed by 2025!”.
Of course this was never my prediction, but in the last 2 ½ years the spread of automated features in our profession has been striking.
No Code Data Science
No-Code data science, or automated machine learning, or, as Gartner has tried to brand it, ‘augmented’ data science, offers a continuum of ease-of-use. These range from:

- Guided analytic platforms that structure the workflow for the user
- Fully automated machine learning (AML) platforms
- Conversational analytics
This list also pretty well describes the developmental timeline. Guided Platforms are now old hat. AML platforms are becoming numerous and mature. Conversational analytics is just beginning.
Not Just for Advanced Analytics
This smart augmentation of our tools extends beyond predictive / prescriptive modeling into the realm of data blending and prep, and even into data viz. What this means is that code-free smart features are being made available to classical BI business analysts, and of course to power user LOB managers (aka Citizen Data Scientists).
The market drivers for this evolution are well known. In advanced analytics and AI, it’s the shortage and cost of skilled data scientists, and the difficulty of hiring enough of them. For everyone else, it’s about time to insight, efficiency, and consistency: essentially doing more with less, and faster.
However, in the world of data prep, blending, and feature identification, which also matters to data scientists, the real draw is the much larger community of data analysts and BI practitioners. For them, the ETL of classic static data is still a huge burden and time delay, one that is rapidly moving from an IT specialist function to self-service.
Everything Old is New Again
When I started in data science in about 2001, SAS and SPSS were the dominant players and were already moving away from their proprietary code toward drag-and-drop, the earliest form of this automation.
The transition in academia to teaching in R, 7 or 8 years later, seems to have been financially driven: although SAS and SPSS gave students essentially free access, they still charged instructors, albeit at a large academic discount. R, however, was free.
We then regressed to an age, one that continues today, in which being a data scientist means working in code. That’s how the current generation of data scientists was taught, and unsurprisingly, that’s how they practice.
There is also a mistaken belief that working in a drag-and-drop system does not allow the fine-grained hyperparameter tuning that code does. If you’ve ever worked in SAS Enterprise Miner or its competitors you know this is incorrect; in fact, fine tuning is made all the easier.
In my mind this was always an unnecessary digression back to the bad old days of coding-only, one that tended to take the new practitioner’s eye off the fundamentals and make data science look like just another programming language to master. So I, for one, both expected and welcome this return to procedures that are speedy and consistent across practitioners.
What About Model Quality
We tend to think of a ‘win’ in advanced analytics as improving the accuracy of a model. There’s a perception that relying on automated No-Code solutions gives up some of this accuracy. This isn’t true.
AutoML platforms like DataRobot, Tazi.ai, and OneClick.ai (among many others) not only run hundreds of model types in parallel, including variations on hyperparameters, but also perform transforms, feature selection, and even some feature engineering. It’s unlikely that you’ll beat one of these platforms on pure accuracy.
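Stripped of scale and polish, the core of these platforms is a search loop: fit many candidate model families and hyperparameter settings, score each on held-out data, and keep the winner. Here is a minimal sketch in plain Python; the toy "models" are illustrations only, not any vendor's actual API.

```python
import random
import statistics

random.seed(0)

# Synthetic regression data: y is roughly linear in x with noise.
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5))
        for x in [i / 10 for i in range(100)]]
random.shuffle(data)
train, valid = data[:80], data[80:]

def fit_mean(train):
    """Baseline: always predict the training mean."""
    m = statistics.mean(y for _, y in train)
    return lambda x: m

def fit_knn(train, k=3):
    """Predict the mean y of the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return statistics.mean(y for _, y in nearest)
    return predict

def fit_linear(train):
    """Ordinary least squares on a single feature."""
    xs = [x for x, _ in train]
    mx = statistics.mean(xs)
    my = statistics.mean(y for _, y in train)
    b = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

# The "search space": model families crossed with hyperparameter settings.
candidates = ([("mean", fit_mean, {})]
              + [("knn k=%d" % k, fit_knn, {"k": k}) for k in (1, 3, 5)]
              + [("linear", fit_linear, {})])

def mse(model, rows):
    return statistics.mean((model(x) - y) ** 2 for x, y in rows)

# Fit every candidate on the training split, score on the held-out split,
# and keep the best -- the essence of the AutoML loop.
results = sorted((mse(fit(train, **params), valid), name)
                 for name, fit, params in candidates)
best_score, best_name = results[0]
print("best:", best_name, "validation MSE: %.3f" % best_score)
```

The real platforms run this loop in parallel over hundreds of model types and fold automated transforms and feature selection into the same search.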
A caveat here is that domain expertise applied to feature engineering is still a human advantage.
Perhaps more importantly, when we’re talking about differences in accuracy at the second or third decimal place, are the many weeks you spent on development a good cost tradeoff compared to the few days or even hours these AutoML platforms require?
The Broader Impact of No Code
It seems to me that the biggest beneficiaries of no-code are actually classic data analysts and LOB managers who continue to be most focused on BI static data. The standalone data blending and prep platforms are a huge benefit to this group (and to IT whose workload is significantly lightened).
These no-code data prep platforms like ClearStory Data, Paxata, and Trifacta are moving rapidly to incorporate ML features into their processes: helping users select which data sources are appropriate to blend, inferring what the data items actually mean (from more ad hoc sources, in the absence of good data dictionaries), and even extending into feature engineering and feature selection.
Modern data prep platforms are using embedded ML for example for smart automated cleaning or treatment of outliers.
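As an illustration of what such embedded treatment amounts to (a hand-rolled sketch, not any vendor's actual implementation), one common automated approach is interquartile-range capping: values beyond the Tukey fences are clipped back to them.

```python
import statistics  # statistics.quantiles requires Python 3.8+

def iqr_cap(values, factor=1.5):
    """Clip values to the Tukey fences [Q1 - f*IQR, Q3 + f*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - factor * iqr, q3 + factor * iqr
    return [min(max(v, lo), hi) for v in values]

# Two obvious outliers (95 and -40) get pulled back to the fences;
# the inliers pass through unchanged.
readings = [10, 12, 11, 13, 12, 11, 95, 10, -40, 12]
print(iqr_cap(readings))
```

A production platform would additionally decide per column whether to cap, drop, or impute, but the fence logic itself is this simple.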
Others, like Octopai, just named one of Gartner’s “5 Cool Companies”, focus on enabling users to quickly find trusted data through automation. They use machine learning and pattern analysis to determine the relationships among data elements, the context in which the data was created, and the data’s prior uses and transformations.
These platforms also enable secure self-service by enforcing permissions and protecting PII and other similarly sensitive data.
Even data viz leader Tableau is rolling out conversational analytic features that use NLP and other ML tools to let users pose queries in plain English and get back the best-fitting visualizations.
What Does This Actually Mean for Data Scientists
Gartner believes that within two years, by 2020, citizen data scientists will surpass data scientists in the quantity and value of the advanced analytics they produce. They propose that data scientists will instead focus on specialized problems and embedding enterprise-grade models into applications.
I disagree. This would seem to relegate data scientists to the role of QA and implementation. That’s not what we signed on for.
My take is that this will rapidly expand the use of advanced analytics deeper and deeper into organizations thanks to smaller groups of data scientists being able to handle more and more projects.
Only a year or two ago, the data scientist’s most important skills included blending and cleaning the data and selecting the right predictive algorithms for the task. These are precisely the areas that augmented, automatic, no-code tools are taking over.
Companies that must create, monitor, and manage hundreds or thousands of models have been the earliest adopters, specifically insurance and financial services.
What’s that leave? It leaves the senior role of Analytics Translator, the role McKinsey recently identified as the most important in any data science initiative. In short, the job of the Analytics Translator is to translate business problems into data science projects and to lead in quantifying the various types of risk and reward that allow those projects to be prioritized.
What About AI?
Yes, even our most recent advances into image, text, and speech with CNNs and RNNs are rapidly being rolled out as automated no-code solutions. And they can’t come soon enough, because the shortage of data scientists with deep learning skills is even more acute than the shortage of general practitioners.
Both Microsoft and Google rolled out automated deep learning platforms within the last year. These started with transfer learning but are headed toward full AutoDL. See Microsoft Custom Vision Services (https://www.customvision.ai/) and Google’s similar entry Cloud AutoML.
There are also a number of startup integrated AutoDL platforms. We reviewed OneClick.AI earlier this year; it offers both a full AutoML and an AutoDL platform. Gartner recently nominated DimensionalMechanics as one of its “5 Cool Companies” with an AutoDL platform.
For a while I tried to personally keep up with the list of vendors of both No-Code AutoML and AutoDL and offer updates on their capabilities. This rapidly became too much.
I was hoping Gartner or some other worthy group would step up with a comprehensive review, and in 2017 Gartner did publish a fairly lengthy report, “Augmented Analytics Is the Future of Data and Analytics”. The report painted a good broad-brush picture but failed to capture many of the vendors I was personally aware of.
To the best of my knowledge there’s still no comprehensive listing of all the platforms that offer either complete automation or significantly automated features. They do however run from IBM and SAS all the way down to small startups, all worthy of your consideration.
Many of these are mentioned or reviewed in the articles linked below. If you’re using advanced analytics in any form, or simply want to make your traditional business analysis function better, look at the solutions mentioned in these.
Additional articles on Automated Machine Learning, Automated Deep Learning, and Other No-Code Solutions
What’s New in Data Prep (September 2018)
Democratizing Deep Learning – The Stanford Dawn Project (September 2018)
Transfer Learning – Deep Learning for Everyone (April 2018)
Automated Deep Learning – So Simple Anyone Can Do It (April 2018)
Next Generation Automated Machine Learning (AML) (April 2018)
More on Fully Automated Machine Learning (August 2017)
Automated Machine Learning for Professionals (July 2017)
Data Scientists Automated and Unemployed by 2025! (April 2016)
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001. He can be reached at: