
Summary:  We are entering a new phase in the practice of data science, the ‘Code-Free’ era.  Like all major changes, this one has not sprung up fully formed, but the movement is now large enough that its momentum is clear.  Here’s what you need to know.

 

We are entering a new phase in the practice of data science, the ‘Code-Free’ era.  Like all major changes, this one has not sprung up fully formed, but the movement is now large enough that its momentum is clear.

Barely a week goes by that we don’t learn about some new automated / no-code capability being introduced.  Sometimes these are new startups with integrated offerings.  More frequently they’re features or modules being added by existing analytic platform vendors.

I’ve been following these automated machine learning (AML) platforms since they emerged.  I first wrote about them in the spring of 2016 under the somewhat scary title “Data Scientists Automated and Unemployed by 2025!”.

Of course this was never my prediction, but in the last 2 ½ years the spread of automated features in our profession has been striking.

 

No Code Data Science

No-code data science, or automated machine learning, or, as Gartner has tried to brand it, ‘augmented’ data science, offers a continuum of ease of use.  The offerings range from guided to fully conversational:

  • Guided Platforms: Platforms with highly guided modeling procedures that still require the user to move through the steps (e.g. BigML, SAS, Alteryx).  Classic drag-and-drop platforms are the basis for this generation.
  • Automated Machine Learning (AML): Fully automated machine learning platforms (e.g. DataRobot).
  • Conversational Analytics: In this latest version, the user merely poses the question to be solved in plain English and the platform presents the best answer, selecting the data, features, modeling technique, and presumably even the best data visualization.

This list also pretty well describes the developmental timeline.  Guided Platforms are now old hat.  AML platforms are becoming numerous and mature.  Conversational analytics is just beginning.

 

Not Just for Advanced Analytics

This smart augmentation of our tools extends beyond predictive / prescriptive modeling into the realm of data blending and prep, and even into data viz.  What this means is that code-free smart features are being made available to classical BI business analysts, and of course to power user LOB managers (aka Citizen Data Scientists).

The market drivers for this evolution are well known.  In advanced analytics and AI it’s about the shortage, cost, and difficulty of acquiring enough skilled data scientists.  It’s also about time to insight, efficiency, and consistency.  Essentially, doing more with less, and faster.

However, in the data prep, blending, and feature identification world, which is also important to data scientists, the real draw is the much larger population of data analysts and BI practitioners.  In that world the ETL of classic static data is still a huge burden and time delay, and it is moving rapidly from an IT specialist function to self-service.

 

Everything Old is New Again

When I started in data science in about 2001, SAS and SPSS were the dominant players and were already moving away from their proprietary code toward drag-and-drop, the earliest form of this automation.

The transition in academia 7 or 8 years later to teaching in R seems to have been driven financially: although SAS and SPSS gave students essentially free access, they still charged instructors, albeit at a large academic discount.  R, however, was free.

We then regressed to an age, which continues today, in which being a data scientist means working in code.  That’s the way the current generation of data scientists has been taught, and, as you’d expect, that’s how they practice.

There has also been a mistaken belief that working in a drag-and-drop system does not allow the fine-grained hyperparameter tuning that code does.  If you’ve ever worked in SAS Enterprise Miner or its competitors you know this isn’t the case; in fact, that fine-tuning is made all the easier.

In my mind this was always an unnecessary digression back to the bad old days of coding-only, one that tended to take the new practitioner’s eye off the fundamentals and make data science look like just another programming language to master.  So I, for one, both welcome and expected this return to procedures that are both speedy and consistent among practitioners.

 

What About Model Quality?

We tend to think of a ‘win’ in advanced analytics as improving the accuracy of a model.  There’s a perception that relying on automated No-Code solutions gives up some of this accuracy.  This isn’t true.

The AutoML platforms like DataRobot, Tazi.ai, and OneClick.ai (among many others) not only run hundreds of model types in parallel including variations on hyperparameters, but they also perform transforms, feature selection, and even some feature engineering.  It’s unlikely that you’re going to beat one of these platforms on pure accuracy. 
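
To make this concrete, here is a minimal sketch, in Python with scikit-learn, of the core loop an AutoML platform runs at far larger scale: try several model families and hyperparameter settings, score each by cross-validation, and keep the winner.  This is purely illustrative and not any vendor’s actual implementation; the dataset, candidate models, and grids are placeholders.

```python
# Illustrative sketch of an AutoML-style search: several model families,
# several hyperparameter settings, cross-validated scoring, keep the best.
# Real platforms layer transforms, feature selection, and feature
# engineering on top of this loop.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # placeholder data

candidates = [
    (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    (RandomForestClassifier(random_state=0), {"n_estimators": [100, 300], "max_depth": [None, 10]}),
    (GradientBoostingClassifier(random_state=0), {"learning_rate": [0.05, 0.1], "n_estimators": [100, 200]}),
]

best_score, best_model = -1.0, None
for estimator, grid in candidates:
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X, y)
    if search.best_score_ > best_score:
        best_score, best_model = search.best_score_, search.best_estimator_

print(f"Best model: {best_model.__class__.__name__}  CV AUC: {best_score:.3f}")
```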

A caveat here is that domain expertise applied to feature engineering is still a human advantage.

Perhaps more importantly, when we’re talking about variations in accuracy in the second or third decimal place, are the many weeks you spent on development a good cost tradeoff compared to the few days or even hours these AutoML platforms require?

 

The Broader Impact of No Code

It seems to me that the biggest beneficiaries of no-code are actually classic data analysts and LOB managers who continue to be most focused on BI static data.  The standalone data blending and prep platforms are a huge benefit to this group (and to IT whose workload is significantly lightened).

These no-code data prep platforms like ClearStory Data, Paxata, and Trifacta are moving rapidly to incorporate ML features into their processes that help users select which data sources are appropriate to blend, what the data items actually mean (using more ad hoc sources in the absence of good data dictionaries), and even extending into feature engineering and feature selection. 

Modern data prep platforms are using embedded ML for example for smart automated cleaning or treatment of outliers.
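
As a purely illustrative example of what such a feature might do behind the scenes (not any specific vendor’s method), here is a small Python sketch that winsorizes numeric columns to their 1.5 × IQR fences, a common automated outlier treatment:

```python
# Hypothetical example of automated outlier treatment: cap each numeric
# column at its 1.5 * IQR fences (winsorizing). Column names and data are
# made up for illustration.
import pandas as pd

def treat_outliers_iqr(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        out[col] = out[col].clip(lower, upper)  # pull extreme values back to the fences
    return out

raw = pd.DataFrame({"income": [52_000, 48_000, 51_000, 1_200_000], "age": [34, 29, 41, 38]})
print(treat_outliers_iqr(raw))
```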

Others, like Octopai, just reviewed by Gartner as one of its “5 Cool Companies,” focus on enabling users to quickly find trusted data through automation, using machine learning and pattern analysis to determine the relationships among different data elements, the context in which the data was created, and the data’s prior uses and transformations.
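
Octopai’s actual methods are proprietary, but one simple flavor of this kind of pattern analysis is scoring column pairs across sources by the overlap of their values.  The sketch below is a hypothetical illustration of that idea; the tables and column names are invented.

```python
# Hypothetical illustration: discover likely relationships between data
# elements by measuring value overlap (Jaccard similarity) between columns.
import pandas as pd

def column_overlap(a: pd.Series, b: pd.Series) -> float:
    set_a, set_b = set(a.dropna()), set(b.dropna())
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

crm = pd.DataFrame({"customer_id": [101, 102, 103, 104]})
billing = pd.DataFrame({"cust_no": [102, 103, 104, 105], "amount": [10, 20, 30, 40]})

for col in billing.columns:
    print(f"customer_id vs {col}: {column_overlap(crm['customer_id'], billing[col]):.2f}")
```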

These platforms also enable secure self-service by enforcing permissions and protecting personally identifiable information (PII) and other similarly sensitive data.

Even data viz leader Tableau is rolling out conversational analytic features using NLP and other ML tools to allow users to pose queries in plain English and return optimum visualizations.
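
Tableau’s implementation is far more sophisticated, but the basic idea of conversational analytics can be sketched in a few lines: parse a plain-English question into an aggregation and pick a default chart.  Everything below (the data and the toy grammar) is hypothetical and only meant to show the shape of the problem.

```python
# Toy sketch of conversational analytics: "<average|total> <measure> by <dimension>"
# is parsed into a grouped aggregation and a default bar chart.
import re
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "South", "West"],
    "sales":  [120, 95, 143, 88, 110],
})

def answer(question: str, df: pd.DataFrame) -> pd.Series:
    m = re.match(r"(average|total)\s+(\w+)\s+by\s+(\w+)", question.lower())
    if not m:
        raise ValueError("Question not understood")
    agg = {"average": "mean", "total": "sum"}[m.group(1)]
    measure, dim = m.group(2), m.group(3)
    result = df.groupby(dim)[measure].agg(agg)
    result.plot(kind="bar", title=question)  # choose a sensible default visualization
    return result

print(answer("average sales by region", sales))
```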

 

What Does This Actually Mean for Data Scientists?

Gartner believes that within two years, by 2020, citizen data scientists will surpass data scientists in the quantity and value of the advanced analytics they produce.  They propose that data scientists will instead focus on specialized problems and embedding enterprise-grade models into applications.

I disagree.  This would seem to relegate data scientists to the role of QA and implementation.  That’s not what we signed on for.

My take is that this will rapidly expand the use of advanced analytics deeper and deeper into organizations thanks to smaller groups of data scientists being able to handle more and more projects.

We’re only a year or two removed from a time when the data scientist’s most important skills included blending and cleaning the data, and selecting the right predictive algorithms for the task.  These are exactly the areas that augmented, automatic no-code tools are taking over.

Companies that must create, monitor, and manage hundreds or thousands of models have been the earliest adopters, specifically insurance and financial services.

What’s that leave?  It leaves the senior role of Analytic Translator.  That’s the role McKinsey recently identified as the most important in any data science initiative.  In short, the job of Analytics Translator is to:

  1. Lead the identification of opportunities where advanced analytics can make a difference.
  2. Facilitate the process of prioritizing these opportunities.
  3. Frequently serve as project manager on the projects.
  4. Actively champion adoption of the solutions across the business and promote cost effective scaling.

In other words, translate business problems into data science projects and lead in quantifying the various types of risk and rewards that allow these projects to be prioritized.

 

What About AI?

Yes, even our most recent advances into image, text, and speech with CNNs and RNNs are rapidly being rolled out as automated no-code solutions.  And it can’t come fast enough, because the shortage of data scientists with deep learning skills is even more acute than the shortage of general practitioners.

Both Microsoft and Google rolled out automated deep learning platforms within the last year.  These started with transfer learning but are headed toward full AutoDL.  See Microsoft Custom Vision Services (https://www.customvision.ai/) and Google’s similar entry Cloud AutoML.
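
For readers who haven’t seen it, transfer learning (the starting point for these services) is straightforward to sketch in code: reuse a network pretrained on ImageNet, freeze it, and train a small new classification head on your own labels.  The snippet below is a generic Keras illustration under those assumptions, not how either vendor implements it, and the class count and training data are placeholders.

```python
# Minimal transfer learning sketch: frozen ImageNet backbone + new softmax head.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # keep the pretrained features fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(5, activation="softmax"),  # e.g. 5 custom image classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # your own labeled images go here
```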

There are also a number of startup integrated AutoDL platforms.  We reviewed OneClick.AI earlier this year; it includes both a full AutoML and an AutoDL platform.  Gartner recently nominated DimensionalMechanics as one of its “5 Cool Companies” with an AutoDL platform.

For a while I tried to personally keep up with the list of vendors of both No-Code AutoML and AutoDL and offer updates on their capabilities.  This rapidly became too much. 

I was hoping Gartner or some other worthy group would step up with a comprehensive review, and in 2017 Gartner did publish a fairly lengthy report, “Augmented Analytics Is the Future of Data and Analytics”.  The report was a good broad-brush treatment but failed to capture many of the vendors I was personally aware of.

To the best of my knowledge there’s still no comprehensive listing of all the platforms that offer either complete automation or significantly automated features.  They do however run from IBM and SAS all the way down to small startups, all worthy of your consideration.

Many of these are mentioned or reviewed in the articles linked below.  If you’re using advanced analytics in any form, or simply want to make your traditional business analysis function better, look at the solutions mentioned in these.

 

Additional articles on Automated Machine Learning, Automated Deep Learning, and Other No-Code Solutions

What’s New in Data Prep (September 2018)

Democratizing Deep Learning – The Stanford Dawn Project (September 2018)

Transfer Learning – Deep Learning for Everyone (April 2018)

Automated Deep Learning – So Simple Anyone Can Do It (April 2018)

Next Generation Automated Machine Learning (AML) (April 2018)

More on Fully Automated Machine Learning (August 2017)

Automated Machine Learning for Professionals (July 2017)

Data Scientists Automated and Unemployed by 2025 - Update! (July 2017)

Data Scientists Automated and Unemployed by 2025! (April 2016)

 

Other articles by Bill Vorhies.

 

About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.  He can be reached at:

[email protected] or [email protected]

 


Comment by Paul Bremner on Monday

Bill, thanks for this post.  Great comments, although on the subject of SAS I would add that (as you know), technically speaking, SAS is not a “code-free” data science platform.  It provides a full programming interface so if you want to work in code that’s easily doable in concert with the drag-and-drops.  Not sure how much you’d use coding because the interface is so good.  But I agree that, for all intents and purposes we’re in a “code free” era in data science.

 

I would also agree that Gartner’s characterization of how this affects data scientists is really off the mark.  The “augmented analytics” piece to which you refer cites a number of examples in marketing campaigns, bank loans, etc., and talks about the many successes of smaller, new market entrants with their drag-and-drop programs (data prep, modeling data, etc.).  Gartner neglects to mention who manages/runs these processes, leaving the reader with the impression that it is “citizen data scientists.”

These sorts of success stories are nothing new, as anyone who has read the Total Cost of Ownership studies on the SAS site can attest.  Typically, you’re taking a fragmented process with lots of people using spreadsheets and/or custom coding (read open source programming) to do everything from scratch, and you’re moving that all into one, coherent process using an automated application.  Automation increases the amount of work that can be performed, reduces the time required and requires fewer people doing the work.  The people doing the work, however, are IT/programmers/statisticians, etc.  They are basically data scientists, not “citizen data scientists” (i.e. people with some other primary job in a business unit who are getting their toes wet with data).  Automation allows you to run more campaigns and models, expand the number of internal customers you serve, and thereby increase your visibility (and exposure) in the organization.  The last thing you’d do in this situation is turn all this over to a “citizen data scientist.”

 

As you note, Gartner makes the statement that by 2020 “the number of citizen data scientists will grow five times faster than the number of expert data scientists” and that “citizen data scientists will surpass data scientists in terms of the amount of advanced analysis they produce and the value derived from it.”  And then they go on to say the following (underlining added for emphasis):

 

“....the emergence of augmented data discovery represents an entirely new level of business user autonomy, which could not only yield sizable returns but, if left unchecked, could also have adverse results.

 

".… Enabling citizen data scientists within an organization to use augmented data preparation, augmented data discovery and augmented data science and machine learning tools will promote widespread use of higher-value analytics within business processes.  However, inputs and outputs need to be validated, which requires collaboration between IT staff, business users and data science teams.”

 

In other words, Gartner has defined a category of people (citizen data scientists) that’s rapidly growing but can function effectively only if the “real/expert data scientists” tell them which inputs and outputs are valid.  Doesn’t sound like much of an advance in data science or analytic capabilities for a company if people need to be told which models are actually valid.  

 

The proposed solution to this “competence deficit” is that business users, and all their managers, should undergo training in data analysis, statistics and interpretation.  As everyone in the data science/analytics world knows, there are tons of courses and training (much of it free) that people use to increase their knowledge of statistics and programming, and people typically list this information in profiles like LinkedIn.   I would submit that if people with no technical background whose job is producing syndicated research on the analytics industry have not availed themselves of such opportunities to learn statistics or Data Science technologies, then it is unlikely that individuals whose jobs focus on things other than analytics (i.e. marketing, finance, sales, ops) will have the time or inclination to engage in such training.

 

It’s also puzzling that Gartner says “expert data scientists” will “focus on specialized problems and on embedding enterprise-grade models into applications.”  I’m not sure whether that means, as you said, that data scientists will essentially be doing just QA/implementation.   I’m wondering if Gartner might also mean that advanced analytics will eventually all be folded into enterprise applications like ERP/SCM/CRM which, broadly speaking, have analytic capabilities but are not really considered to have truly “advanced analytics” capability.  The analysts who authored this study produce the “Business Intelligence” assessments; another group of individuals produce the Data Science and Machine Learning study.  The backgrounds of the people in the first group are general business and their experience with “analytics” is with applications like Business Objects, Cognos and Hyperion so it would be understandable if they thought analytics will ultimately evolve in that direction and into those types of applications.  The folks in the second group have backgrounds/degrees in math, statistics, computer science, etc., and might have a different perspective.

 

At any rate, the concept of augmented or conversational analytics is certainly a new, fascinating development in the data science world, and the next step beyond drag-and-drop.  Your comments about the move to “R” over the last few years representing an “unnecessary digression back to the bad old days of coding-only,” despite the move by SAS and SPSS to drag-and-drop, are interesting.  Never really thought about it that way before.

 

When I got into data analytics several years back (I come from a non-quantitative, MBA, marketing background) I initially chose SAS and SQL programming.  I’ve toyed with the idea of adding open source, but have always resisted Python/R because to me it always felt like that would be “going in the wrong direction.”  Programming is fascinating but I also wanted the capability to use a drag and drop interface;  not just due to the increased speed and productivity in general research, but also because it provides the ability to sit down with stakeholders (marketing, finance, ops, sales, strategy) in a typical one-hour meeting, be able to generate models/analysis on the fly, feed back in real-time directly to the business users, modify assumptions and redo the models/analysis (all of this assumes, of course, that you have a really solid understanding of the statistics being used and what the platform is and isn’t doing.)  I’d never attempt that sort of thing with programming; certainly not Python/R with all their unnecessary coding complexity, and not even SAS or SQL programming, both of which are relatively clean, minimalistic languages.  It would be great if the next step in this process allowed people like me to simply use verbal commands and accomplish the same goals much faster.  Don’t see why that shouldn’t be possible.

 

For some time it’s seemed to me that, despite the Data Science Community’s focus on developing all kinds of new models (do we really need nearly 30 kinds of Neural Networking?), the biggest issue in operationalizing Data Science in organizations, and getting things done, is bridging the gap between Data Science/Data Analytics and business users (your earlier post and discussion about "Analytics Translators" is right on the mark.)  Making DS platforms even easier and faster to use (for people who actually know data science and stats) should help in that regard.

 

Having said that, I agree with Patrick that programming is a valuable part of anyone’s tool set and provides a competitive differentiator, however infrequently it may be needed when using automated ML applications.  In fact, that’s one of the reasons SAS teaches programming as part of its Enterprise Miner courses.  If you look at the initial course for SAS Enterprise Miner (“Applied Analytics for EM”), it’s essentially all drop-downs.  But in the advanced course (“Advanced Analytics Using EM”), you’re using drop-downs for 2/3 of the material and programming statements for maybe 1/3.  As SAS says in its description of the “code node” in Enterprise Miner, using programming allows Enterprise Miner users to utilize all aspects of the entire SAS system and expand use of EM beyond those provided in the drop-downs.

Comment by Patrick Stroh on Friday

My perspective is that autoML may "democratize data science", but that its ease and impact will diminish as adoption widens.  There are many examples of tools becoming easier to use, etc.  The net result is that everyone uses them, and any competitive advantage fades very quickly.  Of course, autoML will continue to get better (new interfaces, pipelines to execution platforms, etc.).  But there will always be "custom" problems outside those platforms that will require smart, technical talent.  Moral of the story for "today's data scientists": keep learning, and find those cracks in the autoML/automation tech stack.  That's where the competitive differentiation will be, and your value will be maintained.

Comment by Vincent Granville on October 10, 2018 at 5:59am

I call it the full stack data scientist (see my article here).  Coding is just one aspect of the data scientist role, but not one that is mandatory to be called a data scientist.  The less coding (that is, the more automation), the more time data scientists can spend on high-level tasks.  In my case, I have code that writes code, use APIs, use platforms (to solve equations, for instance), and usually do little coding except for unusual, ad hoc problems.  Another way to minimize coding is to rely on tools (software) and libraries (Python, R, etc.).  SQL coding can be done with visual dashboards, without explicitly entering code.
