Note: This article was originally posted on OpenDataScience.com.
Witnessing the data science field’s meteoric rise in demand across pretty much all industries and areas of scientific research, it’s easy to anticipate efforts to create shortcuts to satisfy the need for more data science practitioners. The current trend of automated machine learning is a great case in point. This article will touch on a number of efforts to circumvent the need for data scientists to select and train machine learning models and determine metrics for measuring their performance.
The search for automated approaches in computer science is not new. I can remember as far back as the 1980s when the birth of the Personal Computer triggered a steep advance in the demand for programmers to develop software for the small machines. There were many attempts at “automated programming” and “code generators” designed to advance the idea of point-and-click software development. It never really took off because the goals weren’t realistic, replacing human coders. Something similar is happening today with “automated machine learning”
Should data scientists be concerned? A recent Pew Research Center study provided the percentage of U.S. adults who think certain professions will be replaced by robots or computers in their lifetimes showed that 53% of software developers believed that their jobs would be replaced “somewhat or very likely.”  So this means a majority of software developers believe that automated means will take their profession away. This is pretty amazing when you think about it. I don’t believe there is nearly enough evidence to back up that viewpoint. I’m hoping to see another study just for data scientists. Nevertheless, let’s survey the landscape of this compelling segment of technology.
Automated Machine Learning Platforms
There are a number of new approaches for automated machine learning platforms, and we see these prominently mentioned in the industry’s news cycle. For example, Google’s Cloud AutoML is a suite of machine learning products that enables developers with limited machine learning expertise to train models specific to their business needs, by leveraging Google’s new technologies: AutoML Vision, AutoML Natural Language, and AutoML Translation. AutoML is the result of 10 years of Google Research efforts and provides a simple graphical user interface (GUI) for users to train, evaluate, improve, and deploy models based on their own data.
There is also Auto-Keras which is described in the paper: “Efficient Neural Architecture Search with Network Morphism,” by Haifeng Jin, Qingquan Song, and Xia Hu. Auto-Keras is an open source software library for automated machine learning. It is developed by DATA Lab at Texas A&M University in addition to community contributors. The ultimate goal of automated machine learning is to provide easily accessible deep learning tools to domain experts with limited data science or machine learning background. A key element of Auto-Keras is that it provides functions to automatically search for architecture and hyperparameters of deep learning models.
H2O.ai also has their own AutoML platform. H2O has made it easy for non-experts to experiment with machine learning. Although these tools have made it easy to train and evaluate machine learning models, there is still a fair amount of knowledge and background in data science required to produce high-performing machine learning models.
Democratizing Data Science
A team of MIT researchers is working to advance what they call “the democratization of data science” with a new tool for non-statisticians that automatically generates models for analyzing raw data. Touted as a tool requiring users to only write a few lines of code to uncover insights into various problem domains such as financial trends, air travel, voting patterns, the spread of disease, etc.
One the researchers, Feras Saad, a Ph.D. student in the Department of Electrical Engineering and Computer Science (EECS), gave a talk “Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling” where he presented new techniques for automatically constructing probabilistic programs for data analysis, interpretation, and prediction. Saad recognized that people have a lot of data sets in various data silos, and his goal is to build systems that let people automatically get models they can use to ask questions about that data. The associated paper, “Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling,” presented at the ACM SIGPLAN Symposium on Principles of Programming Languages, shows how the tool can accurately extract patterns and make predictions from real-world data sets, and even outperform manually constructed models in certain data-analytics tasks.
There are also a number of startup companies working to push the envelope with automated machine learning from a number of different perspectives:
Binah’s Automatic Data Science Engine solution includes signal processing, what they call their “secret sauce,” which the company sees as a critical component for pre-processing the data, and speeding mathematical modeling. The data is then combined with proprietary algorithms, machine and deep learning, and artificial intelligence. The augmented data analytics solution comprises complex mathematical algorithms for data processing, modeling, training, and testing.
DataRobot captures the knowledge, experience, and best practices of the top data scientists, and delivers a high level of automation and ease-of-use for machine learning initiatives. DataRobot enables users to build and deploy highly accurate machine learning models in a fraction of the time it takes using traditional data science methods.
The ability to truly democratize the process is perhaps the most important element of any enterprise machine learning platform. DataRobot automates the entire modeling lifecycle, enabling users to quickly and easily build predictive models. The company advises that coding and machine learning skills are completely optional.
BigML promotes the notion of “machine learning for everyone” by providing a cloud-based Machine Learning as a Service (MLaaS) solution. BigML is a machine learning platform that lets developers create enterprise-level predictive applications. In a sense, it’s like Tableau for machine learning in that it’s easy-to-use, visually appealing, and comprehensible so that even those without in-depth data science knowledge can create and deploy models.
MissingLink offers a different approach for automated machine learning, by helping data scientists streamline and automate the entire deep learning lifecycle. The platform lets you train your model more frequently, at lower cost and with greater confidence. Compatible with the popular frameworks: Tensorfow, Keras, PyTorch, MissingLink lets you easily manage the experiment, data, and resources all in one place. The dashboard lets you see the experiment’s hyperparameters, code, data, logs, artifacts, and more. It’s easy to compare or reproduce an experiment. The platform can also automatically determine the optimal hyperparameters. Scale as you grow and optimize your resources for the best possible outcome and return on investment.
In the last several years, the demand for data scientists with machine learning expertise has outpaced the supply of this skillset despite the surge of people entering the field. To address this gap, there have been significant strides in the development of user-friendly machine learning software that can be used by non-experts. Only time will tell how successful these efforts will be. As a data scientist myself, I hope they’re not “too successful” since I really love my job!
 LA Times, Sunday, October 14, 2018 “The rise of the machines: Robots reshape job market”