
More on Fully Automated Machine Learning

Summary:  Recently we’ve been profiling Automated Machine Learning (AML) platforms, both the professional variety and, in particular, the proprietary one-click-to-model variety being pitched to untrained analysts and line-of-business managers.  Since our first article, readers have suggested some additional companies we should look at.  They are profiled here, along with some interesting observations about who is buying and why.

 

 

Recently we’ve written a series of articles on Automated Machine Learning (AML): platforms or packages designed to take over the most repetitive elements of preparing predictive models.  Typically these cover cleaning, preprocessing, some feature engineering, feature selection, and then model creation using one or several algorithms, with hyperparameter optimization.  Most will then offer code export and an API for scoring.
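To make those steps concrete, here is a minimal sketch of the same sequence written by hand with scikit-learn.  It is an illustration only, not how any of the platforms discussed here work internally.  The synthetic data, the single algorithm, and the small parameter grid are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real analytic flat file (assumption).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# Blank out about 5% of the values so the cleaning step has work to do.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),    # cleaning
    ("scale", StandardScaler()),                     # preprocessing
    ("select", SelectKBest(score_func=f_classif)),   # feature selection
    ("model", LogisticRegression(max_iter=1000)),    # model creation
])

# Hyperparameter optimization: a small illustrative grid, not a tuned one.
param_grid = {
    "select__k": [4, 8, 12],
    "model__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

An AML tool automates the choices that are hard-coded here: which imputation strategy to use, how many features to keep, which algorithms to try, and how wide a hyperparameter search to run.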

 

Two Major Schools

These are grouped into two major schools.  Tools for professional data scientists are open source packages in Python or R that are integrated with common libraries like scikit-learn.  Since they require knowledge of these common data science scripting languages it’s unlikely that non-data scientists will try to use them, hence our “professional” title.

The second school however should capture our attention for different reasons.  These are commercial platforms with well-designed UIs requiring no code, but offering the same or greater levels of automation.  We call these One-Click Data-In Model-Out and while some are designed to appeal to professional users, others are clearly targeting citizen data scientists and analysts with little or no formal data science training.

These One-Clicks in particular offer some professional advantages.

  • For shops producing a high volume of customer behavior models or regression value forecasts these can be real time savers.  The best of this bunch run all the modern algorithms and even create ensembles.  The ones I’ve tested produced very good results and you can’t beat them for speed.  Even SAS and SPSS have incorporated some of these features into their various products.  So whether you’re producing a lot of models or you just don’t have enough trained data scientists to keep up, these are worth a look.
  • Their second marketing target however is a bit more controversial.  These platforms are also (and sometimes exclusively) pitching to the untrained or at least lesser trained analyst and line-of-business user as a way to bypass data scientists.  Yes, the shortage of data scientists may create bottlenecks.  Add the fact that Gartner says this citizen data scientist market is 5X the size of the professional market and it’s easy to see why these platform developers would slant this way.

 

Some Market Observations

In talking with these developers, those pitching most directly to the non-data scientist market are indeed getting pushback from internal data science teams.  They also report getting a warm reception from line-of-business managers who are suffering from the bottlenecks. 

An interesting theme that emerges is that sales are most likely where no formal in-house data science group exists or if the company is currently outsourcing its model building.  According to one recent study, this group still accounts for about 60% of all businesses though we all understand that penetration in the largest companies is already 100%.

Another observation is that the greatest pushback comes from companies that have teams of very young data scientists.  The explanation provided is that these recent grads still think that all the operational requirements of predictive analytics can be performed directly in R or Python, the languages in which they were taught.  Teams of more mature data scientists have figured out the limitations of using scripting languages and are more inclined to accept proprietary platforms as an additional tool.

 

Who Is In the Market?

In our first article we reviewed:

Thanks to our readers we’ve identified three additional competitors that we describe here (in alphabetical order).

 

Compellon (www.compellon.com)

I had the pleasure of a demo and long conversation with Nikolai Liashenko, Chief Data Scientist and Marc Bir, Chief Technology Officer.  Compellon is firmly placed in the one-click non-data scientist market and has a very nicely developed UI with an emphasis on interpretability and transparency suitable for regulated markets. 

This includes displaying the variable impacts for each model down to the individual customer level, showing that the decision for one customer may have been influenced by different variables than the decision for another customer.  It also facilitates ‘what if’ analysis that would be most relevant to LOB users.
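For readers who want to see what customer-level variable impacts can look like in code, the sketch below computes them for an ordinary logistic regression.  This is not Compellon’s method, which is proprietary.  It is a standard decomposition for linear models, and the feature names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical variable names; the data itself is synthetic.
feature_names = ["tenure", "monthly_spend", "support_calls", "late_payments"]
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The "average customer" serves as the reference point for every impact.
baseline = X.mean(axis=0)


def variable_impacts(row):
    """Log-odds contribution of each variable relative to the average customer."""
    return model.coef_[0] * (row - baseline)


def top_driver(row):
    """Name of the variable with the largest absolute impact for this customer."""
    impacts = variable_impacts(row)
    return feature_names[int(np.argmax(np.abs(impacts)))]


# Two customers scored by the same model can have different top drivers.
for i in (0, 1):
    print(i, top_driver(X[i]), np.round(variable_impacts(X[i]), 2))
```

Because the model is linear in log-odds, the impacts for one customer sum exactly to the difference between that customer’s score and the average customer’s score.  A ‘what if’ analysis amounts to changing one value in the row and recomputing.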

What’s distinctly different about Compellon is the underlying data science.  They have developed a proprietary AI model generator that does not rely on any of the known statistical modeling algorithms and is therefore difficult to describe.  Based on Nikolai Lyashenko’s own lifetime research in information theory, the engine quickly produces good quality models, but without reference to classical feature selection or hyperparameter tuning.  Lyashenko says that the generator frequently produces models with the characteristics of deep neural nets but that DNNs are not used in model creation.

Compellon also uses a unique definition of feature engineering relating to identifying and ‘combining’ variables with extremely strong predictive capability.  Classical data cleaning and feature engineering are not required but could be accomplished outside the system before submitting the data.

Another interesting feature not found in other one-clicks is the optional ability to run a segmentation as part of the model that can conceivably deliver back as many as 10 segment-specific models.

They plan to publish comparative benchmark accuracy data later this year and claim that their current in-house tests have shown very high levels of fitness.

  1. Blending: no, starts with analytic flat file.
  2. Cleanse:  no; missing data, outliers, and miscodes need to be handled before loading.  However, Compellon states that their proprietary engine requires very little preprocessing or cleaning.
  3. Impute and Transform:  Not in the classical sense.  See description above.
  4. Feature Engineering:  Again not in the classical sense though Compellon describes their system’s identification of ‘super predictors’ as a form of data engineering.
  5. Feature Selection:  yes.
  6. Select ML Algorithms to be utilized:  Proprietary AI-based model generator.
  7. Create Ensembles: no.
  8. Run Algorithms in Parallel: no.
  9. Adjust Algorithm Hyperparameters during model development: no – not relevant to their proprietary engine.
  10. Select and deploy:  Only a single champion model is presented.  Deploy via Java or API.

 

DMWay (www.DMWay.com)

I had the pleasure of a demo and conversation with DMWay CEO Gil Nizri and CTO Ronen Meiri.  They have elected to pursue an ultra-simple but sophisticated approach targeting non-data scientists.  DMWay offers only GLM as a modeling tool and has developed a nice suite of preprocessing and feature selection tools to round out their easy-to-use platform.  Their focus, at least initially, is on regulated markets (banking, insurance and lending) where interpretability is key and the volume of models is high, but they look to expand into all types of users.

  1. Blending: no, starts with analytic flat file.
  2. Cleanse:  yes
  3. Impute and Transform:  yes
  4. Feature Engineering: Some.  More sophisticated automatic creation of, for example, ratios from related variables is to follow.
  5. Feature Selection:  yes
  6. Select ML Algorithms to be utilized:  Only GLM, selection of linear or logistic.
  7. Create Ensembles: no.
  8. Run Algorithms in Parallel: no, only one algorithm to run.
  9. Adjust Algorithm Hyperparameters during model development: yes – some access to adjust these for knowledgeable users.
  10. Select and deploy:  Only a single champion model is presented.  Deploy via R, Java, or SQL.
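The choice between linear and logistic GLMs in step 6 can be illustrated with a short sketch.  This is not DMWay’s code.  It assumes scikit-learn, a hypothetical helper named fit_glm, and a deliberately simple rule: a target with exactly two distinct values gets logistic regression, and anything else gets linear regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression


def fit_glm(X, y):
    """Fit logistic regression for a two-valued target, linear regression otherwise."""
    y = np.asarray(y)
    if np.unique(y).size == 2:
        return LogisticRegression(max_iter=1000).fit(X, y)
    return LinearRegression().fit(X, y)


# Synthetic predictors and targets (assumptions for illustration only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
signal = X @ np.array([1.5, -2.0, 0.5])

churned = (signal + rng.normal(scale=0.5, size=200) > 0).astype(int)  # binary target
revenue = 100 + 10 * signal + rng.normal(scale=1.0, size=200)         # continuous target

churn_model = fit_glm(X, churned)
revenue_model = fit_glm(X, revenue)

print(type(churn_model).__name__, type(revenue_model).__name__)
print(np.round(revenue_model.coef_, 1))
```

Either way the result is a set of coefficients that can be read directly, which is the interpretability argument for restricting a platform to GLM in regulated markets.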

 

TIMi Suite from Business Insights (www.timi.eu or www.business-insight.com)

According to my interview with Frank Vanden Berghen, the Director and founder of Business Insights and the TIMi platform, they are pursuing both the professional data science market and the non-professional one-click market.  For the latter, TIMi (The Intelligent Mining Machine) comes with the expected much simpler UI allowing fully automated operation.

Based in Belgium with representation in the US, TIMi is the only suite we encountered that laid claim to many years of significant wins and high placing in various competitions, including most recently a 9th place in a 2015 Kaggle contest.  TIMi also offers complete SAS integration.  Some clients are reported to be using TIMi up through feature engineering then exporting the dataset.

  1. Blending: yes, with their separate Anatella ETL product including direct access to HDFS.
  2. Cleanse:  yes
  3. Impute and Transform:  yes
  4. Feature Engineering: yes, including at least ratios, binning, and data range conversions.
  5. Feature Selection:  yes
  6. Select ML Algorithms to be utilized:  no, automatically runs logistic regression, decision trees, and elastic net.
  7. Create Ensembles: no.
  8. Run Algorithms in Parallel:  yes.
  9. Adjust Algorithm Hyperparameters during model development: yes – access to adjust these in the professional version interface.
  10. Select and deploy:  Only a single champion model is presented.  Deploy via Java, SQL, or PMML.
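Steps 6 and 10 together describe a pattern common to these platforms: run a fixed set of algorithms, then present one champion.  The sketch below shows that pattern with scikit-learn.  It is not TIMi’s implementation.  The synthetic data, the AUC metric, the 5-fold cross-validation, and the use of elastic-net-penalized logistic regression as the ‘elastic net’ candidate are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=7
)

# The fixed candidate set; the user does not choose the algorithms.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=7),
    "elastic_net": make_pipeline(
        StandardScaler(),
        LogisticRegression(
            penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
        ),
    ),
}

# Score every candidate the same way, then keep only the best one.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
champion_name = max(scores, key=scores.get)
champion = candidates[champion_name].fit(X, y)

print({name: round(score, 3) for name, score in scores.items()})
print("champion:", champion_name)
```

Presenting only the champion keeps the interface simple for a non-data scientist, at the cost of hiding how close the runners-up were.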

 

Having That Conversation with Management

Whether you are pro or con automated machine learning, there will come a time when you have to explain to management the risks involved in allowing non-data scientists to produce production predictive models.  Management will also want to know how those risks can be mitigated and whether this should be allowed at all.  Chances are very high that making this explanation will fall to you, the in-house data scientist.

However you choose to handle this, with resistance or by ‘democratizing’ the process, is up to you.  It’s clear however that the AML market is just gaining traction and that you’ll be seeing more and more of these platforms.

To most of us this seems like a natural progression to automate what can be safely automated and preserve our time for the creative portions of data science.  We suggest you be proactive and check out some of these, then draw your own conclusions.

 

 

About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.


Comment by Sylvain Ferrandiz on August 28, 2017 at 5:26am

The topic of AutoML is hot these days, indeed. In my opinion, it is not a question of 'algorithms replacing people' (I understand that such headlines generate a lot of clicks, though). It is more about how some well-designed autoML algorithms can help expert data scientists focus on what they really like and where they add value (connecting to the business, understanding what drives a specific behavior, analyzing predictive performance, putting things into production, monitoring models, etc.), and about empowering non-expert users, as a lot of companies build data teams with different blends of coding / maths / business skills (which is a good idea, in my mind).

But I may be biased, as I'm the product owner of PredicSis.ai, which falls into the category of one-click data-in model-out. Very briefly, with respect to the steps used above to evaluate the solutions, this is how I may position PredicSis.ai:

  1. Blending: yes, starts with relational datasets
  2. Cleanse: no, missing data, outliers, miscodes need to be handled before loading. However, the ML algorithms we developed are not sensitive to outliers and handle missing data, so very little preprocessing or cleaning is needed.
  3. Impute and Transform: no.
  4. Feature Engineering: yes, engineering features from relational datasets (i.e. aggregates).
  5. Feature Selection:  yes.
  6. Select ML Algorithms to be utilized: no. Proprietary autoML model generator.
  7. Create Ensembles: no, the user cannot blend models. However, behind the scenes, the autoML model generator is using Bayesian Averaging of weak learners.
  8. Run Algorithms in Parallel: no. Not required as the autoML algorithms we develop are designed to be resource savvy (CPU / RAM / wall-clock time). We design algorithms with linearithmic complexities (time and space)
  9. Adjust Algorithm Hyperparameters during model development: no – not relevant as the autoML algorithms we develop are nonparametric. We design algorithms to reach the best trade-off between robustness (predictive performance on new unseen data is as stable in time as possible) and performance (the estimated performance on past data is as high as possible)
  10. Select and deploy: the user selects and deploys her favourite model among the ones she trained. Deploy via GUI or via an open-source Python SDK (wrapping PredicSis.ai API).

Development of the autoML algorithms started in 2004, PredicSis was founded in 2013 and PredicSis.ai v1.0 was released in 2015. The current version is v3.7. V4.0 will be released this fall. These algorithms have been published in academic papers and benchmarked through academic challenges (like the ChaLearn AutoML Challenge).

As a video is sometimes worth a thousand words, we have this 5-minute video that gives a sense of one possible user experience among several. Some are using PredicSis.ai to understand drivers of a specific outcome, some others to assess whether a specific behavior is drifting through time, some others to perform uplift modeling, and the list goes on.

Customers using PredicSis.ai range from small, data-driven startups empowering their single data analyst (it appears we have to say 'citizen data scientist' nowadays :-D ), to big companies with expert data science teams building what they call a 'model factory', building, deploying and maintaining hundreds of models.

Hope this comment can be useful. If not, tell me.

Whatever your thoughts about this post are, I'm always happy to talk about autoML, so don't hesitate to reach out to me!

Sylvain.

Comment by Kimberly Stewart on August 21, 2017 at 4:01am

I love learning more about this subject of Citizen Data Scientist and the AML platforms.  Spent 2 hours yesterday morning, really digging deep into the details behind this subject to include a handful of your blogs, articles, etc.  LOVE IT.  Keep it coming.  And thank you.
