Summary: Some observations about new major trends and directions in data science drawn from the Strata Data conference in San Jose last week.
I’m just back from my annual field trip to the Strata Data conference in San Jose. Strata is on tour with four more stops before repeating next year. The New York show is supposed to be a little bigger (hard to imagine), but the San Jose show is closest to our intellectual birthplace.
My personal goal is to soak in the headwaters of AI, analytics, and big data, and not only to learn as much as I can about specific techniques and products, but also to try to see where our profession as a whole is headed. What a difference a year makes.
I’ll try to briefly share the major themes and changes I found this year and will write later in more depth about some of these.
Data Science Hits Maturity
Our profession has been speeding forward at the pace of dog years. You know the old saw about one dog year equaling seven human years. Just three or four years ago we were bright youngsters with shiny new toys and techniques (Hadoop, ML, ensembles). As recently as last year we were tempestuous teens showing off our newest moves (Spark, GPUs, deep learning). Suddenly we’ve turned into millennials, now more concerned with paying the rent and getting a good job.
All of which is to say that this year was notable for its lack of any big new shiny trick and more about infrastructure, data fabrics, and making sure all those previously discovered neat things work well together.
We used to talk about ‘platform convergence’ as something that was underway and would play out in the future. Well that happened. They’ve converged.
By actual count, 75% of this year’s exhibitors were pitching either end-to-end platforms or major enhancements to them. ‘Enterprise data fabric’ is perhaps the most used term.
In the past, if you were big enough, you bought and tried everyone’s newest offering and built one or more projects around that application’s strengths. This is what we used to call a ‘best of breed’ solution.
But the counter argument, which has some merit, is that this actually resulted in siloing skills and resources around specific applications and projects, which worked against the overall efficiency and speed of the data science function.
It’s not that there aren’t differences between platforms. It’s that they all do everything pretty well so if you’re shopping you’ll really have to get down in the weeds to find the differences that matter to you.
Some of the major subthemes among these exhibitors:
- Ease of model implementation: Many vendors made the case that the process of moving from model to implementation is deeply flawed and in need of automation.
- Operationalizing Streaming: Spark 2.0 by itself no longer seems to be enough. A fairly large group of vendors is reinventing the streaming platform, abstracting away the details to make streaming easy and enterprise-worthy.
- GDPR and Personally Identifying Information: With GDPR just weeks away, many vendors were offering features designed to meet the new regulations. There seems to be an assumption that the requirements of GDPR are so fundamental that even if you are not doing business in the EU you might as well adopt them, since running systems with multiple standards would simply be too complex. Others offered features in which PII is automatically identified, sequestered, and masked. Not a bad idea, if it’s 100% reliable.
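The automatic identification and masking those vendors describe can be sketched in a few lines. Here is a minimal regex-based pass over free text, assuming hypothetical patterns for just emails and US-style phone numbers; real products use far more robust detection covering names, addresses, national IDs, and more:

```python
import re

# Hypothetical patterns for two common PII types; production systems
# detect many more categories and use stronger matching than regexes.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def mask_pii(text: str) -> str:
    """Replace each detected PII span with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} MASKED]", text)
    return text
```

The typed placeholders preserve enough structure for downstream analytics while sequestering the raw values, which is the general shape of what these features do.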
Artificial Intelligence Everywhere – Or Was It?
If you’ve read my previous laments about the overuse of the term AI, you know I can really get my knickers in a twist over this. On the vendor side of the room, all of data science is now just part of ‘AI’.
Pretty much every exhibitor has a story about how they incorporate AI, even when they don’t. Mostly this occurs when vendors are presenting some improvement to good old predictive analytics. You know, supervised and unsupervised modeling with lots of well-defined features. That’s not AI.
It’s mostly the fault of how the popular press now talks about data science. Every article, blog, or comic book now says that anything that supports or replaces a human decision is AI. In a sense, how can you blame the platform developers? If all your new and existing customers are clamoring for AI then let’s tell them that everything is AI. Problem solved.
I suspect I’ll have to give up my crusade. We’re too far along the path of renaming.
Deep Learning – The Real AI
The good news is that the non-vendor presenters still know the difference between AI and predictive analytics, and their presentations carefully parsed this reality. Of course AI is broader than deep learning, since it also incorporates reinforcement learning. These toolsets have in turn given us image, text, and speech applications, game play, self-driving cars, chatbots, and Watson-like question-answering machines. When you talk to a data scientist, this is what we mean by AI.
As for the tools, several of the major platforms and some of the challengers now offer DNNs as part of the toolset. While there is some claim to having simplified their operation, no one would go so far as to say they have licked the very difficult problem of automating hyperparameter tuning on DNNs.
The issue of how many nodes and how many layers remains a major challenge that we hope will someday be automated. The varying nature of the problems themselves makes this an incredibly wide and deep search space. At least one vendor claimed to be developing a library of ‘similar models’, not so that they could be copied as in transfer learning, but so that developers could use their node and layer hyperparameters as a guide to new creation.
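At its core, the how-many-nodes, how-many-layers question is a search over candidate architectures. A toy random-search sketch follows; the `score` function is a hypothetical stand-in for what real code would do, namely train a network of that shape and return its validation accuracy:

```python
import random

def score(layers: int, nodes: int) -> float:
    """Stand-in for training a DNN of this shape and returning
    validation accuracy; a toy closed-form surrogate here."""
    return 1.0 - abs(layers - 3) * 0.1 - abs(nodes - 64) / 256

def random_search(trials: int = 20, seed: int = 0):
    """Randomly sample (layers, nodes) configs and keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        cfg = (rng.randint(1, 8), rng.choice([16, 32, 64, 128, 256]))
        s = score(*cfg)
        if best is None or s > best[0]:
            best = (s, cfg)
    return best
```

Since every trial means a full training run in practice, and the surface shifts with each new problem, even this simple loop is expensive, which is why automating it remains unsolved.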
Unlike last year, there were no giant oversubscribed classrooms teaching hundreds of aspiring data scientists how to improve their career prospects with TensorFlow. It’s just that TensorFlow and the other platforms have become more accessible and more mainstream, and as a practitioner you’ve probably already decided whether you’re going to invest the time to master them.
Citizen Data Scientists – Democratization – Simplification
Data democratization is code for making data more readily accessible to LOB managers and analysts. Those tools also benefit data scientists and the emerging career now called data engineering. They also indirectly benefit IT, which used to bear the brunt of this workload.
There were new tools and platforms this year focused on blending, prep, cleaning, and generally readying data for the analytics that have long been the domain of the LOB manager/analyst: mostly retrospective BI-oriented tables and data viz dashboards. BI is alive and still valued, and these improvements are welcome.
The less comfortable angle on this was started by Gartner three or four years ago when they made two claims: 1) supervised and unsupervised modeling was becoming sufficiently automated and simplified that LOB managers/analysts could start producing models without data scientists; and 2) (if you believed number 1) that this market would grow roughly 7X faster than the professional data science user market.
Advanced analytic platform developers (and the VCs that back them) bought into this completely and suddenly the Citizen Data Scientist (CDS) was the new customer to capture.
There were 13 advanced analytic platform vendors exhibiting this year, and with rare exception they said their push to simplification and automation was in pursuit of the CDS market.
I’m actually a fan of what’s going on in the automated and simplified machine learning field. But my rationale is firmly based in the efficiency these tools promise for trained data scientists, allowing fewer of those rare resources to do work that used to require many. It’s about efficiency and effectiveness.
Especially among the new entrants there is a very specific focus on CDSs. Some of these are almost fully automated, one-click platforms similar to DataRobot. In the roughly 18 months that I’ve been tracking this market I now count about 20 standalone entrants. Immediately behind them is a group that has simplified and automated the individual steps in the process without fully automating it (e.g. BigML). This group includes the majors like SAS and SPSS.
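Conceptually, the one-click automation these platforms pitch boils down to iterating over candidate model families and keeping the winner. Here is a deliberately minimal sketch of that selection loop, with trivial stand-in “models” and toy data where a real platform would fit and cross-validate actual learners (all names are hypothetical):

```python
# Toy training set: (x, label) pairs, where label = 1 when x > 0.
DATA = [(-2, 0), (-1, 0), (1, 1), (2, 1), (3, 1), (-3, 0)]

# Stand-in "model families": each maps x to a predicted label.
CANDIDATES = {
    "always_one": lambda x: 1,
    "sign_rule": lambda x: 1 if x > 0 else 0,
    "always_zero": lambda x: 0,
}

def accuracy(model) -> float:
    """Fraction of the toy data the model labels correctly."""
    return sum(model(x) == y for x, y in DATA) / len(DATA)

def auto_select():
    """Score every candidate and return (name, accuracy) of the best."""
    return max(((name, accuracy(m)) for name, m in CANDIDATES.items()),
               key=lambda item: item[1])
```

The real engineering in these products is in what this sketch hides: feature preparation, hyperparameter search within each family, and honest validation.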
There were no big new improvements in the underlying data science this year, and that’s reflected in a bit of ‘me too’ sameness among analytic platform offerings. Looks to me like a shakeout and consolidation must be on the horizon.
The best description I can offer for where we are this year is that this is now the implementation and utilization phase. Let’s take what we’ve developed over the last three or four years and turn it into profitable products. VCs saw this coming back then, and their startups are now also three or four years old and ready to cash in on what we all hope is an age of commerce enhanced by all aspects of data science.
As for the next shiny new things, we’re all waiting for GANs and reinforcement learning to make big moves, but that may take a year or more.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001. He can be reached at: