In 2013 Thomas Davenport proposed that analytic leaders are leaving Analytics 2.0 and entering the era of Analytics 3.0. To briefly recap, Analytics 1.0 ran from the 1950s (when UPS instituted the first corporate analytics group in the US) through the mid-2000s and the birth of NoSQL and Hadoop. It’s best known for the enterprise data warehouse and backwards-looking historical analysis.
Analytics 2.0 is marked by the rise of Big Data (volume, variety – especially unstructured, velocity – IoT) and faster analytics processing with in-database and in-memory techniques. It’s also marked by the rise of predictive analytics, the data scientist, and the creation of digital products (Google, Amazon, Facebook, etc.). Where in 1.0 most data was internal to the company, in 2.0 the focus expanded dramatically to data derived from external sources such as the internet, sensors, the human genome, and text or videos.
We’ve been living in Analytics 2.0 for the last decade and it’s not a bad place, but Davenport argues that the leaders are breaking out into Analytics 3.0. Who exactly are these leaders? Especially when (frustratingly) the majority of the middle market hasn’t even achieved 2.0.
For starters, we’re not talking about the companies creating our digital products (the same list as above). We are talking about the largest banks, insurance and financial institutions, healthcare organizations, retailers, and manufacturers who are the major adopters of advanced analytics.
What sets 3.0 adopters apart? Davenport lists these characteristics:
- Analytics is integral to running the business and considered a strategic asset.
- Delivery of insights must be very rapid and moves past agile development techniques into new, largely uncharted territory. Insights that took weeks now must be delivered in a day or less.
- Analytical tools are available at the point of decision, especially including analytics embedded in operational systems to create automated prescriptive best outcomes.
- Analytics becomes part of the culture and is embedded in all decision and operational processes.
- Businesses regularly create data-based products and services using their newly capable data gathering and analytic capabilities to bind customers closer to their core businesses.
It’s a fascinating argument. Personally, I spend more time trying to bring the middle market up to 2.0, but if 3.0 is the future then eventually they’ll catch up. When I try to distill the forces at the heart of Davenport’s analysis, I see three things.
- This is not a revolution but an incremental change involving better, faster technology architectures and techniques that are already known and adopted. A hybrid of our current tool set and emerging techniques like streaming analytics.
- The volume of analytics in these leading organizations is growing very rapidly. This means rethinking how we organize internally and how we resolve the data science talent shortage.
- At its core it’s all about speed. This means not only time to action (which we can address by embedding our predictive and recommender algorithms directly in operational systems to be scored on the fly) but also time-to-insight (how long it takes us as data scientists to produce that production-ready insight).
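The “scored on the fly” idea in the last bullet can be sketched in a few lines. This is a minimal illustration, not anyone’s actual deployment pipeline: it assumes a Python operational path and uses scikit-learn and pickle as stand-ins for whatever model artifact your platform exports. The offline half belongs to the data scientist; the online half is the only piece the operational system ever touches.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Offline: the data scientist fits the champion model on historical
# data and serializes it as a deployable artifact (toy data here).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)
artifact = pickle.dumps(model)

# Online: the operational system loads the artifact once at startup
# and scores each incoming record on the fly, with no round trip
# back to the analytics platform.
scorer = pickle.loads(artifact)

def score(record):
    """Return the predicted probability for a single incoming record."""
    return float(scorer.predict_proba([record])[0, 1])

p = score(X[0].tolist())
print(f"score for incoming record: {p:.3f}")
```

The point is the split: model development happens offline at whatever pace, while scoring is a cheap function call inside the transaction path, which is what makes “time to action” measured in milliseconds possible.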
There are lots of interesting angles to this but the one that caught my attention was whether or not our current ‘analytic stack’ is up to the demands of that volume and speed. Let’s break it into four parts:
ETL: Acquisition and integration of data of different varieties and velocities into that ‘big flat file’ our ML techniques require.
Data prep (munging): Where we as data scientists spend 80% of our time cleaning and exploring (less creative) but also transforming and imputing (more creative) so that our ML tools can find the most signal in the data.
Modeling: Not only are there more specific tools than ever (regression, decision trees, SVM, neural nets, random forests, etc.) but when you add in the potential for very large ensembles you get a combinatorial explosion of techniques.
Operationalizing: How do we get the code for that champion model out of our analytics platform into the operational system and make sure it’s appropriately updated?
Let’s take a look at these one by one and try to decide what the minimum capability of each should be to support Analytics 3.0. We’ll even give each a grade to show where we think we stand today versus where we need to be to support 3.0.
ETL: (Grade B+) Here I think we’re in pretty good shape. Your analytics platform (like SAS or SPSS) may have multi-source blending built in, or you may use a frontend like Alteryx to accomplish this. Likewise in streaming data Spark and Storm among others are doing a good job of ingesting the data so that our systems can integrate it with other non-streaming sources.
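To make the blending step concrete, here is a toy sketch in pandas (one of many tools that can do this; the text names SAS, SPSS, and Alteryx). The two sources are invented stand-ins: a customer master table for a warehouse extract and an event table for an already-ingested stream. The goal is the single ‘big flat file’ with one row per entity that downstream ML expects.

```python
import pandas as pd

# Hypothetical source 1: customer master data (a warehouse extract).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Hypothetical source 2: transaction events (e.g. landed from a stream).
events = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 15.0, 5.0, 12.0, 8.0],
})

# Roll the events up to one row per customer...
features = events.groupby("customer_id", as_index=False).agg(
    txn_count=("amount", "size"),
    txn_total=("amount", "sum"),
)

# ...then blend with the master data into the single flat table.
flat = customers.merge(features, on="customer_id", how="left")
print(flat)
```

At production scale this same shape of work happens in the database or the streaming layer rather than in memory, but the logical operation — aggregate, join, flatten — is the same.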
Data Prep (Munging): (Grade D+ / C- but improving) This is an area I’ve always been personally uncomfortable about automating but perhaps I’ll have to change my opinion. There are several standalone platforms that claim to do this and for the last week or so I’ve been beta testing a new one that performs very well.
It will automatically expand alpha categoricals, impute missing data, and even enrich (transform) the features (including GIS and date/time data) so some of the creative work is handled. In the case of this new platform I’m uncomfortable that I can’t see exactly what’s going on or what techniques have been used (for example for imputing missing data) but the accuracy of the models coming out has been surprisingly high.
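The two mechanical prep steps mentioned above — expanding alpha categoricals and imputing missing data — look roughly like this. This is a deliberately simple sketch on invented data using pandas and scikit-learn; it is not what the beta platform actually does under the hood (which, as noted, isn’t visible), and median imputation is only one of many strategies such a tool might choose.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame: one alpha categorical, two numerics with missing values.
df = pd.DataFrame({
    "channel": ["web", "store", "web", "phone"],
    "age": [34.0, np.nan, 51.0, 28.0],
    "income": [52000.0, 61000.0, np.nan, 47000.0],
})

# Expand the alpha categorical into indicator (dummy) columns.
df = pd.get_dummies(df, columns=["channel"])

# Impute missing numerics with the column median (one possible choice).
num_cols = ["age", "income"]
df[num_cols] = SimpleImputer(strategy="median").fit_transform(df[num_cols])
print(df)
```

Seeing the steps spelled out like this is exactly the transparency I want from the automated platforms: which columns were expanded, and which strategy filled which gaps.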
The ‘expert’ platforms like SAS have held back so far from automating this, but clearly they’re trying to make it easier for the expert practitioner. There are also some tools, like Linear Genetic Programs (of which I am a fan), that eliminate the need to transform by trying millions of different alternatives. Unfortunately LGPs aren’t included in any of the major platforms, so the one I use is standalone.
So this is an area with room for improvement. Personally I would always like to be able to see, if not select, exactly what’s going on in this step. And as an absolute requirement, I want to be able to export the prepped data to run on other platforms ‘just to make sure’. However, both these requirements slow down the process and work against time-to-insight, so as these tools improve perhaps I’ll have to give some ground on this.
Modeling: (Grade A-, always room for improvement) All the major analytic platforms allow the data scientist to simultaneously run multiple ML techniques on the same data then display fitness results designed to make selecting the champion algorithm pretty obvious. Given MPP this is now a kind of race to see how many models you can automatically generate in as little time as possible. The system I am beta testing runs a little over 9400 models in two or three minutes (probably less than 30 minutes for really Big Data size problems). Most of this advancement is aimed at creating ensemble models which can improve accuracy. I am a little concerned that some of these ensemble models, particularly decision tree ensembles, create production code that is thousands of lines long. At some point this is going to have an impact on latency in our production systems.
One more thing, though: your analytics platform MUST be able to export code suitable for implementation in your operational system. To reduce latency there’s the Lambda architecture, and we’re just now seeing the first signs of being able to develop totally new predictive models directly on streaming data in real time. We’ll have to wait and see about this last one.
Operationalizing Your Insights: (Grade A-) Remember that we are talking about Analytics 3.0 and large companies. That means they may have deployed hundreds if not thousands of models, each of which needs first to be implemented and then to be monitored, managed, and refreshed. Some of the market leaders like SAS are rising to the occasion and even have a specialized product called the ‘Model Factory’ specifically for this kind of environment. As this volume of analytics becomes more common, more competitors will enter this space. Tight integration between your modeling platform and your operational systems, allowing you to automatically deploy or redeploy a new or existing model, would be a requirement to meet the need for speed.
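The monitor-and-refresh loop at the heart of a model factory can be sketched very simply. Everything here is hypothetical: the threshold, the toy data, and the retrain-in-place policy are illustration only, standing in for whatever governance a real product like SAS’s Model Factory applies across hundreds of models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

REFRESH_THRESHOLD = 0.80  # hypothetical fitness floor for a deployed model

# Stand-in for a model deployed some time ago.
X_old, y_old = make_classification(n_samples=300, n_features=6, random_state=1)
deployed = LogisticRegression(max_iter=1000).fit(X_old, y_old)

# Stand-in for newly labeled production data drifting away from training.
X_new, y_new = make_classification(n_samples=300, n_features=6, random_state=2)

# Monitor: score the model on recent data; refresh if fitness has decayed.
fitness = accuracy_score(y_new, deployed.predict(X_new))
if fitness < REFRESH_THRESHOLD:
    deployed = LogisticRegression(max_iter=1000).fit(X_new, y_new)
    fitness = accuracy_score(y_new, deployed.predict(X_new))
print(f"fitness after check: {fitness:.2f}")
```

Run this check on a schedule for every deployed model and you have the skeleton of the monitor/manage/refresh cycle; the hard parts in practice are labeling the recent data and governing the automatic redeploy.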
A couple of final observations.
Does the market have room for wholly new analytic platforms? Maybe. Some that I’ve observed are trying to be general purpose and face the hurdle of being late starters against well-established competitors. Some are attempting to package content and industry templates to appeal to narrow verticals like insurance or retail. New platforms competing on ease and speed will be well positioned for 3.0. Whatever the case, they must have all four capabilities (ETL, Prep, Model, Deploy) or else be tightly integrated with other market-leading products that do. Otherwise they’ll be stuck in Analytics 2.0.
Are we going to end up sacrificing accuracy for speed? For some time I have been a believer that getting the most accurate model is financially important. Very small increases in model fitness leverage up into much larger increases in the overall campaign. But as models need to be developed, implemented, and refreshed more and more rapidly I am gradually becoming convinced that some types of automation can be successful. The platform that I’ve been beta testing for example produces high accuracy models (the difference between their automated output and my best independent efforts has been small). Moreover it does it in just a few minutes so it meets the speed requirement. Curiously, it’s not possible to export the code. I’ll have to talk to those guys.
Can we automate the data scientist out of the process? I think we can make the data scientists that we have immensely more productive but I don’t believe they can be replaced by the average well-meaning manager or analyst. There are still too many ways to get exactly the wrong answer from even the most sophisticated systems. There is still too much to be gained from the experience, expertise, and particularly creativity with data of a good data scientist.
About the author: Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist and commercial predictive modeler since 2001. He can be reached at: