As a product manager in predictive analytics, I am responsible for building predictive analytics capabilities for consumer-facing and enterprise platforms. The business applications range from item recommendations for consumers, to prediction of event outcomes based on classification models, to demand forecasting for supply optimization. In most of these applications, a predictive model built with machine learning techniques is used to score a new set of data, and that data is most often fed to the model on demand as a batch.
However, the more exciting aspect of my recent work has been in the realm of real-time predictive analytics, where each single observation (raw data point) has to be used to compute a predicted outcome. This is a continuous process: the stream of new observations arrives continuously, and the business decisions based on the predicted outcomes have to be made in real time. A classic use case is credit card fraud detection: when a credit card swipe occurs, all the data relevant to the nature of the transaction is fed to a pre-built predictive model in order to classify whether the transaction is fraudulent and, if so, deny it. All this has to happen in a split second, at scale (millions of transactions each second), in real time. Another exciting use case is preventive maintenance in the Internet of Things (IoT), where continuous streaming data from thousands or millions of smart devices has to be leveraged to predict possible failures in advance and so prevent or reduce downtime.
Let me address some of the common questions that I often receive in the context of real-time predictive analytics.
What exactly is real-time predictive analytics? Does that mean we can build the predictive model in real time? A data scientist requires an aggregated mass of data, which forms the historical basis on which the predictive model can be built. Model building is a deep subject by itself and deserves a separate discussion; the main point to note here is that building a model for better predictive performance involves rigorous experimentation, requires sufficient historical data, and is a time-consuming process. So a predictive model cannot be built in “real-time” in the true sense.
Can the predictive model be updated in real time? Again, model building is an iterative process with rigorous experimentation. So, if the premise is to update the model on each new observation arriving in real time, it is not practical to do so, for more than one reason. First, retraining the model involves feeding it the base data set together with the new observation (either dropping older data points to keep the data set size constant, or dropping nothing and letting the data set grow), and so requires rebuilding the model. There is no practical way of “incrementally updating the model” with each new observation, unless the model is a simple rule-based one. For example: predict “fail” if the observation falls more than two standard deviations from the sample mean. In such a simple model, it is possible to recompute and update the mean and standard deviation of the sample data by including the new observation, even while the outcome for the current observation is being predicted. But for our discussion on predictive analytics here, we are considering more complex machine learning or statistical techniques.
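To make the simple rule-based case concrete, here is a minimal sketch of the two-standard-deviation rule with running statistics. It assumes Welford's online algorithm for updating the mean and variance, and it introduces a hypothetical `min_samples` warm-up so that the rule does not fire before the estimates have settled; neither detail comes from any particular product, and the class and method names are illustrative only.

```python
import math


class RunningTwoSigmaRule:
    """Illustrative rule: 'fail' if an observation lies more than two
    sample standard deviations from the running sample mean."""

    def __init__(self, min_samples=5):
        # min_samples is a hypothetical warm-up threshold (an assumption).
        self.min_samples = min_samples
        self.n = 0        # number of observations seen so far
        self.mean = 0.0   # running sample mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def std(self):
        # Sample standard deviation (n - 1 in the denominator).
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def predict(self, x):
        # Until enough data has accumulated, let everything pass.
        if self.n < self.min_samples:
            return "pass"
        return "fail" if abs(x - self.mean) > 2 * self.std() else "pass"

    def update(self, x):
        # Welford's online update: no need to revisit older observations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def predict_then_update(self, x):
        # Score against the statistics so far, then fold the observation in.
        outcome = self.predict(x)
        self.update(x)
        return outcome


observations = [10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 25.0]
rule = RunningTwoSigmaRule()
outcomes = [rule.predict_then_update(x) for x in observations]
```

Each observation is scored against the statistics accumulated so far and only then folded in, so the update costs constant time per event. That is exactly what makes this kind of rule updatable on every observation, in contrast to the more complex models discussed next.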
Second, even if technology makes it possible to feed a large volume of data, including the new observation, to rebuild the model in a split second each time, there is no tangible benefit in doing so. The model does not change much with just one more data point. To draw an analogy: if you want to measure how much weight you have lost from an intensive running program, common sense says the needle will not move much if you weigh yourself after every mile. You have to accumulate a considerable number of miles before seeing any tangible change in weight! The same is true in data science. Rebuild the model only after aggregating a considerable volume of data, so that the rebuild makes a tangible difference to the model.
(Even recent developments such as Cloudera Oryx, which aim to move beyond Apache Mahout and similar tools that are limited to batch processing for both model building and prediction, focus on real-time prediction while keeping model building batch-based, and rightly so. For example, Oryx has a computational layer and a serving layer: the former periodically builds or updates the model on aggregated data, at a batch level, in the back end, and the latter serves queries to the model in real time via an HTTP REST API.)
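The split between periodic batch rebuilding and per-event serving can be sketched in a few lines. This is only an illustration of the pattern, not Oryx's actual API: the class name, the `rebuild_every` parameter, and the choice of a one-variable least-squares line as the "model" are all assumptions made to keep the example small.

```python
class BatchRebuiltModel:
    """Serves a prediction for every event, but refits only after a
    batch of new observations has accumulated (illustrative only)."""

    def __init__(self, rebuild_every=100):
        self.rebuild_every = rebuild_every  # hypothetical batch size
        self.history = []   # data already used to fit the model
        self.pending = []   # new observations awaiting the next rebuild
        self.slope = 0.0
        self.intercept = 0.0
        self.rebuilds = 0

    def predict(self, x):
        # "Serving" side: answer immediately with the current model.
        return self.slope * x + self.intercept

    def observe(self, x, y):
        # "Computational" side: accumulate, and rebuild only per batch.
        self.pending.append((x, y))
        if len(self.pending) >= self.rebuild_every:
            self.history.extend(self.pending)
            self.pending.clear()
            self._rebuild()

    def _rebuild(self):
        # Ordinary least squares for y = slope * x + intercept.
        n = len(self.history)
        mean_x = sum(x for x, _ in self.history) / n
        mean_y = sum(y for _, y in self.history) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in self.history)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in self.history)
        self.slope = sxy / sxx if sxx else 0.0
        self.intercept = mean_y - self.slope * mean_x
        self.rebuilds += 1


model = BatchRebuiltModel(rebuild_every=100)
for i in range(250):
    x = float(i)
    _ = model.predict(x)           # a prediction is available for every event
    model.observe(x, 2.0 * x + 1)  # the true outcome arrives afterwards
```

With 250 observations and a batch size of 100, the model is refit twice and 50 observations remain pending; every event still receives a prediction from whichever model is current at that moment.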
Then, what is real-time predictive analytics? It is when a predictive model (built or fitted on a set of aggregated data) is deployed to perform run-time prediction on a continuous stream of event data, to enable decision making in real time. There are two aspects to achieving this. First, the predictive model built by a data scientist in a stand-alone tool (R, SAS, SPSS, etc.) has to be exported in a consumable format (PMML is a preferred method across machine learning environments these days; we have done this, and have also used other formats). Second, a streaming operational analytics platform has to consume the model (in PMML or another format) and translate it into the necessary predictive function (via the open-source JPMML or Cascading Pattern, Zementis’ commercially licensed UPPI, or other interfaces), and also feed it the processed streaming event data (via a stream processing component in CEP or similar) to compute the predicted outcome.
This deployment of a complex predictive model, from its parent machine learning environment to an operational analytics environment, is one possible route to achieving continuous run-time prediction on streaming event data in real time.
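The two aspects, exporting a fitted model and consuming it to score a stream, can be sketched as follows. To stay self-contained, the sketch uses a small JSON document as a stand-in for PMML, and a logistic regression whose coefficients are made up for illustration; the field names, the 0.5 threshold, and the function names are all assumptions, not part of any of the tools named above.

```python
import json
import math

# Aspect 1 (offline): the modelling environment exports the fitted model.
# The JSON document stands in for PMML; the coefficients are hypothetical.
exported_model = json.dumps({
    "intercept": -4.0,
    "coefficients": {"amount": 0.002, "foreign": 1.5},
})


def load_scorer(document):
    """Aspect 2a: translate the exported model into a predictive function."""
    spec = json.loads(document)
    intercept = spec["intercept"]
    coefficients = spec["coefficients"]

    def score(event):
        # Logistic regression: probability that the event is fraudulent.
        z = intercept + sum(w * event[name] for name, w in coefficients.items())
        return 1.0 / (1.0 + math.exp(-z))

    return score


def score_stream(events, score, threshold=0.5):
    """Aspect 2b: apply the function to each event as it arrives."""
    for event in events:
        decision = "deny" if score(event) >= threshold else "approve"
        yield event["id"], decision


# A list stands in for the event stream; a real platform would supply
# an unbounded source instead.
stream = [
    {"id": "t1", "amount": 50.0, "foreign": 0},
    {"id": "t2", "amount": 2000.0, "foreign": 1},
]
decisions = list(score_stream(stream, load_scorer(exported_model)))
```

The point of the design is that the scoring side never sees the training data or the modelling tool. It only needs the exported description of the model, which is why a portable format matters.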