“Predictive analytics” is a commonly used term today. Wikipedia describes it as ‘encompassing a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events’. This is a fairly accurate description, and I believe the term is generally well understood. However, if you go a bit deeper and look at the process of building a predictive model, it is not so straightforward. So in this article I attempt to discuss some of the basic principles used in building predictive models. To do so, I am going to pose some questions and answer them myself. Although the discussion is necessarily technical, it is at a rather high level and can benefit even those who are not involved in building models hands-on.
A predictive model uses information from the past, i.e., historical data, to make an inference about the future. An implicit assumption is that historical patterns will repeat in the future. If this assumption is invalid for any reason, the prediction made by the model in question is unlikely to be reliable.
Not necessarily. Even a simple correlation check can help make a predictive inference. Consider two time series, X and Y. If X(t) is highly correlated with Y(t+1), then having information about X at time t implies being able to predict Y at time (t+1) with reasonable accuracy.
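As a minimal sketch of this idea (the synthetic series and the noise level below are my own assumptions, not from the article), the lagged correlation between two series can be checked with NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical series: y lags x by one time step, plus a little noise.
x = rng.normal(size=500)
y = np.roll(x, 1) + 0.1 * rng.normal(size=500)

# Correlate x(t) with y(t+1): align x[:-1] against y[1:].
corr = np.corrcoef(x[:-1], y[1:])[0, 1]
print(round(corr, 2))  # close to 1: knowing x at time t lets us predict y at t+1
```

A high lagged correlation like this is exactly the kind of simple check that supports a predictive inference without any elaborate modeling.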
The key to building a good predictive model is not in using any fancy math but in ensuring that the dependent and independent variables are defined carefully and the fallacy of using future information to make an inference about that same future is avoided. Let me elaborate on this last point with an example. Assume that the available historical data includes customer behavior data, including credit card payment history and responses to a quarterly loan offer from Q1-2012 to Q1-2014, and the objective is to build a model to predict the customer’s response to the loan offer based on past payment history. One possible way of building this model is to use the response to the loan offer in Q1-2014 as the dependent variable and use the payment history from Q1-2012 to Q4-2013 to form the independent variables. It would be incorrect to use payment history from Q1-2012 to Q1-2014 to form the independent variables though, because in that case we would be using the payment behavior in Q1-2014 to “predict” the response to the loan offer in the same timeframe!
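The correct time-windowing can be sketched with pandas. The column names (on_time_pct, responded) and the toy data below are hypothetical; the point is that the features are built strictly from quarters before the target quarter:

```python
import pandas as pd

# Hypothetical history: one row per customer per quarter, with a payment
# metric and the response to that quarter's loan offer.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 2],
    "quarter":     ["2013Q3", "2013Q4", "2014Q1"] * 2,
    "on_time_pct": [0.9, 0.8, 0.7, 0.5, 0.4, 0.6],
    "responded":   [0, 0, 1, 0, 1, 0],
})

# Dependent variable: response in Q1-2014 only.
target = df[df["quarter"] == "2014Q1"][["customer_id", "responded"]]

# Independent variables: payment history strictly BEFORE Q1-2014, so the
# target quarter's behavior never leaks into the features.
features = (df[df["quarter"] < "2014Q1"]
            .groupby("customer_id")["on_time_pct"]
            .mean()
            .rename("avg_on_time_pct")
            .reset_index())

model_data = features.merge(target, on="customer_id")
print(model_data)
```

Dropping the `quarter < "2014Q1"` filter would reproduce the fallacy described above: Q1-2014 behavior would help "predict" the Q1-2014 response.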
There is no clear-cut answer to this question. Although certain business problems are more amenable to being modelled using certain kinds of statistical techniques, typically the efficacy of the model is determined by the data used to build it. A model that has access to a richer data source will generally be more effective. With a given data source, better results may be obtained by being creative about deriving new variables from the available data. As an example, consider the task of modelling the parabola y = x² using OLS regression on (x, y) values. Since the technique used is linear in its parameters, if x is taken as the independent variable, the results won’t be great. But if you define x² as a derived variable and use it as the independent variable for the regression, the model will be a perfect fit!
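Here is a small NumPy sketch of that example, fitting both versions with ordinary least squares via np.linalg.lstsq (the data range is my own choice):

```python
import numpy as np

# The parabola y = x^2, sampled on a symmetric range.
x = np.linspace(-3, 3, 50)
y = x ** 2

# OLS with x as the independent variable: a poor fit, since x and y are
# uncorrelated over a symmetric range.
X1 = np.column_stack([np.ones_like(x), x])
coef1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# OLS with the derived variable x^2: still linear in the parameters,
# but now a perfect fit.
X2 = np.column_stack([np.ones_like(x), x ** 2])
coef2, *_ = np.linalg.lstsq(X2, y, rcond=None)

print(np.allclose(X2 @ coef2, y))  # prints True: derived variable fits exactly
```

The technique never changed; only the definition of the independent variable did, which is the whole point about being creative with derived variables.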
A predictive model is typically built using supervised learning (regression models, decision trees, etc.), but it is possible to use unsupervised learning to make a predictive inference. Clustering, for instance, is an unsupervised technique. Imagine a clustering solution obtained by clustering customer behavior data up to time ‘t’. If you overlay the customers’ response to a certain offer at time (t+1) on the clusters obtained previously and find that there is substantial variation in response rates across clusters, then the clustering solution can be used to make a predictive inference about the response to that particular offer.
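A rough sketch of this overlay check, assuming scikit-learn’s KMeans and a made-up two-group customer population (the group sizes and response probabilities are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical behavior data up to time t: two well-separated customer groups.
low = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
high = rng.normal(loc=3.0, scale=0.3, size=(100, 2))
behavior = np.vstack([low, high])

# Responses observed at time t+1 (assumed: the second group responds far more).
response = np.concatenate([rng.binomial(1, 0.05, 100),
                           rng.binomial(1, 0.60, 100)])

# Cluster on behavior alone -- unsupervised, no response data used.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(behavior)

# Overlay the t+1 responses: a large spread in response rate across clusters
# means the clustering carries predictive signal for this offer.
rates = [response[labels == k].mean() for k in range(2)]
print(sorted(round(r, 2) for r in rates))
```

If the response rates were roughly equal across clusters, the clustering would tell us nothing about this offer; it is the variation across clusters that makes the inference possible.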