“Predictive analytics” is a commonly used term today. Wikipedia describes it as ‘**encompassing a variety of statistical techniques from modeling, machine learning, and data mining that analyze current and historical facts to make predictions about future, or otherwise unknown, events**’. This is a fairly accurate description, and I believe the term is generally well understood. However, if you dig a bit deeper and look at the process of building a predictive model, things are not so straightforward. In this article I discuss some of the basic principles used in building predictive models, by posing some questions and answering them myself. Although the discussion is necessarily technical, it stays at a rather high level and can benefit even those who are not involved in hands-on model building.

**Predictive models are about predicting the future. Why do they need historical data?**

Predictive models use information from the past, i.e., historical data, to make an inference about the future. An implicit assumption is that historical patterns are going to repeat in the future. If this assumption is invalid for any reason, the prediction made by the model in question is unlikely to be reliable.

**Do predictive models always involve complicated math?**

Not necessarily. Even a simple correlation check can help make a predictive inference. Consider two time series, **x** and **y**. If **x**(t) is highly correlated with **y**(t+1), then knowing **x** at time t lets you predict **y** at time (t+1) with reasonable accuracy.
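As a minimal sketch of this correlation check, the snippet below builds two synthetic series in which **y**(t+1) is deliberately driven by **x**(t), then compares the same-time correlation with the one-step-lagged correlation. The data, the helper name `lagged_corr`, and the coefficients are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumption): y at time t+1 is 2 * x at time t, plus small noise.
x = rng.normal(size=n)
y = np.empty(n)
y[0] = 0.0
y[1:] = 2.0 * x[:-1] + 0.1 * rng.normal(size=n - 1)


def lagged_corr(x, y, lag):
    """Pearson correlation between x(t) and y(t + lag), for lag >= 0."""
    if lag == 0:
        return np.corrcoef(x, y)[0, 1]
    # Drop the last `lag` points of x and the first `lag` points of y
    # so that x[t] is paired with y[t + lag].
    return np.corrcoef(x[:-lag], y[lag:])[0, 1]


corr_same = lagged_corr(x, y, 0)  # x(t) vs y(t)
corr_lag1 = lagged_corr(x, y, 1)  # x(t) vs y(t+1)

print(f"same-time correlation: {corr_same:.3f}")
print(f"lag-1 correlation:     {corr_lag1:.3f}")
```

In this constructed example the lag-1 correlation is much stronger than the same-time correlation, which is the signal that **x** carries predictive information about the next value of **y**.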

The key to building a good predictive model is not fancy math. It lies in defining the dependent and independent variables carefully and in avoiding the fallacy of using future information to make an inference about that same future. Let me elaborate on this last point with an example. Assume that the available historical data covers customer behavior, including credit card payment history and the response to a quarterly loan offer, from Q1-2012 to Q1-2014, and that the objective is to build a model that predicts a customer’s response to the loan offer from past payment history. One possible way of building this model is to use the response to the loan offer in Q1-2014 as the dependent variable and the payment history from Q1-2012 to Q4-2013 to form the independent variables. It would be incorrect to use payment history from Q1-2012 to Q1-2014 to form the independent variables, though, because we would then be using the payment behavior in Q1-2014 to “predict” the response to the loan offer in the same timeframe!
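The cutoff described in this example can be sketched as follows. The record layout (late payments per quarter), the feature names, and the helpers `quarter_index` and `build_training_row` are hypothetical; the only point is that features are computed from quarters strictly before the target quarter.

```python
def quarter_index(label):
    """Convert a label such as 'Q1-2012' into a sortable integer."""
    quarter, year = label.split("-")
    return int(year) * 4 + int(quarter[1:]) - 1


def build_training_row(payment_history, responses, target_quarter):
    """Build (features, target) using only quarters before target_quarter."""
    cutoff = quarter_index(target_quarter)
    # Keep only history strictly earlier than the quarter being predicted.
    past = {q: v for q, v in payment_history.items() if quarter_index(q) < cutoff}
    if not past:
        raise ValueError("no payment history before the target quarter")
    latest = max(past, key=quarter_index)
    features = {
        "n_quarters": len(past),
        "late_payments_total": sum(past.values()),
        "late_payments_last_quarter": past[latest],
    }
    return features, responses[target_quarter]


# Hypothetical data for one customer: late payments per quarter, Q1-2012 to Q1-2014.
quarters = [f"Q{q}-{y}" for y in (2012, 2013) for q in (1, 2, 3, 4)] + ["Q1-2014"]
late_payments = dict(zip(quarters, [0, 1, 0, 0, 2, 0, 1, 0, 3]))
responses = {"Q1-2014": 1}  # 1 = accepted the loan offer

features, target = build_training_row(late_payments, responses, "Q1-2014")
print(features, target)
```

The Q1-2014 payment figure is present in the raw data but never reaches the features, so the model cannot lean on information from the period it is supposed to predict.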

**Which statistical techniques give better predictive models?**

There is no clear-cut answer to this question. Although certain business problems are more amenable to certain kinds of statistical techniques, the efficacy of a model is typically determined by the data used to build it. A model that has access to a richer data source will generally be more effective. With a given data source, better results may be obtained by being creative about deriving new variables from the available data. As an example, consider the task of modelling the parabola y = x² using OLS regression on (x, y) values. Since the technique is linear in its parameters, taking x as the independent variable gives a poor fit: a straight line cannot capture the curvature. But if you define x² as a derived variable and use it as the independent variable for regression, the model will be a perfect fit!
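The parabola example can be reproduced with a short sketch using NumPy's least-squares solver. The sample range and the helper name `ols_r2` are assumptions for illustration; the data is noise-free, which is why the derived variable fits exactly.

```python
import numpy as np

# Noise-free samples of y = x^2 on a range symmetric around zero (assumption).
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2


def ols_r2(feature, target):
    """Fit target = b0 + b1 * feature by OLS; return coefficients and R^2."""
    A = np.column_stack([np.ones_like(feature), feature])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((target - target.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot


coef_raw, r2_raw = ols_r2(x, y)      # independent variable: x
coef_sq, r2_sq = ols_r2(x ** 2, y)   # independent variable: derived x^2

print(f"R^2 using x:   {r2_raw:.4f}")
print(f"R^2 using x^2: {r2_sq:.4f}")
```

On this symmetric range the straight-line fit explains essentially none of the variance, while the regression on the derived variable recovers an intercept of about 0 and a slope of about 1. The technique is identical in both runs; only the variable definition changed.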

**Can unsupervised learning be used to yield a predictive model?**

A predictive model is typically built using supervised learning (regression models, decision trees, etc.), but it is possible to use unsupervised learning to make a predictive inference. Clustering algorithms are a common unsupervised technique. Imagine a clustering solution obtained by clustering customer behavior data up to time ‘t’. If you overlay the customers’ response to a certain offer at time (t+1) on the clusters obtained previously and find substantial variation in response rates across clusters, then the clustering solution can be used to make a predictive inference about the response to that particular offer.
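The overlay idea can be sketched with a small hand-written k-means in NumPy. Everything here is an assumption for illustration: the two synthetic behavior features, the response rates baked into the data, and the helper `kmeans` (a plain Lloyd's iteration with farthest-point initialisation, not any particular library's implementation).

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic behavior features up to time t: two well-separated customer groups.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(100, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(100, 2))
X = np.vstack([group_a, group_b])

# Synthetic response at time t+1 (assumption): group A responds ~10%, group B ~60%.
# The clustering step below never sees this array.
response = np.concatenate(
    [rng.random(100) < 0.10, rng.random(100) < 0.60]
).astype(int)


def assign(X, centroids):
    """Index of the nearest centroid for every row of X."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(X, k, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centroids = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return assign(X, centroids), centroids


labels, centroids = kmeans(X, k=2)

# Overlay the later response on the clusters found from earlier behavior.
response_rate = {j: float(response[labels == j].mean()) for j in range(2)}
print(response_rate)
```

If the response rates differ clearly between clusters, as they do in this constructed data, a new customer's cluster membership at time t gives a usable estimate of their likelihood to respond at time (t+1).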

© 2019 Data Science Central
