Any time series classification or regression forecast involves predicting Y at time 't+n' given the X and Y information available up to time 't'. No data scientist or statistician should deploy such a system without back testing and validating the model's performance on history. Using actual future information in the training data, which can be termed *"look-ahead bias"*, is probably the gravest mistake a data scientist can make. Even though the sentence *"we cannot use future data in training"* sounds obvious and simple in theory, anyone can unknowingly introduce look-ahead bias in a complex forecasting problem.

The discussion becomes important when you put so much effort into researching and building a model, only to realize later that the back testing framework was using future data. It costs the data scientist even more when the model has been approved by top management and it turns out at deployment time that the future data it relies on is not available.

In this article, I suggest some simple checkpoints that might help in avoiding look-ahead bias. Not all of the points will be relevant or directly applicable to the problem at hand. In the end, it is better to have bad results from a correct framework than good results from a wrong one.

Suppose Mr. ABC wants to predict the number of hourly transactions 2 hours from now. The client has given him data till 11 AM. The first thing ABC can do is build a simple model using data till 11 AM and see if he can get a forecast for 1 PM. This ensures that only data available till time 't' is used to build the model that forecasts for 't+n'.

This last window is important because it mimics the real-time implementation. The same last-window method can then be applied in a rolling-window fashion for back testing, where 't' is the index at which each window ends.
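The rolling-window idea can be sketched in a few lines of plain Python. This is a minimal illustration, not a full back testing framework: the function names `rolling_backtest` and `mean_of_window`, the window length, and the transaction counts are all made up for the example, and the "model" is just a window average standing in for whatever model is actually being tested.

```python
from statistics import mean

def rolling_backtest(series, window, horizon, fit_predict):
    """Walk forward through `series`, forecasting `horizon` steps ahead.

    At each origin index t the model sees only series[t - window + 1 : t + 1].
    The value at t + horizon is used for scoring only, never for fitting.
    """
    results = []
    for t in range(window - 1, len(series) - horizon):
        history = series[t - window + 1 : t + 1]  # data available up to and including t
        forecast = fit_predict(history, horizon)
        actual = series[t + horizon]
        results.append((t, forecast, actual))
    return results

def mean_of_window(history, horizon):
    # Hypothetical placeholder model: forecast the average of the window.
    return mean(history)

# Made-up hourly transaction counts, oldest first.
hourly_transactions = [120, 135, 128, 150, 162, 158, 171, 180]

backtest = rolling_backtest(
    hourly_transactions, window=3, horizon=2, fit_predict=mean_of_window
)
for t, forecast, actual in backtest:
    print(f"origin index={t} forecast={forecast:.1f} actual={actual}")
```

Because the model is handed only the `history` slice, it has no way to reach past index 't', which is the same constraint the live system will face.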

Wherever possible, the first-cut model should be built on a small sample data set before adding complexity. On the small data set, make sure that no look-ahead bias is being introduced. Once confidence is built, start building the actual, more complex framework.

Consider an example where Mr. ABC is building a model to predict stock movement 5 minutes ahead. That means he would like to know what happens at 10:05 AM based on the information available till 10:00 AM. When he builds the training data at 10:00 AM, the latest label he can compute is the movement from 9:55 AM to 10:00 AM, so the features for that sample can use information only up to 9:55 AM. If Mr. ABC uses data from 9:55 AM to 10:00 AM in the feature computation for that sample, it will lead to look-ahead bias.
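A small sketch of this feature and label alignment is given below. It assumes 5-minute price bars, so a horizon of one bar equals 5 minutes; the function name `build_training_samples`, the choice of price changes as features, and the prices themselves are illustrative assumptions, not part of the original example.

```python
def build_training_samples(prices, now, horizon, lookback):
    """Return (index, features, label) tuples whose labels are known at index `now`.

    Features for sample i use prices up to and including bar i.
    The label compares bar i + horizon with bar i, so the last usable
    sample is i = now - horizon.
    """
    samples = []
    last_feature_idx = now - horizon
    for i in range(lookback, last_feature_idx + 1):
        # Features: the last `lookback` price changes ending at bar i.
        features = [prices[j] - prices[j - 1] for j in range(i - lookback + 1, i + 1)]
        # Label: 1 if the price rose over the next `horizon` bars, else 0.
        label = 1 if prices[i + horizon] > prices[i] else 0
        samples.append((i, features, label))
    return samples

# Made-up 5-minute bars; the latest known bar is 10:00.
times = ["09:40", "09:45", "09:50", "09:55", "10:00"]
prices = [100.0, 100.4, 100.1, 100.6, 100.9]

samples = build_training_samples(prices, now=4, horizon=1, lookback=2)
for i, features, label in samples:
    print(f"features up to {times[i]}, label from {times[i]} to {times[i + 1]}: {label}")
```

The last sample printed has features ending at 09:55 and a label covering 09:55 to 10:00, which matches the split described above.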

Let us take an example of monthly sales prediction using monthly macroeconomic indicators. Mr. ABC is assigned to predict the monthly sales 2 months ahead at the end of each month. The sales numbers for the current month are updated on the last day of each month. Let's assume today is 31st March and ABC wants to predict for June. In this case, if the March value of one of the macroeconomic indicators is published only in the 3rd week of April, then ABC cannot use that same-month value in back testing and will have to apply an extra lag to that variable.
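One way to enforce this in a back test is to store the publication date next to each indicator value and select values by publication date rather than by reference month. The sketch below assumes a hypothetical release calendar; the year, the dates, the values and the function name `latest_known` are invented for illustration.

```python
from datetime import date

# Hypothetical release calendar: (reference month, publication date, value).
indicator_releases = [
    (date(2020, 1, 1), date(2020, 2, 20), 101.2),
    (date(2020, 2, 1), date(2020, 3, 19), 101.9),
    (date(2020, 3, 1), date(2020, 4, 21), 102.4),
]

def latest_known(releases, as_of):
    """Return the most recent release already published on `as_of`, or None."""
    known = [r for r in releases if r[1] <= as_of]
    if not known:
        return None
    return max(known, key=lambda r: r[0])

# On 31st March the March value is not out yet, so February is the latest usable one.
reference_month, published_on, value = latest_known(indicator_releases, date(2020, 3, 31))
print(f"latest usable reference month: {reference_month:%B %Y}")
```

Applying the same lookup at every back test date gives the extra lag automatically, even when the publication delay varies from month to month.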

The cross-validation (CV) framework for time series is different from the CV used in a normal classification framework, where the time stamp of the prediction is not important. Regular CV, which randomizes the data points, lets the model train on observations that come after the ones it is validated on and can induce serious bias.
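A minimal sketch of a time-ordered alternative is an expanding-window split, in which every training index comes before every validation index. The function name `expanding_window_splits` and its parameters are assumptions made for this example; scikit-learn's `TimeSeriesSplit` offers a comparable scheme for those who prefer a library.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs in time order.

    Each test block directly follows its training block, and the
    training block grows by `test_size` from one split to the next.
    """
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train, test in splits:
    print(f"train={train} test={test}")
```

No shuffling takes place anywhere, so a model fitted on `train` never sees an observation that is later than the ones in `test`.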

Putting too much effort into improving validation results in order to decide the feature set is again a type of look-ahead bias. Of course, this problem is very hard to tackle, but some care can be taken when deciding the features. Ideally, the features should be decided first, based on a strong fundamental understanding of the domain and the problem, and only then incorporated in the model.

This also takes care of the over-fitting problem to some extent. The idea is to guess which features might work and then validate them, rather than to justify after the fact why a feature worked based on the results.

The data scientist should have a fair guesstimate of the mean absolute percentage error (MAPE) or accuracy before even starting on the prediction problem, and should then compare the results with that guesstimate. Too much deviation could signal look-ahead bias. Many times even getting a guesstimate can be a tough task. In that case, one should still ask the question: "Are the results too good to trust?"

If the addition of a new feature or features causes a sudden jump in accuracy, one should pause rather than rejoice and check whether the feature is using some future data.
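One simple way to run that check is a truncation test: compute the feature at index 't' on the full series and again on the series cut off at 't', and compare the two values. The sketch below is only an illustration under assumed names (`uses_future_data`, `trailing_mean`, `centered_mean`) and made-up numbers. A feature can also pass this test by coincidence, so it is worth repeating it at several values of 't'.

```python
def uses_future_data(feature_fn, series, t):
    """Return True if the feature at index t changes once data after t is removed."""
    full = feature_fn(series, t)
    truncated = feature_fn(series[: t + 1], t)
    return full != truncated

def trailing_mean(series, t, window=3):
    # Uses only observations up to and including t.
    start = max(0, t - window + 1)
    chunk = series[start : t + 1]
    return sum(chunk) / len(chunk)

def centered_mean(series, t, window=3):
    # Reaches past t, so it leaks future information.
    half = window // 2
    chunk = series[max(0, t - half) : t + half + 1]
    return sum(chunk) / len(chunk)

series = [10, 12, 11, 15, 14, 18]  # made-up values

print("trailing_mean leaks:", uses_future_data(trailing_mean, series, t=3))
print("centered_mean leaks:", uses_future_data(centered_mean, series, t=3))
```

A centered moving average is a common source of this kind of leak, because smoothing functions often default to a window that straddles the current point.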

Given all these points, taking precautions at each step, adding multiple random checks on the processed data, and understanding the meaning of the results at each step will also help in avoiding look-ahead bias.


© 2020 Data Science Central
