Any time series classification or regression forecast involves predicting Y at time 't+n' given the X and Y information available up to time 't'. No data scientist or statistician can deploy such a system without backtesting and validating the model's performance on history. Using actual future information in the training data, which is termed "look-ahead bias", is probably the gravest mistake a data scientist can make. Even though the sentence “we cannot use future data in training” sounds obvious and simple in theory, anyone can unknowingly introduce look-ahead bias in a complex forecasting problem.
The discussion becomes important when you put a lot of effort into researching and building a model, only to realize later that the backtesting framework was using future data. It costs the data scientist even more when the model has been approved by top management and only at deployment time does it become clear that the future data is not available.
In this article, I suggest some simple checkpoints that might help in avoiding look-ahead bias. Not all of the points will be relevant or directly applicable to the problem at hand. In the end, it is better to have bad results from a correct framework than good results from a wrong one.
Suppose Mr. ABC wants to predict the number of hourly transactions 2 hours from now. The client has given him data till 11 AM. The first thing ABC can do is build a simple model using data till 11 AM and check that he can produce a forecast for 1 PM. This ensures that only data available till time 't' is used to build the model that forecasts for 't+n'.
This last window is important because it also mimics the real-time implementation. The same last-window method can then be applied in a rolling-window fashion for backtesting, where 't' is the end of each window.
Wherever possible, build the first-cut model on a small sample data set before adding complexity. On the small data set, make sure that no look-ahead bias is being introduced. Once that confidence is built, start building the actual complex framework.
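The rolling-window idea can be sketched in a few lines of Python. This is only a minimal illustration, not a full backtesting framework: the function name `rolling_backtest`, the sample numbers, and the placeholder "model" (the mean of the last three points) are all assumptions made for the example.

```python
from statistics import mean

def rolling_backtest(series, horizon, min_train):
    """Walk forward through `series`. At each origin t, use only series[:t + 1]
    to forecast the value at t + horizon."""
    results = []
    for t in range(min_train - 1, len(series) - horizon):
        history = series[: t + 1]        # data available up to and including t
        forecast = mean(history[-3:])    # placeholder model: mean of the last 3 points
        actual = series[t + horizon]     # used only for scoring, never for fitting
        results.append((t, forecast, actual))
    return results

# Hypothetical hourly transaction counts
hourly_transactions = [120, 135, 128, 140, 150, 145, 160, 158]
backtest = rolling_backtest(hourly_transactions, horizon=2, min_train=3)
for t, forecast, actual in backtest:
    print(f"origin={t} forecast={forecast:.1f} actual={actual}")
```

The point of the structure is that the slice `series[: t + 1]` is the only thing the model ever sees, so the backtest at each origin is the same computation that would run in production at that hour.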
Consider an example where Mr. ABC is building a model to predict stock movement 5 minutes ahead. That means he would like to know what happens at 10:05 AM based on the information available till 10:00 AM. For the most recent training sample, the features can use information only till 9:55 AM, because the interval from 9:55 AM till 10:00 AM is needed to compute the label. If by any chance Mr. ABC uses data from 9:55 AM till 10:00 AM in the feature computation for that sample, it will lead to look-ahead bias.
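One way to keep features and labels apart is to build every training sample from an explicit decision index, so that the label is the only place where a later price appears. The sketch below assumes 5-minute closing prices; the function `build_samples`, the price values, and the choice of simple price differences as features are hypothetical.

```python
def build_samples(prices, lookback):
    """Return (features, label, decision_index) triples.
    Features use prices up to and including the decision bar;
    the label is the direction of the move over the next bar."""
    samples = []
    for i in range(lookback, len(prices) - 1):
        window = prices[i - lookback : i + 1]                  # nothing after bar i
        features = [window[k + 1] - window[k] for k in range(lookback)]
        label = 1 if prices[i + 1] > prices[i] else 0          # needs bar i + 1
        samples.append((features, label, i))
    return samples

# Hypothetical 5-minute closes at 9:40, 9:45, 9:50, 9:55 and 10:00
prices = [100.0, 100.5, 100.2, 100.8, 101.0]
samples = build_samples(prices, lookback=2)
last_features, last_label, last_index = samples[-1]
print(last_index, last_features, last_label)
```

With data till 10:00 AM, the last sample has decision index 3 (the 9:55 AM bar): its features stop at 9:55 AM and the 10:00 AM price is used only for the label.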
Let us take an example of monthly sales prediction using monthly macroeconomic indicators. Mr. ABC is assigned to predict the monthly sales 2 months ahead at the end of each month. The sales number for the current month is updated on the last day of each month. Let’s assume today is 31st March and ABC wants to predict for June. In this case, if the March value of one of the macroeconomic indicators is published only in the 3rd week of April, then ABC cannot use the same-month value in backtesting and will have to apply an extra lag to that variable.
The cross-validation (CV) framework for time series is different from the CV used in a normal classification framework, where the time stamp of the prediction is not important. Regular CV, which randomizes the data points, can put future observations in the training folds and induce serious bias.
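A simple guard against this is to store the publication date next to each indicator value and filter on it, rather than on the month the value refers to. The sketch below is illustrative only: the record layout, the dates, the values, and the helper `latest_available` are assumptions.

```python
from datetime import date

# Hypothetical indicator records: (reference_month, publication_date, value)
indicator = [
    ("2023-01", date(2023, 2, 21), 1.8),
    ("2023-02", date(2023, 3, 21), 2.1),
    ("2023-03", date(2023, 4, 21), 2.4),
]

def latest_available(records, as_of):
    """Return the most recent record already published on `as_of`, or None."""
    published = [r for r in records if r[1] <= as_of]
    return max(published, key=lambda r: r[1]) if published else None

forecast_date = date(2023, 3, 31)
usable = latest_available(indicator, forecast_date)
print(usable[0], usable[2])
```

On 31st March the March value has not been published yet, so the filter returns the February record, which is exactly the extra lag the backtest has to respect.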
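A time-ordered alternative is an expanding-window split, in which every training index comes before every test index. The generator below is a minimal sketch with hypothetical names and parameters; libraries such as scikit-learn offer comparable functionality (for example `TimeSeriesSplit`), which may be preferable in practice.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) pairs in which the training set
    always ends before the test set begins."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))                        # the past only
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Unlike a shuffled K-fold, no fold here ever trains on an observation that is later than the ones it is evaluated on.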
Putting too much effort into improving validation results in order to decide the feature set is again a type of look-ahead bias. Of course this problem is very hard to tackle, but some care can be taken when deciding the features. In the ideal case, the features should be decided first, based on a strong fundamental understanding of the domain and the problem, and then incorporated in the model.
This also takes care of the over-fitting problem to some extent. The idea is to guess which features might work and then validate them, rather than to justify after the fact why a feature worked based on the results.
The data scientist should have a fair guesstimate of MAPE or accuracy before even starting the prediction problem and should then compare the results with that guesstimate. Too large a deviation could signal look-ahead bias. Many times even getting a guesstimate is a tough task. In that case, one should still ask the question “Are the results too good to trust?”
If the addition of new features causes a sudden jump in accuracy, then one should pause rather than rejoice and check whether the new feature is using some future data.
Given all these points, taking precautions at each step, adding multiple random checks on the processed data, and understanding the meaning of the results at each step would also help in avoiding look-ahead bias.
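The comparison against a guesstimate can be made explicit with a small check. The sketch below computes MAPE and flags a result whose error is far below what was expected; the helper names, the sample numbers, and the threshold `ratio=0.5` are arbitrary assumptions for illustration, not a recommended rule.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent. Assumes no actual is zero."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(errors) / len(errors)

def suspiciously_good(observed_mape, expected_mape, ratio=0.5):
    """Flag results whose error is far below the prior expectation."""
    return observed_mape < ratio * expected_mape

# Hypothetical backtest output
actuals = [100.0, 200.0, 400.0]
forecasts = [99.0, 202.0, 396.0]
observed = mape(actuals, forecasts)

# Suppose the guesstimate before modelling was a MAPE of about 10%
if suspiciously_good(observed, expected_mape=10.0):
    print(f"MAPE {observed:.1f}% is far below the guesstimate; check for future data")
```

A flag like this does not prove that future data leaked in; it only marks the result as one that deserves a second look at the framework.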