Time series forecasting is hardly a new problem in data science and statistics. The term is self-explanatory and has been on business analysts’ agenda for decades now: The very first practices of time series analysis and forecasting trace back to the early 1920s.
The underlying idea of time series forecasting is to look at historical data from the time perspective, define the patterns, and yield short or long-term predictions on how – considering the captured patterns – target variables will change in the future. The use cases for this approach are numerous, ranging from sales and inventory predictions to highly specialized scientific works on bacterial ecosystems.
Although an intern analyst today can work with time series in Excel, the growth of computing power and data tools allows for leveraging time series for much more complex problems than before to achieve higher prediction accuracy.
Many machine learning and data mining tasks operate with datasets that have a single slice of time or don’t consider the time aspect at all. Natural language processing, image or sound recognition, and numerous classification and regression problems can be solved without time variables at all. For example, the sound recognition solution that we worked with entailed capturing specific teeth grinding sounds of patients as they slept. So, we weren’t interested in how these sounds change over time, but rather how to distinguish them from ambient sounds.
Time series problems, on the other hand, are always time-dependent and we usually look at four main components: seasonality, trends, cycles, and irregular components.
The graph above is a clear example of how trends and seasons work.
Trends. The trend component describes how the variable – drug sales in this case – changes over long periods of time. We see that the sales revenues of antidiabetic drugs have substantially increased during the period from the 1990s to 2010s.
Seasons. The seasonal component showcases each year’s wave-like changes in sales patterns. Sales were increasing and decreasing seasonally. Seasonal series can be tied to any time measurement. We can consider monthly or quarterly patterns for sales in midsize or small eCommerce, or track microinteractions across a day.
Cycles. Cycles are long-term patterns that have a waveform and recurring nature similar to seasonal patterns but with variable length. For example, business cycles have recognizable elements of growth, recession, and recovery. But the cycles themselves stretch in time differently for a given country throughout its history.
Irregularities. Irregular components appear due to unexpected events, like cataclysms, or are simply representative of noise in the data.
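To make the components above more tangible, here is a minimal sketch of a classical additive decomposition in plain Python. It assumes the series can be written as trend + seasonal + irregular and that the seasonal period is known in advance; the function name `decompose_additive` is ours and purely illustrative. Cycles are not separated here: in this simple scheme they end up inside the trend estimate.

```python
from statistics import mean

def decompose_additive(series, period):
    """Split a series into trend, seasonal, and irregular parts (additive model)."""
    n = len(series)
    half = period // 2

    # Trend: centered moving average spanning one full seasonal period.
    # The first and last `half` points have no full window and stay None.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: give the two end points half weight to stay centered.
            trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
        else:
            trend[t] = mean(window)

    # Seasonal: average detrended value for each position within the period,
    # shifted so that the seasonal effects sum to zero.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    raw = [mean(bucket) for bucket in buckets]
    offset = mean(raw)
    pattern = [value - offset for value in raw]
    seasonal = [pattern[t % period] for t in range(n)]

    # Irregular: whatever is left after removing trend and seasonality.
    irregular = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, irregular
```

For quarterly data we would call `decompose_additive(sales, period=4)`; for monthly data, `period=12`. Large values in the irregular part point to the unexpected events or noise described above.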
Today, time series problems are usually solved by conventional statistical methods (e.g. ARIMA) and machine learning methods, including artificial neural networks (ANN), support vector machines (SVM), and some others. While these approaches have proved their efficiency, the tasks, their scope, and our ability to solve them keep changing, and the set of use cases for time series has the potential to expand. As statistics enters the era of big data processing, social media analysis, and the Internet of Things with its countless trackable devices, analysts look for new approaches to handle this data and convert it into predictions.
So, let’s survey the main things that are happening in the field.
“Prediction is very difficult, especially if it’s about the future.”
Niels Bohr, Nobel laureate in Physics
Traditional forecasting methods strive to bring stationarity into time series, i.e. to make statistical properties such as the mean and variance stay constant over time. Raw data is rarely stationary enough to yield confident predictions. For instance, we must apply multiple mathematical transformations to the antidiabetic drug sales series above to render the non-stationary time series at least approximately stationary. Then we’ll be able to find patterns and make predictions that are more accurate than a coin toss, which is right in 50 percent of cases.
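The article doesn't name the transformations, so the sketch below assumes three common ones: a log transform to stabilize growing variance, a seasonal difference to remove the repeating pattern, and a first difference to remove the remaining trend. The helper names `difference` and `stabilize` are ours, and the right mix of steps depends on the data at hand.

```python
import math

def difference(series, lag=1):
    """Subtract from each value the value observed `lag` steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def stabilize(series, period):
    """Log transform, then seasonal and first differencing (assumed recipe)."""
    # The log turns multiplicative growth and seasonality into additive ones.
    # It requires strictly positive values, e.g. sales figures.
    logged = [math.log(value) for value in series]
    # Seasonal difference: compare each point with the same season one period ago.
    deseasonalized = difference(logged, lag=period)
    # First difference: remove whatever trend is left.
    return difference(deseasonalized, lag=1)
```

The output is shorter than the input by `period + 1` points, and forecasts made on the transformed scale have to be converted back by reversing the steps in the opposite order.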
But time series in some fields resist these efforts because too many irregular factors drive the changes. Look at travel disruptions, especially those that happen during political unrest or under the threat of terrorism. Traveler streams change, destinations change, and airlines adjust their prices differently, making year-old observations nearly obsolete. Crude oil prices are another example: they are critical to predict for players across many industries, yet so far they haven’t allowed us to build time series algorithms that are precise enough.
The traditional machine learning approach is to split an available historic dataset into two or three smaller sets to train a model and to further validate its performance against data that a machine hasn’t seen before. If we apply machine learning without the time series factor, a data scientist can choose the most relevant records from the available data and fit the model to them, leaving noisy and inconsistent records behind.
In time series, the main difference is that a data scientist needs to use a validation set that immediately follows the training set on the time axis to see whether the trained model is good enough. The problem with non-stationary records is that data in the training set might not be homogeneous with the validation set, as time series properties can change substantially over the period that the training and validation sets cover.
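A time-ordered split can be sketched in a few lines. The first helper cuts the data once; the second repeats the idea with an expanding training window (often called walk-forward validation), which is our addition rather than something the text prescribes. Both function names are illustrative, and records are assumed to be sorted from oldest to newest.

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records so validation strictly follows training."""
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]

def walk_forward_splits(n_records, initial_train, horizon):
    """Yield (train_indices, validation_indices) with an expanding train window.

    Each validation block starts right where its training block ends,
    so the model is never scored on data older than what it was fitted on.
    """
    start = initial_train
    while start + horizon <= n_records:
        yield range(0, start), range(start, start + horizon)
        start += horizon
```

Unlike a random shuffle, neither helper lets future observations leak into training. Comparing scores across the walk-forward folds is also a cheap way to notice that the series properties are drifting.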
Here’s when we can use the stream learning technique. Stream learning suggests incremental changes to the algorithm – basically, its re-training. As a new record or a small set of them comes in, it updates the model instead of processing a whole set of data. This approach requires the understanding of two main things:
Data Horizon. How many new training instances are needed to update the model? For example, Shuang Gao and Yalin Lei from the China University of Geosciences recently applied stream learning to increase prediction accuracy in such non-stationary time series as crude oil prices mentioned above. They’ve set the data horizon as small as possible so that every update on the oil prices immediately updates the algorithm.
Data Obsolescence. How long does it take to start considering historical data or some of its elements irrelevant? The answer to this question may be quite tricky as it requires a share of assumptions based on domain expertise, basically, an understanding of how the market you work with changes and how many non-stationary factors bombard it. If your eCommerce business has significantly grown since last year both in terms of customer base and product variety, the data of the same quarter of the previous year may be considered obsolete. On the other hand, if the country experiences economic recession the new short-term data may be less enlightening than that of the previous recession.
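The two settings above can be illustrated with a small online model. This is a minimal sketch, not the method from the crude oil study: we assume a one-step autoregressive forecast (next value = intercept + slope × latest value) fitted by exponentially weighted least squares. The class name `StreamingAR1` and its parameters are hypothetical. Here `data_horizon` is the number of new records buffered before the model is updated, and `forgetting` (between 0 and 1) controls obsolescence by shrinking the weight of older records at every step.

```python
class StreamingAR1:
    """Online fit of y[t] ~ intercept + slope * y[t-1] with forgetting."""

    def __init__(self, data_horizon=1, forgetting=0.95):
        self.data_horizon = data_horizon  # new records needed to trigger an update
        self.forgetting = forgetting      # weight multiplier applied to old records
        self.buffer = []
        self.prev = None                  # last value already folded into the model
        self.latest = None                # most recent value seen, used for predicting
        self.sw = self.sx = self.sy = self.sxx = self.sxy = 0.0
        self.intercept, self.slope = 0.0, 1.0  # start as a "no change" forecast

    def observe(self, value):
        """Record one observation; update the model once the horizon is reached."""
        self.latest = value
        self.buffer.append(value)
        if len(self.buffer) >= self.data_horizon:
            self._update(self.buffer)
            self.buffer = []

    def _update(self, batch):
        f = self.forgetting
        for y in batch:
            if self.prev is not None:
                x = self.prev
                # Decay the running sums, then add the new (x, y) pair.
                self.sw = f * self.sw + 1.0
                self.sx = f * self.sx + x
                self.sy = f * self.sy + y
                self.sxx = f * self.sxx + x * x
                self.sxy = f * self.sxy + x * y
            self.prev = y
        # Closed-form weighted least squares from the running sums.
        denom = self.sw * self.sxx - self.sx ** 2
        if abs(denom) > 1e-12:
            self.slope = (self.sw * self.sxy - self.sx * self.sy) / denom
            self.intercept = (self.sy - self.slope * self.sx) / self.sw

    def predict(self):
        """Forecast the next value from the most recent observation."""
        if self.latest is None:
            raise ValueError("no observations yet")
        return self.intercept + self.slope * self.latest
```

With `data_horizon=1`, every new price immediately updates the model, which mirrors the smallest possible horizon mentioned above. A lower `forgetting` value makes history go stale faster; choosing it is the domain-expertise question the obsolescence paragraph describes.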
While crude oil forecasts based on stream learning eventually perform better than conventional methods, their results are still only slightly better than a coin flip and stay in the ballpark of 60 percent accuracy. They are also more complicated to develop and deploy, and they require prior business analysis to figure out the data horizon and obsolescence.
Another way to cope with non-stationarity is ensemble models. Ensembling runs multiple machine learning and data mining methods and then combines their results to increase predictive accuracy. The technique isn’t new in data science, but it matters a great deal for business decisions related to data science initiatives.
Basically, building robust forecasting is expensive and time-consuming, and it doesn’t come down to building and validating one or two models and then choosing the best performer. In time series, non-stationary components – like different durations of cycles, low weather predictability, and other irregular events that have an impact across multiple industries – make things even harder.
This was the problem for the Google team that was building time series forecasting infrastructure to analyze the business dynamics of their search engine and YouTube, and then to disaggregate these forecasts by region and by short time periods like days and weeks. With Google engineers recently disclosing their approach, it became clear that even the Mount Olympus of AI-driven technologies chooses simpler methods over complex ones. They don’t use stream learning yet and settle for ensemble methods. But their main point is that you need as many methods as possible to get the best results:
“So, what models do we include in our ensemble? Pretty much any reasonable model we can get our hands on! Specific models include variants on many well-known approaches, such as the Bass Diffusion Model, the Theta Model, Logistic models, bsts, STL, Holt-Winters and other Exponential Smoothing models, Seasonal and other ARIMA-based models, Year-over-Year growth models, custom models, and more,” Eric Tassone and Farzan Rohani say.
By averaging the forecast of many models that perform differently in different time series situations, they achieved better predictability than they could with a single model. While some models work better with their specific non-stationary data, others shine in theirs. The average that they yield acts like an expert opinion and turns out to be very precise.
Source: Our quest for robust time series, Eric Tassone and Farzan Rohani, 2017, Forecast procedure in Google
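The averaging idea itself is easy to sketch. The three forecasters below (last value, seasonal repeat, and straight-line drift) are deliberately simple stand-ins that we picked for illustration; they are not the models in Google's ensemble, and the plain mean is an assumption about how the forecasts are combined.

```python
from statistics import mean

def naive(history, steps):
    """Repeat the last observed value."""
    return [history[-1]] * steps

def seasonal_naive(history, steps, period):
    """Repeat the values observed one seasonal period earlier."""
    return [history[-period + (h % period)] for h in range(steps)]

def drift(history, steps):
    """Extend the straight line drawn from the first to the last observation."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * (h + 1) for h in range(steps)]

def ensemble_forecast(history, steps, models):
    """Average the forecasts of all models, one step at a time."""
    forecasts = [model(history, steps) for model in models]
    return [mean(step_values) for step_values in zip(*forecasts)]
```

A call could look like `ensemble_forecast(sales, 4, [naive, drift, lambda h, s: seasonal_naive(h, s, 12)])`. Adding another model to the list is all it takes to grow the ensemble, which fits the "any reasonable model we can get our hands on" attitude quoted above.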
However, the authors of the post note that this approach may be best suited to their specific situation. Google services stretch across many countries where different factors like electricity, internet speed, and user working cycles add too many non-stationary patterns. So, if you aren’t operating with a multitude of locations or a large set of varying data sources, ensemble models may not be for you. But if you track time series patterns across countries or business units in different regions, it might be the best fit.