Sales prediction is an important part of modern business intelligence. First approaches one can apply to predict sales time series are such conventional methods of forecasting as ARIMA and Holt-Winters. But there are several challenges while using these methods. They are: multilevel daily/weekly/monthly/yearly seasonality, many exogenous factors which impact sales, complex trends in different time periods. In such cases, it is not easy to apply conventional methods. Of course, there is the implementation of SARIMA for time series with seasonality, and SARIMAX for time series with seasonality and exogenous factors. But in these implementations, we can deal with simple seasonality with one time period and exogenous factors which are treated as covariates for residuals. Also, in these implementations, it is not easy to apply different types of regularizations to avoid overfitting or take into account feature interaction. The other main problem is that in some real cases of store sales we do not have an efficient number of historical time series values, e.g. in case when a store has been opened recently or has been acquired recently by a store chain. The lack of historical sales values can appear in case when a new product has been launched recently. How can we predict sales in such cases? So, in general, sales prediction can be a hard complex problem. To get insights and to find new approaches, some companies propose such type of problems for data science competitions, e.g. at Kaggle. The company Grupo Bimbo organized Kaggle competition Grupo Bimbo Inventory Demand. In this competition, Grupo Bimbo invited Kagglers to develop a model to forecast accurately the inventory demand based on historical sales data. I had a pleasure to be a teammate of a great team “The Slippery Appraisals” which won this competition among nearly two thousand teams. We proposed the best scored solution for sales prediction in more than 800,000 stores for more than 1000 products. Our first place solution is **here** . The simple R script which based on the Xboost classifier can be found **here** . To built our final multilevel model, we exploited AWS server with 128 cores and 2Tb RAM. For our solution, we used a multilevel model, which consists of three levels and has the following structure:

We built a lot of models on the 1st level. The training method of most 1st level models was XGBoost. On the second level, we used a stacking approach when the results from the first level classifiers were treated as the features for the classifiers on the second level. For the second level, we used ExtraTrees classifier, the linear model from Python scikit-learn and Neural Networks. On the third level, we applied a weighted average to the second level results. The most important features are based on the lags of the target variable grouped by factors and their combinations, aggregated features (min, max, mean, sum) of target variable grouped by factors and their combinations, frequency features of factors variables. Our winner solution may seem to be too complicated, but our goal was to win the competition and even a small improvement in forecasting score required essential numbers of machine learning models in the final ensemble. Real business cases with a sufficient accuracy can be simpler.

We are going to consider several simple cases of approaches in the sales times series forecasting. For our analysis, we used store sales historical data from Kaggle competition “Rossmann Store Sales”. These data represent the sales time series of Rossmann stores. For example, sales time series in stores chain are shown on the figure:

Before building of the predictive model, we can explore the data and conduct a descriptive analytics. The next figures show different distributions and dependencies of sales data features:

This descriptive analytics plots can give us some insights how to build predictive models, what features to select and how to construct new features to get more accurate prediction. The next figure shows a typical sales time series for an arbitrary store:

To compare different forecasting approaches, we used two last months of the historical data as the validation data for accuracy scoring using root mean squared error (RMSE). We consider sales in the natural logarithmic scale. The next figure shows forecasting results and RMSE scores for different approaches applied to time series under consideration:

Let us consider a linear regression model. As the regularization, we used Lasso. The next figures show the time series with forecasting results and top coefficients:

The lower and upper bounds were calculated on validation sets. Lasso regularization allows us to avoid overfitting and select the most important features.

Having different predictive models with different sets of features, it is useful to combine all these results into one. There are two main approaches for such a purpose – bagging and stacking. Bagging is a simple approach when we make weighted blending of different model predictions. Such models use different types of classifiers with different sets of features and meta parameters, then forecasting errors will have a weak correlation and they will compensate each other under the weighted blending. The less is the error correlation of model results, the more precise forecasting result we will receive. Let us consider the stacking technic for building ensemble of predictive models. In such an approach the results of predictions on the validation set are treated as input regressors for the next level models. As the next level model, we can consider a linear model or another type of a classifier, e.g. Random Forest classifier or Neural Network. The target values of a validation set on this level are treated as a target variable for classifier. It is important to mention that in case of time series prediction, we cannot use conventional cross validation approach, we have to split a historical data set on the training set and validation set by using time splitting, so the training data will lie in the first time period and the validation set – in the next one. The next figure shows the time series forecasting on the validation sets obtained using different models:

Predictions on the validation sets are treated as regressors for the linear model with Lasso regularization. The next figure shows the results obtained on the second-level linear regularized model:

Only two models from the first level (gradientBoosting and ExtraTree) have non zero coefficients for their results. For other cases of sales datasets, the results can be different when the other models can play more essential role in the forecasting.

Let us consider the case when we do not have enough historical sales values for some store or some product, e.g. it should be a new store or recently launched product. The main idea of the approach in this case is that similar stores or products will have similar sales behavior. To get good results, it is important to construct an appropriate feature set. So, we need to find or construct such features which will characterize stores and products from different points of view. As a result, if we have a new store or product but it is similar to some items in the historical data, then the classifier will be able to predict the sales forecasting for this store or product.

Using validation set, we can calculate the probabilistic density distribution for sales in case, when we need to aggregate sales for some future time period in order to know how much products we can sell during this period. The next figure shows the distribution of aggregated sales:

When we need to take into account the non-Gaussian type of sales distribution, we can exploit the Bayesian inference approach using stochastic sampling algorithms such as MCMC. Having the distribution of aggregated sales, we can make risk assessment by calculating value at risk which can be regarded as 5% quantile. Sales distribution and risk estimations enable us to work out the stochastic optimization of business processes related to products sales, supplying and logistic. For this thing we need to create an objective function which includes sales forecasting and business expenses. For such type of optimization, the mixed integer mathematic programing algorithms can be used.

If we expect that some product should be sold for some time period, then having the distribution of aggregated forecasted sales and marginal price of product, we can find the optimal amount of the product which should be supplied for this time period. We also need to compose a loss function which will describe the losses in case when the product has not been sold for this time period. In the simple case of perishable products, we can assume that after this period the shelf time of the product is expired and unsold products are lost. Under this assumption, we can receive the profit versus the amount of supplied products:

The maximum of this curve defines the optimal amount of products which should be supplied.

Business processes are constantly evolving and some new patterns of these processes may not be presented in the historical data. To get more precise forecasting, it can be corrected by experts’ opinions, e.g. experts can give their predictions about sales trends. It should be pessimistic, realistic and optimistic corrections with appropriate probabilities. The next figures show the probability distribution function for aggregated forecasted sales with experts’ correction and dependence of profit vs supply amount of products:

Let us consider the case when we have sales of the product in different stores with different prices. We also have prices of some supplementary product and the price of a similar product from the competitors. The sales of this product are being conducted under two different types of advertising. The next figure shows the pair plots of sales features:

For linear regression model we received the following coefficients for sales features:

One can see that the rise of product price decreases sales and the rise of competitors’ product price increases sales. To describe these effects, we can consider the price elasticity of demand and cross elasticity of demand according to competitors’ prices. The next figure shows the dependence of relative price to relative product sales:

The next figure shows the dependence of relative competitor’s product price to relative product sales:

Let us consider the price optimization. Having marginal product price, we can calculate the profit of a product. On the next figures, the joint distribution of the amount of sold products and price is shown:

The next figure shows the joint distribution of profit and price:

Comparing these two figures, one can see that the largest product sales can give the smallest profit. To find an optimal price, we need to study the dependence of profit from the price. In the simplest way, we can receive the profit vs price dependence for this dataset, shown on the figure:

The maximum of the curve corresponds to an optimal price. Under more advanced analysis, the optimal price of the product can be different for different conditions, e.g. for different types of stores.

Let us consider the descriptive analytics and linear model with Lasso regularization for well-known dataset for Orange Juice sales. The distributions and dependencies of sold units amount and prices are shown on the next figures:

The heatmap shows the level of correlation between different features:

For the study of exogenous features impact, the linear model with Lasso regularizatiojn has been built. The next figure shows the dependence RMSE from Lasso regularization parameter on the validation set:

The next figure shows the coefficients of the features in the linear model:

The coefficients of the linear model with feature interaction terms are shown on the following figure:

Sometimes we need to forecast not only more probable values of sales but also their distribution. Especially we need it in the risk analysis for assessing different risks related to sales dynamics. To find distributions of model parameters, a Bayesian inference approach can be used. Let us consider the Bayesian approach for sales forecasting. The Bayesian inference is a suitable approach in cases when we have a small number of the historical data. It allows us to combine the historical data and experts’ opinions which can be expressed via prior distribution of parameters. For Bayesian inference case study, we take such features as promo, seasonality factors (week day, month day, month of year). For Bayesian inference one can use JAGS an STAN sampling software. The next figure shows the boxplots for forecasted sale distributions (blue points denote historical sales):

The next figure shows the density of distributions of some regression coefficients for a chosen arbitrary store:

Sales time series can have outliers and it is important to take into account this fact using heavy tails distributions instead of Gaussian distribution. To define the optimal amount of product which should be supplied for sales in the store during a month, we need to define income, losses and profit. In this case study, we consider the profit as the following :

Profit = SoldAmount*(Price – MarginalPrice) – UnsoldAmount* MarginalPrice

The following figure shows the calculated distribution of profit for different amount of products which were supplied to the store:

The next figure shows the dependence of the profit on the amount of supplied products:

The maximum of the curve corresponds to optimal supply amount. As the case study shows, the use of Bayesian approach allows us to model stochastic dependencies between different factors of sales time series and receive the distributions for model parameters. Having these distributions, we can calculate the optimal amount of products which should be sold for a defined time period. Such an approach allows us to avoid large losses caused by unsold perishable products when the extra charge is small.

In our case studies, we showed different modern approaches for sales predictive analytics. The keystone of our approach is the historical data structure. Relying on it, we can select and construct new features, choose different technics and methods for the analysis. Creating ensembles of different models can give us better accuracy in comparison with single algorithms. The probabilistic approach for time series modeling is important in the risk assessment problems.