Subscribe to DSC Newsletter

Contributed by David Comfort and Paul. They took NYC Data Science Academy 12 week full time Data Science Bootcamp program between Sept 23 to Dec 18, 2015. The post was based on their fourth class project(due at 8th week of the program).


As part of a Kaggle competition, we were challenged by Rossmann, the second largest chain of German drug stores, to predict the daily sales for 6 weeks into the future for more than 1,000 stores.  Exploratory data analysis revealed several novel features, including spikes in sales prior to, and preceding store refurbishment. We also engineered several novel features by the inclusion of external data including Google Trends, macroeconomic data, as well as weather data. We then used H20, a fast, scalable parallel-processing engine for machine learning, to build predictive models utilizing random forests, gradient boosting machines, as well as deep learning. Lastly, we combined these models using different ensemble methods to obtain better predictive performance. Training data was provided for 1,115 Rossmann stores from January 1st 2013 through July 31st 2015 .The task was to forecast 6 weeks (August 1st 2015 through September 17th 2015) of sales for 856 of the Rossmann stores identified within the testing data.

Data Sets

  • TRAIN.CSV - historical data including sales
  • TEST.CSV - historical data excluding sales
  • SAMPLE_SUBMISSION.CSV - sample submission file in the correct format
  • STORE.CSV - supplemental information describing each of the stores

Data Fields

  • Id - represents a (Store, Date) duple within the test set
  • Store - Unique Id for each store
  • Sales - The turnover for any given day (variable to be predicted)
  • Customers - The number of customers on a given day
  • Open - An indicator for whether the store was open: 0 = closed, 1 = open
  • StateHoliday - Indicates a state holiday
    • Normally all stores, with few exceptions, are closed on state holidays.
    • All schools are closed on public holidays and weekends.
    • a = public holiday
    • b = Easter holiday
    • c = Christmas
    • 0 = None
  • SchoolHoliday - Indicates if the (Store, Date) was affected by the closure of public schools
  • StoreType - Differentiates between 4 different store models:
    • a, b, c, d
  • Assortment - Describes an assortment level:
    • a = basic, b = extra, c = extended
  • CompetitionDistance - Distance in meters to the nearest competitor store
  • CompetitionOpenSince [Month/Year] - Approximate year and month of the time the nearest competitor was opened
  • Promo - Indicates whether a store is running a promo on that day
  • Promo2 - Continuing and consecutive promotion for some stores:
    • 0 = store is not participating
    • 1 = store is participating
  • Promo2Since [Year/Week] - Describes the year and calendar week when the store started participating in Promo2
  • PromoInterval - Describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew.
    • “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store

Exploratory Data Analysis

Exploratory data analysis was performed in ipython notebook and R. Findings discovered throughout the EDA process were all addressed during data cleaning and feature engineering. Courtesy of fellow Kaggler Paul Shearer for creating a fantastic dygraph to display all data. Picture1

Data Cleaning

  1. Impute Open = 1 for missing Open test dataset
    • Special case found during EDA. Store 622 has several missing dates which are all weekdays with sales recorded.
  2. Set Open = 0 when Sales = 0 OR Customers = 0
  3. Standardize StateHoliday due to the use of character 0 and integer 0
  4. Separate date column into year, month and day. Also convert Date column to type ‘date’ and extract:
    • day_of_year
    • week_of_year
    • quarter
    • month_start
    • month_end
    • quarter_start
    • quarter_end
  5. Remove observations where stores are closed. These values can be hard coded after prediction since no sales can occur with a closed store.
  6. Set store as factor.
  7. Merge store dataset with train and test
  8. Stores.csv contained an abundance of missing values. Machine learning methods were chosen with this in mind. The following methods that were implemented handle missing values with ease. Their methods are described below.
    • Distributed Random Forest and Gradient Boosting Machine treat missing (NA) factor levels as the smallest value present (left-most in the bins), which can go left or right for any split, and unseen factor levels (the case here) to always go left in any split.
    • Deep Learning by default makes an extra input neuron for missing and unseen categorical levels which can remain untrained if there were no such instances in the training data, leading to some random contribution to the next layer during testing.
  9. Experimenting with variables as factors:
    • We experimented with setting features to factors in order to see their effects upon the MSE and residual errors. We should note that H20 can deal with large numbers of factors and the categorical data does not need to be one-hot encoded. H2O does not expand categorical data into dummy variables, but instead uses a bitset to determine which categorical levels go left or right on each split.
    • H2O is also more accurate than R/Python. I think the reason for that is dealing properly with the categorical variables, i.e. internally in the algo rather than working from a previously 1-hot encoded dataset where the link between the dummies belonging to the same original variable is lost. This is by the way how the R package should work if the number of categories is small (but not in our case).
      • Train a model on data that has a categorical predictor (column) with levels B,C,D (and no other levels). Let’s call these levels the “training set domain”: {B,C,D}
      • During scoring, a test set has only rows with levels A,C,E for that column, the “test set domain”: {A,C,E}
      • For scoring, we construct the joint “scoring domain”: {B,C,D,A,E}, which is the training domain with the extra test set domain entries appended.
      • Each model can handle these extra levels {A,E} during scoring separately.The way H2O deals with categories is not only more proper and gets better AUC, but it is makes it faster and more memory efficient. See Categorical variables with random forest for more information.In addition, most machine learning tools break when you try to predict with a new level for a categorical input that was not present in the training set. However, H2O is able to handle such a situation. Here’s an example of how this works:
    • See prediction with categorical variable with a new level for more information.
  10. We use a log transformation for the sales in order not to be as sensitive to high sales. A decent rule of thumb is if the data spans an order of magnitude, then consider using a log transform.

Feature Engineering

  1. Promotion First Day and Promotion Second Day
  2. DayBeforeClosed
  3. DayAfterClosed
  4. Store Open on Sunday
  5. Closed
  6. Refurbishment period
    • EDA lead to a discovering a spike in sales before and after a period of extended close indicating a clearance and grand opening sale
    • One thing to note was that Some stores in the dataset were temporarily closed for refurbishment.
  7. DaysBeforeRefurb
  8. DaysAfterRefurb
  9. Competition Feature
    • We create this new feature by taking the square root of the difference between the maximum distance of a competitor store among all the stores and the distance of a competitor store for an individual store, times the time since a competitor opened:
# Competition Featuretrain$Competition <- 
  (sqrt(max(train$CompetitionDistance, na.rm = TRUE) - 
train$CompetitionDistance)) *
(((train$year - train$CompetitionOpenSinceYear) * 12) -

test$Competition <-
(sqrt(max(test$CompetitionDistance, na.rm = TRUE) -
(((test$year - test$CompetitionOpenSinceYear) * 12) -

Open Data Sources

  • German States derived from StateHoliday
  • German State Weather
  • Google Trends
  • Export into CSV

Introduction to H20

H2O is an open source math & machine learning engine for big data that brings distribution and parallelism to powerful algorithms while keeping the widely used languages such as R, Spark, Python, Java, and JSON as an API. Using in-memory compression, H2O handles billions of data rows in-memory, even with a small cluster. H2O includes many common machine learning algorithms, such as generalized linear modeling (linear regression, logistic regression, etc.), Na ̈ıve Bayes, principal components analysis, k-means clustering, and others. H2O also implements best-in-class algorithms at scale, such as distributed random forest, gradient boosting, and deep learning.

H2O Process

  • Have correct version of Java installed
  • Start H20 instance
  • Set features to factors (and test)
  • Create validation set
  • Tune Parameters for each model (manual or grid search)
  • Use model to make predictions on test set
  • Iterate

Load Libraries into R


Initialize H2O Cluster With All Available Threads

One should use h2o.shutdown() if changing parameters below.  Also, setting assertion = FALSE seems to help with stability of H20.
h2o.init(nthreads=-1,max_mem_size='8G', assertion = FALSE)## Successfully connected to 
## R is connected to the H2O cluster:
## H2O cluster uptime: 16 hours 46 minutes
## H2O cluster version:
## H2O cluster name: H2O_started_from_R_2015_ukz280
## H2O cluster total nodes: 1
## H2O cluster total memory: 7.98 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## IP Address:
## Port : 54321
## Session ID: _sid_ac1406fd65438164da7936d76cfe44b2
## Key Count : 0

Read Test and Train Data

Data.table was used as a means of reading in data due to its increased efficiency for data frame manipulation over dplyr.
train <- fread("KaggleProject/data/train_states_R_v8.csv",                stringsAsFactors = T)
test  <- fread("KaggleProject/data/test_states_R_v8.csv",
stringsAsFactors = T, showProgress=FALSE)
store <- fread("input/store.csv",
stringsAsFactors = T)

Create Stratified Folds For Cross-Validation

The Rossmann dataset is a “pooled-repeated measures” dataset, whereby multiple observations from different stores are grouped together. Hence, the internal cross-validation has to be done in an “honest” manner, i.e., all the observations from one store must belong to a single fold. Otherwise, it can lead to overfitting. Creating stratified folds for cross-validation can be easily achieved by utilizing the createFolds method from the Caret package in R. Since the stores dataset is a list of each store with one store per row, we can create the folds in the stores dataset prior to merging this dataset with the train and test datasets.
folds <- createFolds(factor(store$Store), k = 10, list = FALSE)store$fold <- folds
ddply(store, 'fold', summarise, prop=mean(store$fold)/10)
## fold prop
## 1 1 0.5598206
## 2 2 0.5598206
## 3 3 0.5598206
## 4 4 0.5598206
## 5 5 0.5598206
## 6 6 0.5598206
## 7 7 0.5598206
## 8 8 0.5598206
## 9 9 0.5598206
## 10 10 0.5598206

H2O Training and Validation Test Sets

We simply split the training set by date; the training set is simply all rows prior to June 2015 and the validation set are rows June 2015 and on. We then check the dimensions of the training and test sets.
trainHex<-as.h2o(train[year <2015 | month <6,],          destination_frame = "trainHex")
validHex<-as.h2o(train[year == 2015 & month >= 6,],
destination_frame = "validHex")


## [1] 785727 37
## [1] 58611 37

Feature Set For Training

We exclude the Store Id, Date, Sales, LogSales and Customers. The Store Id is only used as a identifier; we split the date into different components (day, month, and year); the Sales and Log of the Sales are what we predicting; and the we are not given the customers in the test and, hence, cannot use it as a feature in the training set.
features<-names(train)[!(names(train) %in%                            c("Id","Date","Sales","logSales", 
                             "Customers", "Closed", "fold"))]
## [1] "Store" "DayOfWeek"
## [3] "Open" "Promo"
## [5] "StateHoliday" "SchoolHoliday"
## [7] "year" "month"
## [9] "day" "day_of_year"
## [11] "week_of_year" "PromoFirstDate"
## [13] "PromoSecondDate" "DayBeforeClosed"
## [15] "DayAfterClosed" "SundayStore"
## [17] "DayBeforeRefurb" "DayAfterRefurb"
## [19] "DaysBeforeRefurb" "DaysAfterRefurb"
## [21] "State" "StoreType"
## [23] "Assortment" "CompetitionDistance"
## [25] "CompetitionOpenSinceMonth" "CompetitionOpenSinceYear"
## [27] "Promo2" "Promo2SinceWeek"
## [29] "Promo2SinceYear" "PromoInterval"
## [31] "Competition"

Random Forest


  • Average an ensemble of weakly predicting (larger) trees where each tree is de-correlated from all other trees
  • Bootstrap aggregation (bagging)
  • Fits many trees against different samples of the data and average together


  • Combine multiple decision trees, each fit to a random sample of the original data
  • Random samples
  • Rows / Columns
  • Reduce variance wtih minimal increase in bias


  • Ease of use with limited well-established default parameters
  • Robust
  • Competitive accuracy for most data sets
  • Random forest combines trees and hence incorporates most of the advantages of trees like handling missing values in variable, suiting for both classification and regression, handling highly non-linear interactions and classification boundaries.
  • In addition, Random Forest gives built-in estimates of accuracy, gives automatic variable selection. variable importance, handles wide data – data with more predictors than observations and works well off the shelf – needs no tuning, can get results very quickly. * The runtimes are quite fast, and they are able to deal with unbalanced and missing data.


  • Slow to score
  • Lack of transparency
  • When used for regression they cannot predict beyond the range in the training data, and that they may over-fit data sets that are particularly noisy. However, the best test of any algorithm is determined by how well it works upon a particular data set.

Train a Random Forest Using H20

We should note that we used the stratified folds created in the step above, but H20 also has internal cross-validation (by setting nfolds to the desired number of cross-validation folds). RF A random forest is an ensemble of decision trees that will output a prediction value. An ensemble model combines the results from different models. A Random Forest is combination of classification and regression. The result from an ensemble model is usually better than the result from one of the individual models. In Random Forest, each decision tree is constructed by using a random subset of the training data that has predictors with known response. After you have trained your forest, you can then pass each test row through it, in order to output a prediction. The goal is to predict the response when it’s unknown. The response can be categorical(classification) or continuous (regression). In a decision tree, an input is entered at the top and as it traverses down the tree the data gets bucketed into smaller and smaller sets. The random forest takes the notion of decision trees to the next level by combining trees. Thus, in ensemble terms, the trees are weak learners and the random forest is a strong learner.
rfHex <- h2o.randomForest(x=features,                          y="logSales", 
mtries = -1, # default
sample_rate = 0.632, # default
ntrees = 100,
max_depth = 30,
nbins_cats = 1115, ## allow it to fit store ID
nfolds = 0,
seed = 12345678 #Seed for random numbers (affects sampling)

Key Parameters for Random Forests

The key parameters for the Random Forest model on an H2O frame include:
  • x: A vector containing the names or indices of the predictor variables to use in building the GBM model. We have defined x to be the features to consider in building our model.
  • y: The name or index of the response variable. In our case, the log of the Sales.
  • training_frame: An H2O Frame object containing the variables in the model. In our case, this was the subset of the train dataset defined above.
  • model_id: The unique id assigned to the resulting model.
  • validation_frame: An H2O Frame object containing the variables in the model. In our case, this was the subset of the train dataset defined above.
  • mtries: At each iteration, a randomly chosen subset of the features in the training data is selected and evaluated to define the optimal split of that subset. Mtries specifies the number of features to be selected from the whole set. If set to -1, defaults to p/3 for regression, where p is the number of predictors.
  • sample_rate: The sampling rate at each split.
  • ntrees: A nonnegative integer that determines the number of trees to grow.
  • max_depth: Maximum depth to grow the tree. A user-defined tuning parameter for controlling model complexity (by number of edges). Depth is the longest path from root to the furthest leaf. Maximum depth also specifies the maximum number of interactions that can be accounted for by the model.
  • nbins_cats: For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. In our case, we set it equal to the number of stores we are trying to model (1,115).
  • nfolds: Number of folds for cross-validation. Since we are using stratified cross-validation, we have set this to 0.
  • fold_column: Column with cross-validation fold index assignment per observation, which we have set to fold, which was created in the “Create stratified folds for cross-validation” step above.

Benchmarking H20 and Random Forests

x-plot-timeAs noted before, H20 can use all the cores on a machine and hence should run substantially faster than if we used another random forest package in R. Szilard Pafka recently benchmarked several machine learning tools for scalability, speed and accuracy, including H2O. Pafka concluded that the H2O implementation of random forests is “fast, memory efficient and uses all cores. It deals with categorical variables automatically. It is also more accurate than R/Python, which may be because of dealing properly with the categorical variables, i.e. internally in the algo rather than working from a previously 1-hot encoded dataset (where the link between the dummies belonging to the same original variable is lost).”

- Benchmarking Random Forest Implementations

For more about information about this benchmarking study, see Simple/limited/incomplete benchmark for scalability, speed and accu... and a video presentation, Szilard Pafka: Benchmarking Machine Learning Tools for Scalability,....

Retrain Random Forest On Training Set

We retrain on the whole training set (without the validation set or the cross-validation set) once we have tuned the parameters. Although not shown here, we extensively tested the different paramters for random forests as well as performed feature selection.
# Use the whole train dataset.trainHex<-as.h2o(train)

# Run Random Forest Model
rfHex <- h2o.randomForest(x=features,
mtries = -1, # default
sample_rate = 0.632, # default
ntrees = 100,
max_depth = 30,
nbins_cats = 1115, ## allow it to fit store ID
nfolds = 0,
seed = 12345678 #Seed for random numbers (affects sampling)

# Model Summary and Variable Importanc
varimps = data.frame(h2o.varimp(rfHex))

#Load test dataset into H2O from R

#Get predictions out; predicts in H2O, gets them into R

# Return the predictions to the original scale of the Sales data
pred <- expm1(predictions[,1])
submission <- data.frame(Id=test$Id, Sales=pred)

# Save the submission file
write.csv(submission, "../../data/H2O_Random_Forest_v47.csv",row.names=F)

Gradient Boosted Models

We also utilized gradient boosted models (GBM) utilizing H2O as well.


  • Average an ensemble of weakly predicting (small) trees where each tree “adjusts” to the “mistakes” of the preceding trees.
  • Boosting
  • Fits consecutive trees where each solves for the net error of the prior trees.


  • Boosting: ensemble of weak learners (the notion of “weak” is being challenged in practice)
  • Fits consecutive trees where each solves for the net loss of the prior trees
  • Results of new trees are applied partially to the entire solution.


  • Often best possible model
  • Robust
  • Directly optimizes cost function ##### Weaknesses
  • Overfits
  • Need to find proper stopping point
  • Sensitive to noise and extreme values
  • Several hyper-parameters
  • Lack of transparency

Important components:

  • Number of trees
  • Maximum depth of tree
  • Learning rate (shrinkage parameter), where smaller learning rates tend to require larger number of tree and vice versa.

Gradient Boosting Models in Detail

A GBM is an ensemble of either regression or classification tree models. Both are forward-learning ensemble methods that obtain predictive results using gradually improved estimations. Boosting is a flexible nonlinear regression procedure that helps improve the accuracy of trees. Weak classification algorithms are sequentially applied to the incrementally changed data to create a series of decision trees, producing an ensemble of weak prediction models. While boosting trees increases their accuracy, it also decreases speed and user interpretability. The gradient boosting method generalizes tree boosting to minimize these drawbacks. For more information, see Gradient Boosted Models with H2O

H2O’s GBM Functionalities

  • Supervised learning for regression and classification tasks
  • Distributed and parallelized computation on either a single node or a multi-node cluster
  • Fast and memory-e cient Java implementations of the algorithms
  • The ability to run H2O from R, Python, Scala, or the intuitive web UI (Flow)
  • automatic early stopping based on convergence of user-specified metrics to user-specified relative tolerance
  • stochastic gradient boosting with column and row sampling for better generalization
  • support for exponential families (Poisson, Gamma, Tweedie) and loss functions in addition to binomial (Bernoulli), Gaussian and multinomial distributions
  • grid search for hyperparameter optimization and model selection
  • modelexportinplainJavacodefordeploymentinproductionenvironments
  • additional parameters for model tuning (for a complete listing of parame- ters, refer to the Model Parameters section.)

Key Parameters for GBM

There are three primary paramters, or knobs, to adjust in order optimize GBMs.
  1. Adding trees will help. The default is 50.
  2. Increasing the learning rate will also help. The contribution of each tree will be stronger, so the model will move further away from the overall mean.
  3. Increasing the depth will help. This is the parameter that is the least straightforward. Tuning trees and learning rate both have direct impact that is easy to understand. Changing the depth means you are adjusting the “weakness” of each learner. Adding depth makes each tree fit the data closer.

Retrain GBM On Training Set

# Train a GBM ModelgbmHex <- h2o.gbm(x=features,
sample_rate = 0.5,
col_sample_rate = 0.5,
max_depth = 20,
seed = 12345678, #Seed for random numbers (affects sampling)
ntrees = 250,
validation_frame=validHex # validation set

#Get a summary of the model and variable importance
varimps = data.frame(h2o.varimp(gbmHex))

# Get predictions out; predicts in H2O, gets them into R

# Return the predictions to the original scale of the Sales data
pred <- expm1(predictions[,1])
submission <- data.frame(Id=test$Id, Sales=pred)

# Save the submission file
write.csv(submission, "../../data/H2O_GBM_v30.csv",row.names=F)


Although we scored in top 15% of competition involving over 3,000 teams, the most valuable lesson learned was the understanding of accuracy vs interpretation trade off. By achieving the accuracy needed to score well for this Kaggle competition, we traded off interpretation that may have been needed to explain to management if this were to have been an actual business project. As a way of exploring this trade off, one of the first methods chosen was a multiple linear regression in order to gain a greater understanding of the characteristics for each feature. We achieved approximately 25% error using this simpler method. This may be an adequate prediction if market trends needed for broad scale decision making was the goal however being a Kaggle competition where accuracy was priority, this was not a method to be explored further.

Views: 2285


You need to be a member of Data Science Central to add comments!

Join Data Science Central


  • Add Videos
  • View All

© 2019   Data Science Central ®   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service