Contributed by David Comfort and Paul. They took *NYC Data Science Academy 12 week full time Data Science Bootcamp* program between Sept 23 to Dec 18, 2015. The post was based on their* fourth class project(due at 8th week of the program).*

As part of a Kaggle competition, we were challenged by Rossmann, the second largest chain of German drug stores, to predict the daily sales for 6 weeks into the future for more than 1,000 stores. Exploratory data analysis revealed several novel features, including spikes in sales prior to, and preceding store refurbishment. We also engineered several novel features by the inclusion of external data including Google Trends, macroeconomic data, as well as weather data. We then used H20, a fast, scalable parallel-processing engine for machine learning, to build predictive models utilizing random forests, gradient boosting machines, as well as deep learning. Lastly, we combined these models using different ensemble methods to obtain better predictive performance. Training data was provided for 1,115 Rossmann stores from January 1st 2013 through July 31st 2015 .The task was to forecast 6 weeks (August 1st 2015 through September 17th 2015) of sales for 856 of the Rossmann stores identified within the testing data.

- TRAIN.CSV - historical data including sales
- TEST.CSV - historical data excluding sales
- SAMPLE_SUBMISSION.CSV - sample submission file in the correct format
- STORE.CSV - supplemental information describing each of the stores

- Id - represents a (Store, Date) duple within the test set
- Store - Unique Id for each store
- Sales - The turnover for any given day (variable to be predicted)
- Customers - The number of customers on a given day
- Open - An indicator for whether the store was open: 0 = closed, 1 = open
- StateHoliday - Indicates a state holiday
- Normally all stores, with few exceptions, are closed on state holidays.
- All schools are closed on public holidays and weekends.
- a = public holiday
- b = Easter holiday
- c = Christmas
- 0 = None

- SchoolHoliday - Indicates if the (Store, Date) was affected by the closure of public schools
- StoreType - Differentiates between 4 different store models:
- a, b, c, d

- Assortment - Describes an assortment level:
- a = basic, b = extra, c = extended

- CompetitionDistance - Distance in meters to the nearest competitor store
- CompetitionOpenSince [Month/Year] - Approximate year and month of the time the nearest competitor was opened
- Promo - Indicates whether a store is running a promo on that day
- Promo2 - Continuing and consecutive promotion for some stores:
- 0 = store is not participating
- 1 = store is participating

- Promo2Since [Year/Week] - Describes the year and calendar week when the store started participating in Promo2
- PromoInterval - Describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew.
- “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store

Exploratory data analysis was performed in ipython notebook and R. Findings discovered throughout the EDA process were all addressed during data cleaning and feature engineering. Courtesy of fellow Kaggler Paul Shearer for creating a fantastic dygraph to display all data.

- Impute Open = 1 for missing Open test dataset
- Special case found during EDA. Store 622 has several missing dates which are all weekdays with sales recorded.

- Set Open = 0 when Sales = 0 OR Customers = 0
- Standardize StateHoliday due to the use of character 0 and integer 0
- Separate date column into year, month and day. Also convert Date column to type ‘date’ and extract:
- day_of_year
- week_of_year
- quarter
- month_start
- month_end
- quarter_start
- quarter_end

- Remove observations where stores are closed. These values can be hard coded after prediction since no sales can occur with a closed store.
- Set store as factor.
- Merge store dataset with train and test
- Stores.csv contained an abundance of missing values. Machine learning methods were chosen with this in mind. The following methods that were implemented handle missing values with ease. Their methods are described below.
- Distributed Random Forest and Gradient Boosting Machine treat missing (NA) factor levels as the smallest value present (left-most in the bins), which can go left or right for any split, and unseen factor levels (the case here) to always go left in any split.
- Deep Learning by default makes an extra input neuron for missing and unseen categorical levels which can remain untrained if there were no such instances in the training data, leading to some random contribution to the next layer during testing.

- Experimenting with variables as factors:
- We experimented with setting features to factors in order to see their effects upon the MSE and residual errors. We should note that H20 can deal with large numbers of factors and the categorical data does not need to be one-hot encoded. H2O does not expand categorical data into dummy variables, but instead uses a bitset to determine which categorical levels go left or right on each split.
- H2O is also more accurate than R/Python. I think the reason for that is dealing properly with the categorical variables, i.e. internally in the algo rather than working from a previously 1-hot encoded dataset where the link between the dummies belonging to the same original variable is lost. This is by the way how the R package should work if the number of categories is small (but not in our case).
- Train a model on data that has a categorical predictor (column) with levels B,C,D (and no other levels). Let’s call these levels the “training set domain”: {B,C,D}
- During scoring, a test set has only rows with levels A,C,E for that column, the “test set domain”: {A,C,E}
- For scoring, we construct the joint “scoring domain”: {B,C,D,A,E}, which is the training domain with the extra test set domain entries appended.
- Each model can handle these extra levels {A,E} during scoring separately.The way H2O deals with categories is not only more proper and gets better AUC, but it is makes it faster and more memory efficient. See Categorical variables with random forest for more information.In addition, most machine learning tools break when you try to predict with a new level for a categorical input that was not present in the training set. However, H2O is able to handle such a situation. Here’s an example of how this works:

- See prediction with categorical variable with a new level for more information.

- We use a log transformation for the sales in order not to be as sensitive to high sales. A decent rule of thumb is if the data spans an order of magnitude, then consider using a log transform.

- Promotion First Day and Promotion Second Day
- DayBeforeClosed
- DayAfterClosed
- Store Open on Sunday
- Closed
- Refurbishment period
- EDA lead to a discovering a spike in sales before and after a period of extended close indicating a clearance and grand opening sale
- One thing to note was that Some stores in the dataset were temporarily closed for refurbishment.

- DaysBeforeRefurb
- DaysAfterRefurb
- Competition Feature
- We create this new feature by taking the square root of the difference between the maximum distance of a competitor store among all the stores and the distance of a competitor store for an individual store, times the time since a competitor opened:

# Competition Featuretrain$Competition <- (sqrt(max(train$CompetitionDistance, na.rm = TRUE) -

train$CompetitionDistance)) *

(((train$year - train$CompetitionOpenSinceYear) * 12) -

(train$CompetitionOpenSinceMonth-train$month))

test$Competition <-

(sqrt(max(test$CompetitionDistance, na.rm = TRUE) -

test$CompetitionDistance))*

(((test$year - test$CompetitionOpenSinceYear) * 12) -

(test$CompetitionOpenSinceMonth-test$month))

- German States derived from StateHoliday
- German State Weather
- Google Trends
- Export into CSV

- Have correct version of Java installed
- Start H20 instance
- Set features to factors (and test)
- Create validation set
- Tune Parameters for each model (manual or grid search)
- Use model to make predictions on test set
- Iterate

library(caret)library(data.table) library(h2o)

library(plyr)

`h2o.shutdown()`

if changing parameters below. Also, setting `assertion = FALSE`

seems to help with stability of H20.h2o.init(nthreads=-1,max_mem_size='8G', assertion = FALSE)## Successfully connected to http://127.0.0.1:54321/ ##

## R is connected to the H2O cluster:

## H2O cluster uptime: 16 hours 46 minutes

## H2O cluster version: 3.6.0.8

## H2O cluster name: H2O_started_from_R_2015_ukz280

## H2O cluster total nodes: 1

## H2O cluster total memory: 7.98 GB

## H2O cluster total cores: 8

## H2O cluster allowed cores: 8

## H2O cluster healthy: TRUE

## IP Address: 127.0.0.1

## Port : 54321

## Session ID: _sid_ac1406fd65438164da7936d76cfe44b2

## Key Count : 0

train <- fread("KaggleProject/data/train_states_R_v8.csv", stringsAsFactors = T) test <- fread("KaggleProject/data/test_states_R_v8.csv",

stringsAsFactors = T, showProgress=FALSE)

store <- fread("input/store.csv",

stringsAsFactors = T)

The Rossmann dataset is a “pooled-repeated measures” dataset, whereby multiple observations from different stores are grouped together. Hence, the internal cross-validation has to be done in an “honest” manner, i.e., all the observations from one store must belong to a single fold. Otherwise, it can lead to overfitting. Creating stratified folds for cross-validation can be easily achieved by utilizing the

`createFolds`

method from the Caret package in R. Since the stores dataset is a list of each store with one store per row, we can create the folds in the stores dataset prior to merging this dataset with the train and test datasets.folds <- createFolds(factor(store$Store), k = 10, list = FALSE)store$fold <- folds ddply(store, 'fold', summarise, prop=mean(store$fold)/10)

## fold prop

## 1 1 0.5598206

## 2 2 0.5598206

## 3 3 0.5598206

## 4 4 0.5598206

## 5 5 0.5598206

## 6 6 0.5598206

## 7 7 0.5598206

## 8 8 0.5598206

## 9 9 0.5598206

## 10 10 0.5598206

We simply split the training set by date; the training set is simply all rows prior to June 2015 and the validation set are rows June 2015 and on. We then check the dimensions of the training and test sets.

trainHex<-as.h2o(train[year <2015 | month <6,], destination_frame = "trainHex") validHex<-as.h2o(train[year == 2015 & month >= 6,],

destination_frame = "validHex")

dim(trainHex)

dim(validHex)

## [1] 785727 37

## [1] 58611 37

`Id`

, `Date`

, `Sales`

, `LogSales`

and `Customers`

. The Store `Id`

is only used as a identifier; we split the date into different components (day, month, and year); the Sales and Log of the Sales are what we predicting; and the we are not given the customers in the test and, hence, cannot use it as a feature in the training set.features<-names(train)[!(names(train) %in% c("Id","Date","Sales","logSales", "Customers", "Closed", "fold"))]

features

## [1] "Store" "DayOfWeek"

## [3] "Open" "Promo"

## [5] "StateHoliday" "SchoolHoliday"

## [7] "year" "month"

## [9] "day" "day_of_year"

## [11] "week_of_year" "PromoFirstDate"

## [13] "PromoSecondDate" "DayBeforeClosed"

## [15] "DayAfterClosed" "SundayStore"

## [17] "DayBeforeRefurb" "DayAfterRefurb"

## [19] "DaysBeforeRefurb" "DaysAfterRefurb"

## [21] "State" "StoreType"

## [23] "Assortment" "CompetitionDistance"

## [25] "CompetitionOpenSinceMonth" "CompetitionOpenSinceYear"

## [27] "Promo2" "Promo2SinceWeek"

## [29] "Promo2SinceYear" "PromoInterval"

## [31] "Competition"

- Average an ensemble of weakly predicting (larger) trees where each tree is de-correlated from all other trees
- Bootstrap aggregation (bagging)
- Fits many trees against different samples of the data and average together

- Combine multiple decision trees, each fit to a random sample of the original data
- Random samples
- Rows / Columns
- Reduce variance wtih minimal increase in bias

- Ease of use with limited well-established default parameters
- Robust
- Competitive accuracy for most data sets
- Random forest combines trees and hence incorporates most of the advantages of trees like handling missing values in variable, suiting for both classification and regression, handling highly non-linear interactions and classification boundaries.
- In addition, Random Forest gives built-in estimates of accuracy, gives automatic variable selection. variable importance, handles wide data – data with more predictors than observations and works well off the shelf – needs no tuning, can get results very quickly. * The runtimes are quite fast, and they are able to deal with unbalanced and missing data.

- Slow to score
- Lack of transparency
- When used for regression they cannot predict beyond the range in the training data, and that they may over-fit data sets that are particularly noisy. However, the best test of any algorithm is determined by how well it works upon a particular data set.

`nfolds`

to the desired number of cross-validation folds). A random forest is an ensemble of decision trees that will output a prediction value. An ensemble model combines the results from different models. A Random Forest is combination of classification and regression. The result from an ensemble model is usually better than the result from one of the individual models. In Random Forest, each decision tree is constructed by using a random subset of the training data that has predictors with known response. After you have trained your forest, you can then pass each test row through it, in order to output a prediction. The goal is to predict the response when it’s unknown. The response can be categorical(classification) or continuous (regression). In a decision tree, an input is entered at the top and as it traverses down the tree the data gets bucketed into smaller and smaller sets. The random forest takes the notion of decision trees to the next level by combining trees. Thus, in ensemble terms, the trees are weak learners and the random forest is a strong learner.rfHex <- h2o.randomForest(x=features, y="logSales", model_id="introRF",

training_frame=trainHex,

validation_frame=validHex,

mtries = -1, # default

sample_rate = 0.632, # default

ntrees = 100,

max_depth = 30,

nbins_cats = 1115, ## allow it to fit store ID

nfolds = 0,

fold_column="fold",

seed = 12345678 #Seed for random numbers (affects sampling)

)

`x`

: A vector containing the names or indices of the predictor variables to use in building the GBM model. We have defined`x`

to be the features to consider in building our model.`y`

: The name or index of the response variable. In our case, the log of the Sales.`training_frame`

: An H2O Frame object containing the variables in the model. In our case, this was the subset of the`train`

dataset defined above.`model_id`

: The unique id assigned to the resulting model.`validation_frame`

: An H2O Frame object containing the variables in the model. In our case, this was the subset of the train dataset defined above.`mtries`

: At each iteration, a randomly chosen subset of the features in the training data is selected and evaluated to define the optimal split of that subset.`Mtries`

specifies the number of features to be selected from the whole set. If set to -1, defaults to p/3 for regression, where p is the number of predictors.`sample_rate`

: The sampling rate at each split.`ntrees`

: A nonnegative integer that determines the number of trees to grow.`max_depth`

: Maximum depth to grow the tree. A user-defined tuning parameter for controlling model complexity (by number of edges). Depth is the longest path from root to the furthest leaf. Maximum depth also specifies the maximum number of interactions that can be accounted for by the model.`nbins_cats`

: For categorical columns (factors), build a histogram of this many bins, then split at the best point. Higher values can lead to more overfitting. In our case, we set it equal to the number of stores we are trying to model (1,115).`nfolds`

: Number of folds for cross-validation. Since we are using stratified cross-validation, we have set this to 0.`fold_column`

: Column with cross-validation fold index assignment per observation, which we have set to`fold`

, which was created in the “Create stratified folds for cross-validation” step above.

- Benchmarking Random Forest Implementations

For more about information about this benchmarking study, see Simple/limited/incomplete benchmark for scalability, speed and accu... and a video presentation, Szilard Pafka: Benchmarking Machine Learning Tools for Scalability,....# Use the whole train dataset.trainHex<-as.h2o(train)

# Run Random Forest Model

rfHex <- h2o.randomForest(x=features,

y="logSales",

model_id="introRF",

training_frame=trainHex,

mtries = -1, # default

sample_rate = 0.632, # default

ntrees = 100,

max_depth = 30,

nbins_cats = 1115, ## allow it to fit store ID

nfolds = 0,

seed = 12345678 #Seed for random numbers (affects sampling)

)

# Model Summary and Variable Importanc

summary(rfHex)

varimps = data.frame(h2o.varimp(rfHex))

#Load test dataset into H2O from R

testHex<-as.h2o(test)

#Get predictions out; predicts in H2O, as.data.frame gets them into R

predictions<-as.data.frame(h2o.predict(rfHex,testHex))

# Return the predictions to the original scale of the Sales data

pred <- expm1(predictions[,1])

summary(pred)

submission <- data.frame(Id=test$Id, Sales=pred)

# Save the submission file

write.csv(submission, "../../data/H2O_Random_Forest_v47.csv",row.names=F)

- Average an ensemble of weakly predicting (small) trees where each tree “adjusts” to the “mistakes” of the preceding trees.
- Boosting
- Fits consecutive trees where each solves for the net error of the prior trees.

- Boosting: ensemble of weak learners (the notion of “weak” is being challenged in practice)
- Fits consecutive trees where each solves for the net loss of the prior trees
- Results of new trees are applied partially to the entire solution.

- Often best possible model
- Robust
- Directly optimizes cost function ##### Weaknesses
- Overfits
- Need to find proper stopping point
- Sensitive to noise and extreme values
- Several hyper-parameters
- Lack of transparency

- Number of trees
- Maximum depth of tree
- Learning rate (shrinkage parameter), where smaller learning rates tend to require larger number of tree and vice versa.

- Supervised learning for regression and classification tasks
- Distributed and parallelized computation on either a single node or a multi-node cluster
- Fast and memory-e cient Java implementations of the algorithms
- The ability to run H2O from R, Python, Scala, or the intuitive web UI (Flow)
- automatic early stopping based on convergence of user-specified metrics to user-specified relative tolerance
- stochastic gradient boosting with column and row sampling for better generalization
- support for exponential families (Poisson, Gamma, Tweedie) and loss functions in addition to binomial (Bernoulli), Gaussian and multinomial distributions
- grid search for hyperparameter optimization and model selection
- modelexportinplainJavacodefordeploymentinproductionenvironments
- additional parameters for model tuning (for a complete listing of parame- ters, refer to the Model Parameters section.)

- Adding trees will help. The default is 50.
- Increasing the learning rate will also help. The contribution of each tree will be stronger, so the model will move further away from the overall mean.
- Increasing the depth will help. This is the parameter that is the least straightforward. Tuning trees and learning rate both have direct impact that is easy to understand. Changing the depth means you are adjusting the “weakness” of each learner. Adding depth makes each tree fit the data closer.

# Train a GBM ModelgbmHex <- h2o.gbm(x=features, y="logSales",

training_frame=trainHex,

model_id="introGBM",

nbins_cats=1115,

sample_rate = 0.5,

col_sample_rate = 0.5,

max_depth = 20,

learn_rate=0.05,

seed = 12345678, #Seed for random numbers (affects sampling)

ntrees = 250,

fold_column="fold",

validation_frame=validHex # validation set

)

#Get a summary of the model and variable importance

summary(gbmHex)

varimps = data.frame(h2o.varimp(gbmHex))

# Get predictions out; predicts in H2O, as.data.frame gets them into R

predictions<-as.data.frame(h2o.predict(gbmHex,testHex))

# Return the predictions to the original scale of the Sales data

pred <- expm1(predictions[,1])

summary(pred)

submission <- data.frame(Id=test$Id, Sales=pred)

# Save the submission file

write.csv(submission, "../../data/H2O_GBM_v30.csv",row.names=F)

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central