Recently, Kaggle master Kazanova, along with some of his friends, released a "How to Win a Data Science Competition" Coursera course. The course includes a final project, which is itself a time series prediction problem. Here I will describe how I got a top 10 position as of writing this article.

## Description of the Problem:

In this competition, we were given a challenging time-series dataset consisting of daily sales data, kindly provided by one of the largest Russian software firms - 1C Company.

We were asked to predict total sales for every product and store in the next month.

The evaluation metric was RMSE, where true target values are clipped into the [0, 20] range. This target range will be very important for understanding the submissions I prepared.
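To make the metric concrete, here is a minimal sketch of it in NumPy (the function name and sample values are my own, for illustration); true values outside [0, 20] are clipped before scoring:

```python
import numpy as np

def clipped_rmse(y_true, y_pred):
    # True target values are clipped into the [0, 20] range before scoring
    y_true = np.clip(y_true, 0, 20)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# A true value of 25 is scored as 20, so predicting 20 for it is a perfect hit
print(clipped_rmse(np.array([0.0, 25.0, 5.0]), np.array([0.0, 20.0, 5.0])))  # 0.0
```

This is why a prediction can never be penalized for staying inside [0, 20], which drives the clipping tricks later in the post.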

The main thing I noticed was that data preparation was by far the most important part of this competition. I created a variety of features. Here are the steps I took and the features I created.

## 1. Created a data frame of all Date_block_num, Store and Item combinations:

This is important because, for the months where we don't have data for an item–store combination, the machine learning algorithm needs to be told explicitly that the sales were zero.

```python
import numpy as np
import pandas as pd
from itertools import product

# Create a "grid" with columns
index_cols = ['shop_id', 'item_id', 'date_block_num']

# For every month we create a grid from all shop/item combinations seen in that month
grid = []
for block_num in sales['date_block_num'].unique():
    cur_shops = sales.loc[sales['date_block_num'] == block_num, 'shop_id'].unique()
    cur_items = sales.loc[sales['date_block_num'] == block_num, 'item_id'].unique()
    grid.append(np.array(list(product(cur_shops, cur_items, [block_num])), dtype='int32'))
grid = pd.DataFrame(np.vstack(grid), columns=index_cols, dtype=np.int32)
```

## 2. Cleaned up the sales data a little after some basic EDA:

```python
sales = sales[sales.item_price < 100000]
sales = sales[sales.item_cnt_day <= 1000]
```

## 3. Created Mean Encodings:

```python
sales_m = (sales.groupby(['date_block_num', 'shop_id', 'item_id'])
                .agg({'item_cnt_day': 'sum', 'item_price': 'mean'})
                .reset_index())

sales_m = pd.merge(grid, sales_m, on=['date_block_num', 'shop_id', 'item_id'],
                   how='left').fillna(0)

# adding the category id too
sales_m = pd.merge(sales_m, items, on=['item_id'], how='left')

for type_id in ['item_id', 'shop_id', 'item_category_id']:
    for column_id, aggregator, aggtype in [
            ('item_price', np.mean, 'avg'),
            ('item_cnt_day', np.sum, 'sum'),
            ('item_cnt_day', np.mean, 'avg')]:
        mean_df = (sales_m.groupby([type_id, 'date_block_num'])
                          .aggregate(aggregator).reset_index()
                          [[column_id, type_id, 'date_block_num']])
        mean_df.columns = [type_id + '_' + aggtype + '_' + column_id,
                           type_id, 'date_block_num']
        sales_m = pd.merge(sales_m, mean_df, on=['date_block_num', type_id], how='left')
```

The lines above add the following 9 features:

• 'item_id_avg_item_price'
• 'item_id_sum_item_cnt_day'
• 'item_id_avg_item_cnt_day'
• 'shop_id_avg_item_price'
• 'shop_id_sum_item_cnt_day'
• 'shop_id_avg_item_cnt_day'
• 'item_category_id_avg_item_price'
• 'item_category_id_sum_item_cnt_day'
• 'item_category_id_avg_item_cnt_day'
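To make the naming scheme concrete, here is a toy run of the groupby-and-merge pattern for one `(type_id, column_id, aggtype)` combination — the data frame below is made up for illustration:

```python
import pandas as pd

# Hypothetical daily sales rows
sales_toy = pd.DataFrame({
    'date_block_num': [0, 0, 0, 1],
    'item_id':        [10, 10, 11, 10],
    'item_cnt_day':   [2.0, 4.0, 1.0, 5.0],
})

# type_id='item_id', column_id='item_cnt_day', aggtype='avg'
# produces the feature 'item_id_avg_item_cnt_day'
mean_df = (sales_toy.groupby(['item_id', 'date_block_num'])['item_cnt_day']
                    .mean().reset_index()
                    .rename(columns={'item_cnt_day': 'item_id_avg_item_cnt_day'}))
print(mean_df)
```

Item 10 sold 2 and 4 units in month 0, so its `item_id_avg_item_cnt_day` for that month is 3.0; merging this back on `['date_block_num', 'item_id']` attaches it to every row of that item in that month.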

## 4. Create Lag Features:

Next, we create lag features with different lag periods on the following features:

• 'item_id_avg_item_price'
• 'item_id_sum_item_cnt_day'
• 'item_id_avg_item_cnt_day'
• 'shop_id_avg_item_price'
• 'shop_id_sum_item_cnt_day'
• 'shop_id_avg_item_cnt_day'
• 'item_category_id_avg_item_price'
• 'item_category_id_sum_item_cnt_day'
• 'item_category_id_avg_item_cnt_day'
• 'item_cnt_day'
```python
lag_variables = list(sales_m.columns[7:]) + ['item_cnt_day']
lags = [1, 2, 3, 4, 5, 12]

sales_means = sales_m  # frame we keep merging the lag features into

for lag in lags:
    sales_new_df = sales_m.copy()
    sales_new_df.date_block_num += lag
    sales_new_df = sales_new_df[['date_block_num', 'shop_id', 'item_id'] + lag_variables]
    sales_new_df.columns = ['date_block_num', 'shop_id', 'item_id'] + \
        [lag_feat + '_lag_' + str(lag) for lag_feat in lag_variables]
    sales_means = pd.merge(sales_means, sales_new_df,
                           on=['date_block_num', 'shop_id', 'item_id'], how='left')
```
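The shift trick can be checked on toy data: adding `lag` to `date_block_num` makes month `m`'s rows join onto month `m + lag`, so each row ends up seeing values from `lag` months earlier (the frame below is hypothetical):

```python
import pandas as pd

monthly = pd.DataFrame({
    'date_block_num': [0, 1, 2],
    'shop_id':        [1, 1, 1],
    'item_id':        [10, 10, 10],
    'item_cnt_day':   [3.0, 5.0, 2.0],
})

lag = 1
lagged = monthly.copy()
lagged['date_block_num'] += lag  # month m's value is now keyed to month m + 1
lagged = lagged.rename(columns={'item_cnt_day': f'item_cnt_day_lag_{lag}'})

out = monthly.merge(lagged, on=['date_block_num', 'shop_id', 'item_id'], how='left')
print(out[['date_block_num', 'item_cnt_day', 'item_cnt_day_lag_1']])
```

Month 1's row picks up month 0's count (3.0), month 2 picks up month 1's (5.0), and month 0 gets NaN because there is no earlier month — which is exactly why the next step fills NAs.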

## 5. Fill NAs: zeros for counts, medians for prices:

```python
for feat in sales_means.columns:
    if 'item_cnt' in feat:
        sales_means[feat] = sales_means[feat].fillna(0)
    elif 'item_price' in feat:
        sales_means[feat] = sales_means[feat].fillna(sales_means[feat].median())
```

## 6. Drop the columns that we are not going to use in training:

```python
cols_to_drop = lag_variables[:-1] + ['item_name', 'item_price']
```

## 7. Take a recent bit of data only:

```python
sales_means = sales_means[sales_means['date_block_num'] > 12]
```

## 8. Split in train and CV :

```python
X_train = sales_means[sales_means['date_block_num'] < 33].drop(cols_to_drop, axis=1)
X_cv = sales_means[sales_means['date_block_num'] == 33].drop(cols_to_drop, axis=1)
```

## 9. THE MAGIC SAUCE:

At the start, I noted that the [0, 20] clipping would be important. In the next few lines, I clipped the target to the range [0, 40]. You might ask why 40. An intuitive answer: had I clipped to [0, 20], very few tree nodes could give 20 as an answer, whereas clipping to 40 makes producing a 20 much easier. Note that we will still clip our predictions to the [0, 20] range at the end.

```python
def clip(x):
    if x > 40:
        return 40
    elif x < 0:
        return 0
    else:
        return x

X_train['item_cnt_day'] = X_train.apply(lambda x: clip(x['item_cnt_day']), axis=1)
X_cv['item_cnt_day'] = X_cv.apply(lambda x: clip(x['item_cnt_day']), axis=1)
```
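For reference, pandas ships a vectorized `Series.clip` that does the same job as the element-wise function, much faster than a row-wise `apply` (a small sketch on made-up values):

```python
import pandas as pd

counts = pd.Series([-3.0, 7.0, 55.0])

# Vectorized equivalent of clipping each value into [0, 40]
clipped = counts.clip(0, 40)
print(clipped.tolist())  # [0.0, 7.0, 40.0]
```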

## 10. Modelling:

• Created an XGBoost model to find the most important features (kept the top 42)
• Used hyperopt to tune XGBoost's hyperparameters
• Used the top 10 tuned XGBoost models to generate predictions
• Clipped the predictions to the [0, 20] range
• The final solution was the average of these 10 predictions
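The XGBoost training and hyperopt search are too long for a snippet, but the final clip-and-blend step can be sketched in plain NumPy (the prediction arrays below are hypothetical stand-ins for the 10 tuned models' outputs):

```python
import numpy as np

# Hypothetical: per (shop, item) predictions from the 10 best tuned XGBoost models
preds = [np.array([1.5, 30.0, -2.0]) + 0.1 * i for i in range(10)]

# Clip each model's predictions to the scored [0, 20] range, then average them
final = np.mean([np.clip(p, 0, 20) for p in preds], axis=0)
print(final)  # [ 1.95 20.    0.  ]
```

Because the metric clips true values into [0, 20], pulling out-of-range predictions back to the boundary can only help the RMSE.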

I learned a lot of new things from this awesome course. Highly recommended.

Originally published on MLWhiz

