# Geico Calls Prediction: Benchmarking 8 Predictive Models

Guest blog post by Dalila Benachenhou, originally posted here. Dalila is a professor at George Washington University. In this article, benchmarks were computed on a specific data set, for Geico Calls Prediction, comparing Random Forests, Neural Networks, SVM, FDA, K Nearest Neighbors, C5.0 (Decision Trees), Logistic Regression, and CART.

Introduction:

In the first blog, we decided on the predictors. In the second blog, we pre-processed the numerical variables. We knew that different predictive models have different assumptions about their predictors. Random Forest has none, but Logistic Regression requires normality of the continuous variables and assumes that the effect of a one-unit increase in a numeric predictor on the log-odds stays constant. K Nearest Neighbors requires the predictors to be at least on the same scale. SVM, Logistic Regression, and Neural Networks tend to be sensitive to outliers. Hence, we had to cater to these different predictive model assumptions and needs. We changed predictors that needed to be changed to categorical variables, or we made sure to scale them. We transformed the Age and Tenure distributions (the only 2 continuous variables) to a normal distribution and then scaled and centered them. We came up with these 2 sets of predictors:

```r
New_Predictors <- c("CustomerSegment", "c_logins", "z_age", "z_channel23m",
                    "z_channel33m", "z_channel43m", "z_channel53m",
                    "z_payments3m", "z_tenure")

LR_new_predictors <- c("CustomerSegment", "c_channel23m", "c_channel33m",
                       "c_channel43m", "c_channel53m", "c_logins",
                       "c_payments3m", "z_age", "z_tenure")
```
LR_new_predictors are the predictors for Logistic Regression. We used LR_new_predictors for Logistic Regression, Random Forest, and K Nearest Neighbors.

Variables starting with z were scaled to the [0, 1] range if they were discrete variables, or transformed to normality and then centered and scaled if they were continuous. Variables starting with c are the categorical form of the original variables. We decided on their number of levels based on the significance of their odds ratios.
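To make the z_/c_ convention concrete, here is a minimal sketch in Python (the post's code is in R); the function names and the cutpoints are illustrative, not the authors':

```python
import numpy as np

def z_scale(x):
    """Scale a discrete variable to the [0, 1] range (min-max)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_normalize(x):
    """Center and scale a continuous variable (after a normalizing
    transform such as Box-Cox has been applied upstream)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def to_categorical(x, cutpoints):
    """Bin a numeric variable into levels; in the post the cutpoints
    were chosen by the significance of the odds ratios."""
    return np.digitize(x, cutpoints)

# Hypothetical raw values for Tenure and Logins
z_tenure = z_normalize([1, 5, 12, 24, 60])
c_logins = to_categorical([0, 2, 7, 30], cutpoints=[1, 5, 10])
```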

Method:

We decided to combine day15 and day16. From this combination, we split our dataset into a training set (80%), a validation set (1/3 of the remaining 20%), and a testing set (2/3 of the remaining 20%). Due to the extreme class imbalance, we decided to use the F1 Score to measure model performance, and tried 4 approaches to model building.
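The split described above can be sketched as follows (illustrative Python; the seed and row count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def train_valid_test_split(n_rows):
    """Shuffle row indices and split 80% / (1/3 of 20%) / (2/3 of 20%),
    i.e. roughly 80% training, 6.7% validation, 13.3% testing."""
    idx = rng.permutation(n_rows)
    n_train = int(0.8 * n_rows)
    rest = idx[n_train:]
    n_valid = len(rest) // 3      # validation gets 1/3 of the held-out 20%
    return idx[:n_train], rest[:n_valid], rest[n_valid:]

train, valid, test = train_valid_test_split(10000)
```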

In the first approach, we used down sampling during model building, and found the optimal threshold with the validation set. We set the threshold to be the point on the ROC curve where Sensitivity (Recall) and Positive Predictive Value (Precision) are jointly highest.
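One simple way to pick such a threshold is to scan the validation probabilities and keep the value that maximizes the F1 Score for the rare class; a sketch (illustrative, assuming 0/1 labels and per-row Yes probabilities, not the post's exact ROC procedure):

```python
import numpy as np

def f1(tp, fp, fn):
    """F1 Score from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def best_threshold(probs, labels):
    """Scan candidate thresholds on the validation set and keep the one
    with the highest F1 for the rare 'Yes' class."""
    best_t, best_f1 = 0.5, -1.0
    for t in np.unique(probs):
        pred = probs >= t
        tp = int(np.sum(pred & (labels == 1)))
        fp = int(np.sum(pred & (labels == 0)))
        fn = int(np.sum(~pred & (labels == 1)))
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1

# Toy validation probabilities and labels
probs = np.array([0.1, 0.2, 0.6, 0.8, 0.9])
labels = np.array([0, 0, 1, 1, 1])
t, score = best_threshold(probs, labels)
```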

In the second approach, we down sampled during the training phase and tuned the models with a utility function during the validation phase. In the third approach, we built a C5.0 model with a cost matrix, and a Random Forest with prior class probabilities and a utility function. In the fourth, and last, approach, we combined the models to create an "ensemble".
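Down sampling here means keeping every rare-class row and only an equal-sized random sample of the majority class, so the training set is balanced; a minimal sketch on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)

def down_sample(X, y, rare_class=1):
    """Keep all rare-class rows plus an equal-sized random sample of the
    majority class, yielding a 50/50 training set."""
    X, y = np.asarray(X), np.asarray(y)
    rare_idx = np.flatnonzero(y == rare_class)
    major_idx = np.flatnonzero(y != rare_class)
    keep_major = rng.choice(major_idx, size=len(rare_idx), replace=False)
    keep = np.concatenate([rare_idx, keep_major])
    return X[keep], y[keep]

# Toy data: 4 Yes rows against 96 No rows, like a 96/4 imbalance
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 4 + [0] * 96)
X_bal, y_bal = down_sample(X, y)
```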

For predictive models, in addition to Random Forest and C5.0, we built a Neural Network, a Support Vector Machine with a radial basis function kernel, a K Nearest Neighbors model, a Logistic Regression, a Flexible Discriminant Analysis (FDA), and a CART model.

Parameter Values for the Predictive Models:

Random Forest: mtry = 2, number of trees = 500
Neural Network: tried size = 1 to 10, decay values of 0, 0.1, 1, and 2, and bag = FALSE
SVM: radial basis function kernel with C (cost) = 1
FDA: degree = 1 and nprune = 1 to 25; the best nprune = 18
K Nearest Neighbors: k = 1 to 25; the best k = 20
C5.0: cost matrix

|     | No | Yes |
|-----|---:|----:|
| No  |  5 | -25 |
| Yes |  0 |  10 |

Logistic Regression: none
CART: default values
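The k = 1 to 25 grid for K Nearest Neighbors was tuned with caret in R; the same idea in a minimal, self-contained Python sketch (tiny toy data, majority vote for 0/1 labels, validation accuracy as the score):

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k):
    """Predict 0/1 labels by majority vote among the k nearest neighbors."""
    preds = []
    for x in np.atleast_2d(X_new):
        d = np.linalg.norm(X_train - x, axis=1)  # distances on scaled predictors
        nearest = y_train[np.argsort(d)[:k]]
        preds.append(int(nearest.mean() > 0.5))  # majority vote
    return np.array(preds)

def tune_k(X_tr, y_tr, X_va, y_va, ks=range(1, 26)):
    """Pick k by validation accuracy, mirroring the post's k = 1..25 grid."""
    scores = {k: (knn_predict(X_tr, y_tr, X_va, k) == y_va).mean() for k in ks}
    return max(scores, key=scores.get)

# Two well-separated toy clusters
X_tr = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
```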

Results:

Many of you will be wondering why we included Random Forest again. In the previous blog, we provided raw predictors to the Random Forest. Here, we provide the LR_new_predictors to the Random Forest. We want to see if pre-processed predictors result in higher Random Forest performance. Except for Age and Tenure, the new predictors are mostly categorical variables.

| Model | Yes Precision | No Precision | Yes Recall | No Recall | Yes F1 Score | No F1 Score |
|---|---:|---:|---:|---:|---:|---:|
| Random Forest | 80.2 | 83.42 | 16.44 | 99.04 | 27.29 | 90.56 |
| Neural Network | 74.85 | 81.13 | 13.89 | 98.76 | 23.43 | 89.08 |
| SVM | 74.85 | 78.3 | 12.3 | 98.71 | 21.13 | 87.33 |
| FDA | 72.46 | 78.86 | 12.24 | 98.51 | 20.94 | 87.60 |
| KNN | 69.05 | 78.58 | 11.65 | 98.15 | 19.94 | 87.28 |
| Logistic Reg | 75.54 | 75.5 | 11.13 | 98.7 | 19.40 | 85.56 |
| CART | 73.65 | 74.96 | 10.69 | 98.59 | 18.67 | 85.17 |
| Median | 74.85 | 78.58 | 12.24 | 98.7 | 20.94 | 87.33 |

From the Calibration plot, except for CART (rpart package used), all the other models had good calibration and good performance.

In addition to this, we also used utility functions to improve Random Forest, Neural Network, SVM, and K Nearest Neighbors. Here we used down sampling, and then found a utility function using the validation data set to tune our models.

| Model | Yes Precision | No Precision | Yes Recall | No Recall | Yes F1 Score | No F1 Score |
|---|---:|---:|---:|---:|---:|---:|
| Random Forest | 72.46 | 90.77 | 24.2 | 98.78 | 36.28 | 94.6 |
| Neural Network | 70.66 | 80.03 | 12.58 | 98.53 | 21.36 | 88.32 |
| KNN | 67.07 | 78.96 | 11.48 | 98.33 | 19.6 | 87.59 |
| SVM | 77.84 | 73.65 | 10.73 | 98.79 | 18.86 | 84.39 |

Max Kuhn, Kjell Johnson, and many other machine learning experts told us that the best approach is to down sample when building a predictive model. Still, we decided to build C5.0 with a cost matrix, and Random Forest with prior class probabilities and a utility function at the testing level. We were blown away by the result from Random Forest. The performance was extremely good, and better than all the other models. Yes, its Yes Precision and its No Recall values were lower than C5.0's, but its F1 Scores were much higher (79.88 vs. 30.49 for Yes, and 99.17 vs. 91.70 for No).

| Model | Yes Precision | No Precision | Yes Recall | No Recall | Yes F1 Score | No F1 Score |
|---|---:|---:|---:|---:|---:|---:|
| Random Forest | 80.83 | 99.12 | 78.94 | 98.21 | 79.88 | 99.17 |
| C5.0 | 83.23 | 85.24 | 18.66 | 99.21 | 30.49 | 91.70 |

The utility function for the Random Forest is as follows:

|     | No | Yes |
|-----|---:|----:|
| No  |  5 | -25 |
| Yes |  0 |  10 |
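A utility matrix like this can be applied to a model's class probabilities by predicting the class whose expected utility is highest; under this particular matrix that works out to predicting Yes whenever P(Yes) > 0.125, which is what pushes the Yes Recall up. A sketch of the idea (illustrative; the post does not show its exact mechanism):

```python
import numpy as np

# Utility of predicting (row) when the truth is (column): classes [No, Yes]
UTILITY = np.array([[5, -25],    # predict No : +5 if truly No, -25 if truly Yes
                    [0,  10]])   # predict Yes:  0 if truly No, +10 if truly Yes

def decide(p_yes):
    """Pick the class with the highest expected utility given P(Yes)."""
    p = np.array([1 - p_yes, p_yes])   # [P(No), P(Yes)]
    expected = UTILITY @ p             # expected utility of each decision
    return "Yes" if expected[1] > expected[0] else "No"
```

For example, a case with P(Yes) = 0.2 is called Yes even though a default 0.5 threshold would call it No.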

For C5.0 to work, we had to set the cost of misclassifying a Yes to a high value (72), and we had to put a small cost on misclassifying a No. We also had to set trials to 1, and winnow to FALSE. We could have improved C5.0's performance by using a Monte Carlo search to find the optimal cost values.

|     | No | Yes |
|-----|---:|----:|
| No  |  0 |  72 |
| Yes |  5 |   0 |

While the Precision for Yes decreased, the Yes Recall improved a lot: out of every 100 actual Yes cases, we now detect 24.2. For SVM, we improved the Precision for Yes at the expense of the Precision for No and the Recall for Yes. For KNN, we improved the Precision for No, and slightly the F1 Score for No.

What if we use the majority-wins approach to decide on our prediction?

| Model | Yes Precision | No Precision | Yes Recall | No Recall | Yes F1 Score | No F1 Score |
|---|---:|---:|---:|---:|---:|---:|
| Majority Wins | 76.65 | 77.86 | 12.34 | 98.80 | 21.26 | 87.09 |

We improve the Yes F1 Score, with only a slight, insignificant decrease in the No F1 Score. For Yes Precision, the majority vote is second only to Random Forest. For Yes Recall, the majority vote is lower than Random Forest and Neural Network. However, for No, its performance beats only CART and Logistic Regression. Here, we get a better prediction of the rarer class at the expense of the more common one. A more sophisticated ensemble of all these predictive models would likely produce better predictions than a simple majority vote.

Code:

```r
# Code to get the majority-wins prediction
library(plyr)   # for count()

results <- data.frame(RF = newValuerf, SVM = newValueSVM,
                      LR = newValueLR, rpart = newValuerpart,
                      KNN = newValueKnn, NN = newValueNN, FDA = newValueFDA)

outcome <- rep(NA, 4273)
for (i in 1:4273) {
  m <- count(t(results[i, ]))   # tally the votes of the 7 models for row i
  l <- dim(m)[1]                # number of distinct predicted classes
  if (l == 2) {
    if (m[1, 2] > m[2, 2]) {    # keep the class with the larger vote count
      outcome[i] <- m[1, 1]
    } else {
      outcome[i] <- m[2, 1]
    }
  } else {
    outcome[i] <- m[1, 1]       # unanimous vote
  }
}
```

Conclusion:

We created 22 models. Even though the classes were extremely imbalanced (96.2% vs. 3.8%), all the models performed well. We designed 4 types of experiments. The first approach required down sampling when building the model, and then finding the threshold from the validation set to optimize it. The second approach still required down sampling, but then found a utility function from the validation set to improve the models. The third approach used only the prior probability of the classes and either a cost matrix for C5.0 or a utility function for Random Forest. The final approach combined the predictions from the first approach.
In all these experiments, Random Forest emerged as the winner. The model achieved a Yes Recall of 78.94% and a Yes Precision of 80.83%. This means that out of all the Yes predictions it makes, 80.83% are truly Yes, and out of all the actual Yes cases, it detects 78.94%. One may ask why. We think the great performance has to do with pre-processing and the type of predictors. By taking time to understand the variables' characteristics, we found that many could be changed to categorical variables with few levels (fewer than 5). At the end of pre-processing, only 2 variables stayed numerical (Age and Tenure).
Note from the Editor:
The first part of this blog - pre-processing predictors - can be found here, and has interesting visualizations.

