
Performance From Various Predictive Models

Guest blog post by Dalila Benachenhou, originally posted here. Dalila is a professor at George Washington University. In this article, benchmarks are computed on a specific data set for Geico call prediction, comparing Random Forests, Neural Networks, SVM, FDA, K Nearest Neighbors, C5.0 (Decision Trees), Logistic Regression, and CART.

Introduction:

In the first blog, we decided on the predictors.  In the second blog, we pre-processed the numerical variables.  We knew that different predictive models make different assumptions about their predictors.  Random Forest makes none, but Logistic Regression requires normality of the continuous variables and assumes that the change in log-odds between 2 consecutive unit levels of a numeric predictor stays constant.  K Nearest Neighbors requires the predictors to be at least on the same scale.  SVM, Logistic Regression, and Neural Networks tend to be sensitive to outliers.  Hence, we had to cater to these different predictive model assumptions and needs.  We changed predictors that needed to be changed to categorical variables, or we made sure to scale them.  We transformed the distributions of Age and Tenure (the only 2 continuous variables) to a normal distribution and then scaled and centered them.  We came up with these 2 sets of predictors:

New_Predictors <- c("customersegment", "c_logins", "z_age", "z_channel23m",
                    "z_channel33m", "z_channel43m", "z_channel53m",
                    "z_payments3m", "z_tenure")

LR_new_predictors <- c("customersegment", "c_channel23m", "c_channel33m",
                       "c_channel43m", "c_channel53m", "c_logins",
                       "c_payments3m", "z_age", "z_tenure")
LR_new_predictors are the predictors for Logistic Regression.  We used LR_new_predictors for Logistic Regression, Random Forest, and K Nearest Neighbors.

Variables starting with z_ are scaled to the [0, 1] range if they were discrete variables, or normalized (scaled and centered) if they were continuous.  Variables starting with c_ are the categorical form of the original variables.  We decided on their number of levels by the significance of their odds ratios.
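As an illustration, here is a minimal sketch of these transformations using the caret package; the data frame name (calls) and the raw column names (Age, Tenure) are assumptions for illustration, not the original code.

library(caret)

# Box-Cox transform the two continuous variables toward normality,
# then center and scale them (Box-Cox assumes strictly positive values)
pp <- preProcess(calls[, c("Age", "Tenure")],
                 method = c("BoxCox", "center", "scale"))
calls[, c("z_age", "z_tenure")] <- predict(pp, calls[, c("Age", "Tenure")])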

Method:

We decided to combine day15 and day16.  From this combination, we split our dataset into a Training set (80%), a Validation set (1/3 of the remaining 20%), and a Testing set (2/3 of the remaining 20%).  Due to the extreme class imbalance, we decided to use the F1 Score to measure the models' performance, and to try 4 approaches to model building.
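A sketch of this split and of the F1 Score, assuming a combined day15/day16 data frame named calls with an outcome column Called (both names hypothetical):

library(caret)
set.seed(42)

# 80% training, then the remaining 20% split 1/3 validation, 2/3 testing
in_train   <- createDataPartition(calls$Called, p = 0.8, list = FALSE)
training   <- calls[in_train, ]
holdout    <- calls[-in_train, ]
in_valid   <- createDataPartition(holdout$Called, p = 1/3, list = FALSE)
validation <- holdout[in_valid, ]
testing    <- holdout[-in_valid, ]

# F1 Score: harmonic mean of precision and recall
f1 <- function(precision, recall) 2 * precision * recall / (precision + recall)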

In the first approach, we used down sampling during model building, and found the optimal threshold with the validation set.  We set the threshold to the point on the ROC curve where Sensitivity (recall) and Positive Predictive Value (precision) are jointly highest.
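One way to pick such a threshold from the validation set is with the pROC package; valid_probs (the predicted probabilities of Yes) is a hypothetical name, and "closest.topleft" is one of several reasonable optimality criteria:

library(pROC)

# ROC curve on the validation set, then the threshold closest to the top-left corner
roc_obj <- roc(validation$Called, valid_probs)
best    <- coords(roc_obj, x = "best", best.method = "closest.topleft")
pred    <- ifelse(valid_probs >= best[["threshold"]], "Yes", "No")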

In the second approach, we down sampled during the training phase, and tuned the models with a utility function during the validation phase.  In the third approach, we built a C5.0 model with a cost matrix, and a Random Forest with a prior class probability and a utility function.  In the fourth, and last, approach, we combined the models to create an "ensemble".
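Down sampling during training can be done inside caret's resampling loop; a sketch, reusing the assumed Called outcome column:

library(caret)

ctrl <- trainControl(method = "cv", number = 5,
                     sampling = "down",            # down-sample the majority class
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

rf_down <- train(Called ~ ., data = training[, c(LR_new_predictors, "Called")],
                 method = "rf", metric = "ROC", trControl = ctrl)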

For predictive models, in addition to Random Forest and C5.0, we built a Neural Network, a Support Vector Machine with a radial basis function, a K Nearest Neighbors model, a Logistic Regression, a Flexible Discriminant Analysis (FDA), and a CART model.

PARAMETER VALUES FOR THE PREDICTIVE MODELS:

Random Forest: mtry = 2, number of trees = 500
Neural Network: size = 1 to 10, decay values 0, 0.1, 1, and 2, and bag = FALSE
SVM: radial basis function kernel with C (cost) = 1
FDA: degree = 1 and nprune = 1 to 25; the best nprune = 18
K Nearest Neighbors: k = 1 to 25; the best k = 20
C5.0: Cost Matrix

         No    Yes
  No      5    -25
  Yes     0     10


Logistic Regression: none
CART: default values
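These parameter values map onto caret tuning grids roughly as follows (a sketch; the svmRadial sigma value is our assumption, since caret can also estimate it):

library(caret)

rf_grid  <- expand.grid(mtry = 2)                              # Random Forest
nn_grid  <- expand.grid(size = 1:10,                           # avNNet
                        decay = c(0, 0.1, 1, 2), bag = FALSE)
svm_grid <- expand.grid(C = 1, sigma = 0.01)                   # svmRadial; sigma assumed
fda_grid <- expand.grid(degree = 1, nprune = 1:25)             # FDA
knn_grid <- expand.grid(k = 1:25)                              # KNN

# e.g. fda_fit <- train(Called ~ ., data = training, method = "fda",
#                       tuneGrid = fda_grid, trControl = ctrl)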

Results:

Many of you may wonder why we included Random Forest again.  In the previous blog, we provided raw predictors to the Random Forest.  Here, we provide the LR_new_predictors to the Random Forest.  We want to see if pre-processed predictors result in higher Random Forest performance.  Except for Age and Tenure, the new predictors are mostly categorical variables.

Model            Yes Precision  No Precision  Yes Recall  No Recall  Yes F1 Score  No F1 Score
Random Forest            80.20         83.42       16.44      99.04         27.29        90.56
Neural Network           74.85         81.13       13.89      98.76         23.43        89.08
SVM                      74.85         78.30       12.30      98.71         21.13        87.33
FDA                      72.46         78.86       12.24      98.51         20.94        87.60
KNN                      69.05         78.58       11.65      98.15         19.94        87.28
Logistic Reg             75.54         75.50       11.13      98.70         19.40        85.56
CART                     73.65         74.96       10.69      98.59         18.67        85.17
Median                   74.85         78.58       12.24      98.70         20.94        87.33

From the calibration plot, except for CART (the rpart package was used), all the other models had good calibration and good performance.
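A calibration plot like this one can be produced with caret's calibration() function; a sketch, where rf_probs holds a model's predicted Yes probabilities on the test set (hypothetical names):

library(caret)

cal_df  <- data.frame(obs = testing$Called, RF = rf_probs[, "Yes"])
cal_obj <- calibration(obs ~ RF, data = cal_df, class = "Yes")
plot(cal_obj)   # observed event rate vs. predicted probability bins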

In addition, we used utility functions to improve Random Forest, Neural Network, SVM, and K Nearest Neighbors.  Here we used down sampling, and then found a utility function using the validation data set to tune our models.

Model            Yes Precision  No Precision  Yes Recall  No Recall  Yes F1 Score  No F1 Score
Random Forest            72.46         90.77       24.20      98.78         36.28        94.60
Neural Network           70.66         80.03       12.58      98.53         21.36        88.32
KNN                      67.07         78.96       11.48      98.33         19.60        87.59
SVM                      77.84         73.65       10.73      98.79         18.86        84.39

From Max Kuhn, Kjell Johnson, and many machine learning experts, we were told that the best approach is to down sample when building a predictive model.  Still, we decided to build C5.0 with a cost matrix, and a Random Forest with a prior class probability and a utility function at the testing level.  We were blown away by the result from the Random Forest.  The performance was extremely good, and better than all the other models.  Yes, its Yes Precision and its No Recall values were lower than C5.0's, but its F1 Scores were much higher (79.88 vs. 30.49 for Yes, and 99.17 vs. 91.70 for No).

Model            Yes Precision  No Precision  Yes Recall  No Recall  Yes F1 Score  No F1 Score
Random Forest            80.83         99.12       78.94      98.21         79.88        99.17
C5.0                     83.23         85.24       18.66      99.21         30.49        91.70

The utility function for the Random Forest is as follows:

         No    Yes
  No      5    -25
  Yes     0     10
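A sketch of how this could be implemented with the randomForest package, passing the class priors through classwt and then choosing the class with the highest expected utility; the orientation of the utility matrix (rows = actual class) is our reading of the table above, and all object names are assumptions:

library(randomForest)

rf_prior <- randomForest(x = training[, LR_new_predictors],
                         y = training$Called,                  # factor, levels No/Yes
                         ntree = 500, mtry = 2,
                         classwt = c(No = 0.962, Yes = 0.038)) # prior class probabilities

U <- matrix(c(5, -25,
              0,  10),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("No", "Yes"), c("No", "Yes")))

probs <- predict(rf_prior, testing[, LR_new_predictors], type = "prob")
eu    <- probs %*% U                  # expected utility of predicting No / Yes
pred  <- colnames(eu)[max.col(eu)]    # class that maximizes expected utility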
For C5.0 to work, we had to set the cost for misclassifying a Yes to a high value (72), and put a small cost on misclassifying a No.  We also had to set trials to 1 and winnow to FALSE.  We could have improved the C5.0 performance by using Monte Carlo search to find the optimal cost values.

         No    Yes
  No      0     72
  Yes     5      0
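A sketch of the corresponding C5.0 call using the C50 package (the cost-matrix orientation follows the table above; consult the C50 documentation for which dimension is predicted vs. observed):

library(C50)

c5_costs <- matrix(c(0, 72,
                     5,  0),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("No", "Yes"), c("No", "Yes")))

c5_fit <- C5.0(x = training[, LR_new_predictors],
               y = training$Called,
               trials = 1,                              # single boosting iteration
               costs = c5_costs,
               control = C5.0Control(winnow = FALSE))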

While the Precision for Yes decreased, the Yes Recall improved a lot: the Random Forest now correctly identifies 24.2% of all actual Yes cases.  For SVM, we improved the Precision for Yes at the expense of the Precision for No and the Recall for Yes.  For KNN, we improved the Precision for No, and slightly the F1 Score for No.

What if we use the majority-wins approach to decide on our prediction?

Model            Yes Precision  No Precision  Yes Recall  No Recall  Yes F1 Score  No F1 Score
Majority Wins            76.65         77.86       12.34      98.80         21.26        87.09

We improve the Yes F1 Score, with only a slight, insignificant decrease in the No F1 Score.  For Yes Precision, the majority vote is second only to Random Forest.  For Yes Recall, it is lower than both Random Forest and Neural Network.  For the No class, however, its performance beats only CART and Logistic Regression.  Here, we get a better prediction of the rarest event at the expense of the most common one.  A more sophisticated ensemble of all these predictive models would most likely produce better predictions than this simple majority vote.

Code:

# Code to get the majority-wins prediction
library(plyr)   # count() tallies the models' votes

results <- data.frame(RF = newValuerf, SVM = newValueSVM,
                      LR = newValueLR, rpart = newValuerpart,
                      KNN = newValueKnn, NN = newValueNN, FDA = newValueFDA)

outcome <- character(nrow(results))
for (i in 1:4273) {                  # 4273 test cases
  m <- count(t(results[i, ]))        # tally the 7 models' votes for case i
  l <- dim(m)
  if (l[1] == 2) {                   # both classes received votes
    if (m[1, 2] > m[2, 2]) {
      outcome[i] <- as.character(m[1, 1])
    } else {
      outcome[i] <- as.character(m[2, 1])
    }
  } else {                           # unanimous vote
    outcome[i] <- as.character(m[1, 1])
  }
}

[Calibration plot: CalCurveGeico]

Conclusion:

We created 22 models.  Even though the classes were extremely imbalanced (96.2% vs. 3.8%), all the models performed well.  We designed 4 types of experiments.  The first approach required down sampling when building the model, and then finding the threshold from the validation set to optimize it.  The second approach still required down sampling, but then used a utility function found from the validation set to improve the models.  The third approach used only the prior probability of the classes and either a cost matrix for C5.0 or a utility function for Random Forest.  The final approach combined the predictions from the first approach.
In all these experiments, Random Forest emerged as the winner.  The model achieved a Yes Recall of 78.94% and a Yes Precision of 80.83%.  This means that, of all the Yes predictions it makes, 80.83% are true Yes, and, of all the actual Yes cases, it detects 78.94%.  One may ask why.  We think the great performance has to do with the pre-processing and the type of predictors.  By taking time to understand the variables' characteristics, we found that many could be changed to categorical variables with few levels (less than 5).  At the end of pre-processing, only 2 variables stayed numerical (Age and Tenure).

Read full article here
Note from the Editor:
The first part of this blog – pre-processing predictors – can be found here, and has interesting visualizations.