This article was written by Sondos Atwi.
What is Cross-Validation?
In Machine Learning, Cross-validation is a resampling method used for model evaluation to avoid testing a model on the same dataset on which it was trained. This is a common mistake, especially that a separate testing dataset is not always available. However, this usually leads to inaccurate performance measures (as the model will have an almost perfect score since it is being tested on the same data it was trained on). To avoid this kind of mistakes, cross validation is usually preferred.
The concept of cross validation is actually simple: Instead of using the whole dataset to train and then test on same data, we could randomly divide our data into training and testing datasets.
There are several types of cross validation methods (LOOCV – Leave-one-out cross validation, the holdout method, k-fold cross validation). Here, I’m gonna discuss the K-Fold cross validation method.
K-Fold basically consists of the below steps:
Below is a simple illustration of the procedure taken from Wikipedia.
How can it be done with R?
In the below exercise, I am using logistic regression to predict whether a passenger in the famous Titanic dataset has survived or not. The purpose is to find an optimal threshold on the predictions to know whether to classify the result as 1 or 0.
Consider that the model has predicted the following values for two passengers: p1 = 0.7 and p2 = 0.4. If the threshold is 0.5, then p1 > threshold and passenger 1 is in the survived category. Whereas, p2 < threshold, so passenger 2 is in the not survived category.
However, and depending on our data, the 0.5 ‘default’ threshold will not alway ensure the maximum the number of correct classifications. In this context, we could use cross validation to determine the best threshold for each fold based on the results of running the model on the validation set.
In my implementation, I followed the below steps: