*This article was written by Sondos Atwi.*

**What is Cross-Validation?**

In Machine Learning, Cross-validation is a resampling method used for model evaluation to avoid testing a model on the same dataset on which it was trained. This is a common mistake, especially that a separate testing dataset is not always available. However, this usually leads to inaccurate performance measures (as the model will have an almost perfect score since it is being tested on the same data it was trained on). To avoid this kind of mistakes, cross validation is usually preferred.

The concept of cross validation is actually simple: Instead of using the whole dataset to train and then test on same data, we could randomly divide our data into training and testing datasets.

There are several types of cross validation** **methods (LOOCV – Leave-one-out cross validation,** **the holdout method,** **k-fold cross validation). Here, I’m gonna discuss the K-Fold cross validation method.

K-Fold basically consists of the below steps:

- Randomly split the data into k subsets, also called folds.
- Fit the model on the training data (or k-1 folds).
- Use the remaining part of the data as test set to validate the model. (Usually, in this step the accuracy or test error of the model is measured).
- Repeat the procedure k times.

Below is a simple illustration of the procedure taken from *Wikipedia.*

**How can it be done with R?**

In the below exercise, I am using logistic regression to predict whether a passenger in the famous *Titanic *dataset has survived or not. The purpose is to find an optimal threshold on the predictions to know whether to classify the result as 1 or 0.

Threshold Example:

Consider that the model has predicted the following values for two passengers: p1 = 0.7 and p2 = 0.4. If the threshold is 0.5, then p1 > threshold and passenger 1 is in the survived category. Whereas, p2 < threshold**, **so passenger 2 is in the not survived** **category.

However, and depending on our data, the 0.5 ‘default’ threshold will not alway ensure the maximum the number of correct classifications. In this context, we could use cross validation** **to determine the best threshold for each fold based on the results of running the model on the validation set.

In my implementation, I followed the below steps:

- Split the data randomly into 80
*(train and validation),*20*(test with unseen data)*. - Run cross-validation on 80% of the data, which will be used to train and validate the model.
- Get the optimal threshold after running the model on the validation dataset according to the best accuracy at each fold iteration.
- Store the best accuracy and the optimal threshold resulting from the fold iterations in a dataframe.
- Find the best threshold (the one that has the highest accuracy) and use it as a cutoff when testing the model against the test dataset.

*To read the full original article click here. For more cross validation related articles on DSC click here.*

**DSC Resources**

- Services: Hire a Data Scientist | Search DSC | Classifieds | Find a Job
- Contributors: Post a Blog | Ask a Question
- Follow us: @DataScienceCtrl | @AnalyticBridge

Popular Articles

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central