Understanding Cross Validation across the Data Science pipeline

Cross validation is a technique commonly used In Data Science. Most people think that it plays a small part in the data science pipeline, i.e. while training the model. However, it has a broader application in model selection and hyperparameter tuning.

Let us first explore the process of cross validation itself and then see how it applies to different parts of the data science pipeline

Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample. In k-fold cross-validation, the original sample is randomly partitioned into k equal sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data.

In the model training phase, Cross-validation is primarily used in applied machine learning to estimate the skill of a machine learning model on unseen data to overcome situations like overfitting. The choice of k is usually 5 or 10, but there is no formal rule. Cross validation is implemented through KFold() scikit-learn class. Taken to one extreme, for k = 1, we get a single train/test split is created to evaluate the model. There are also other forms of cross validation ex stratified cross validation


Image source: see here 

Now, let’s recap the end to end steps for classification based on THIS classification  code which we use in "learn machinelearning coding basics in a weekend":

Classification code outline

1. Load the data

2. Exploratory data analysis

  • Analyse the target variable,
  • Check if the data is balanced,
  • Check the co-relations

3. Split the data

4. Choose a Baseline algorithm

  • Train and Test the Model
  • Choose an evaluation metric
  • Refine our dataset
  • Feature engineering

5. Test Alternative Models -- Ensemble models 

6. Choose the best model and optimise its parameters

In this context, we outline below two more cases where we can use cross validation

  1. In choice of alternate models and
  2. In hyperparameter tuning

we explain these below

1. Choosing alternate models:

If we have two models, and we want to see which one is better, we can use cross validation to compare the two for a given dataset.  For the code listed above, this is shown in the following section.

"""### Test Alternative Models

logistic = LogisticRegression()

cross_val_score(logistic, X, y, cv=5, scoring="accuracy").mean()

rnd_clf = RandomForestClassifier()

cross_val_score(rnd_clf, X, y, cv=5, scoring="accuracy").mean()


2. Hyperparameter tuning

Finally, cross validation is also used in hyperparameter tuning

As per cross validation parameter tuning grid search

In machine learning, two tasks are commonly done at the same time in data pipelines: cross validation and (hyper)parameter tuning. Cross validation is the process of training learners using one set of data and testing it using a different set. Parameter tuning is the process to selecting the values for a model’s parameters that maximize the accuracy of the model.”

 So, to conclude, cross validation is a technique used in multiple parts of the data science pipeline

Views: 1663


You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Vincent Granville on May 16, 2019 at 12:31pm

In addition to, or in combination with cross-validation, I recommend model blending. The hidden decision trees methodology that I developed a while back, do just that:

  • decision trees on observations that are better handled by decision trees
  • regression on observations that are better handled by regression

Each model produces a score, and both scores (regression based or decision tree based) are re-scaled so that the two scores are interchangeable. 

See details here:

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service