Question about training sets and data splits

Hi All:

I have a question about splitting datasets for training purposes. I have seen some people split their data into two categories:

  • test data
  • train data

while others split it into the following three categories:

  • train data
  • test data
  • validation data.

Which split is preferred, and why?

Tags: data, of, split

Comment

Comment by Abdul Samad on November 26, 2015 at 6:58am

Thanks, Pat Lapomarda.

Comment by Pat Lapomarda on November 26, 2015 at 3:22am

Validation sets are used while building a model to keep error down without overfitting.  For example, if you are building a classifier, each feature you add tends to reduce the misclassification rate on your training set; however, this can result in overfitting.  By checking the misclassification rate on a validation set each time a new feature is added, you can make sure that only features that genuinely reduce error are kept.  But since you used the variation in the validation set for model tuning, you still may not get a good prediction on a new set of data; therefore, the final assessment should be done on the third set of data: the test set.

Think of it like this:

  • Build Model:
    • Training: pick features
    • Validation: check features
  • Check Model: Testing
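To make the outline above concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration, not something from this thread: the synthetic data (`make_data`, where only feature 0 carries signal), the nearest-centroid classifier (`fit_centroids`), the 60/20/20 split, and the greedy forward selection loop. Features are added only while they lower the validation error, and the test set is touched once at the end.

```python
import random

def make_data(n, seed):
    """Synthetic rows: feature 0 is informative, features 1-3 are pure noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = [label * 2.0 + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(3)]
        rows.append((x, label))
    return rows

def fit_centroids(rows, feats):
    """Nearest-centroid classifier: per-class mean of each selected feature."""
    cents = {}
    for c in (0, 1):
        members = [x for x, y in rows if y == c]
        cents[c] = [sum(x[f] for x in members) / len(members) for f in feats]
    return cents

def error_rate(cents, rows, feats):
    """Misclassification rate of the centroid model on the given rows."""
    wrong = 0
    for x, y in rows:
        dist = {c: sum((x[f] - m) ** 2 for f, m in zip(feats, cents[c]))
                for c in cents}
        wrong += min(dist, key=dist.get) != y
    return wrong / len(rows)

# Three-way split (60/20/20 is an arbitrary choice for this sketch).
data = make_data(600, seed=0)
random.Random(1).shuffle(data)
train, val, test = data[:360], data[360:480], data[480:]

# Build model: fit on train, keep a feature only if validation error drops.
selected, best_val = [], 1.0  # 1.0 = worst possible error, so the first feature is accepted
while len(selected) < 4:
    scored = []
    for f in range(4):
        if f not in selected:
            feats = selected + [f]
            scored.append((error_rate(fit_centroids(train, feats), val, feats), f))
    err, f = min(scored)
    if err >= best_val:
        break  # no remaining feature improves validation error
    selected.append(f)
    best_val = err

# Check model: a single final assessment on data never used for tuning.
test_err = error_rate(fit_centroids(train, selected), test, selected)
print("selected features:", selected)
print(f"validation error: {best_val:.3f}  test error: {test_err:.3f}")
```

The validation error reported here was used to choose the features, so it is the test error that should be quoted as the estimate of performance on new data.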

Comment by ŞABAN on November 17, 2015 at 10:25am

The second one.

Comment by Abdul Samad on November 11, 2015 at 8:27pm

Hi William Vorhies:

Can you further explain:

"Because some techniques use the validation as part of tuning the model, that data cannot be seen as a truly independent source of model validation".

Comment by William Vorhies on November 11, 2015 at 9:54am

Abdul:

If sufficient data is available, I've always been in the three-category-split school, holding out a batch of unseen data for final model evaluation.  Because some techniques use the validation as part of tuning the model, that data cannot be seen as a truly independent source of model validation.  Testing your model against previously unseen data, however, will give you the most solid confirmation of model performance.
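One way to see why validation data stops being independent once it is used for tuning is a small simulation. The sketch below is only an illustration under assumed conditions, none of which come from this thread: the labels are coin flips (so no model can truly beat 50% accuracy), and the 200 candidate "models" are hypothetical random threshold rules. The candidate that scores best on the validation set is picked, then scored once on held-out test data.

```python
import random

rng = random.Random(42)

def make_rows(n):
    """Five noise features and a coin-flip label: there is nothing to learn."""
    return [([rng.gauss(0, 1) for _ in range(5)], rng.randint(0, 1))
            for _ in range(n)]

validation = make_rows(100)
test = make_rows(100)

def accuracy(model, rows):
    """A model is (feature index, threshold); it predicts 1 when x[feature] > threshold."""
    feat, thresh = model
    return sum((x[feat] > thresh) == bool(y) for x, y in rows) / len(rows)

# 200 hypothetical candidates: random threshold rules on a single feature.
candidates = [(rng.randrange(5), rng.uniform(-1, 1)) for _ in range(200)]

# "Tuning": choose the candidate with the highest validation accuracy.
val_scores = [accuracy(m, validation) for m in candidates]
best_index = max(range(len(candidates)), key=val_scores.__getitem__)
best_model = candidates[best_index]
best_val_acc = val_scores[best_index]

# Final check on data that played no part in the choice.
test_acc = accuracy(best_model, test)

print(f"validation accuracy of chosen model: {best_val_acc:.2f}")
print(f"test accuracy of chosen model:       {test_acc:.2f}")
```

Because the winner was selected for doing well on the validation rows, its validation accuracy will typically sit above 50% even though its true accuracy is 50%. The test score, which had no role in the selection, is the unbiased one.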
