
Impact of target class proportions on accuracy of classification

When we build classification models from training data, the proportion of target classes does impact the accuracy of predictions. This is an experiment to measure the level of that impact.


Let us say you are trying to predict which visitors to your website will buy a product. You collect historical data about the visitors' characteristics and actions, and also whether they bought something or not. This is the model-building data set. The "Buy Decision" variable becomes the target variable we are trying to predict; it has two possible values, "yes" and "no". If 70% of the records in the training data set have "no" in them, then the proportion of classes is 70-30 between "no" and "yes".

If we build a model using this data set, what is the impact of this proportion on the overall accuracy of predictions made with the model? Will the accuracy be higher if the ratio is 50-50 rather than 90-10? To test this, we performed multiple iterations of classification using a base data set. For each iteration, we drew a random sample from the base data set with a different proportion of "no" to "yes" records; the total number of records remained the same for all iterations. We then split the data into training and testing sets, with both sets retaining the same proportion of class values, built a classification model on the training set, and predicted the test set. For each iteration, we measured the following (a rough code sketch of the procedure follows the list):

  • Overall accuracy
  • Accuracy of "No" predictions - how well we predict "No"
  • Accuracy of "Yes" predictions - how well we predict "Yes".
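To make the procedure concrete, here is a minimal sketch of one iteration, assuming scikit-learn, a hypothetical pandas DataFrame base_data holding the historical visitor records with numeric feature columns plus a "buy" column, and a decision tree as the classifier. The original post does not name the classifier or the data set, so these choices are assumptions for illustration only.

    # Minimal sketch of one iteration of the experiment described above.
    # Assumes a hypothetical DataFrame `base_data` with numeric feature columns
    # plus a "buy" column ("yes"/"no"); the decision tree is illustrative only.
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, recall_score

    def run_iteration(base_data, yes_fraction, n_records=2000, seed=0):
        """Sample n_records at the requested "yes" proportion, train, and score."""
        yes_pool = base_data[base_data["buy"] == "yes"]
        no_pool = base_data[base_data["buy"] == "no"]
        n_yes = int(n_records * yes_fraction)
        sample = pd.concat([
            yes_pool.sample(n_yes, replace=True, random_state=seed),
            no_pool.sample(n_records - n_yes, replace=True, random_state=seed),
        ])
        X, y = sample.drop(columns="buy"), sample["buy"]
        # Split so that train and test keep the same class proportion (stratify=y).
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        model = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
        pred = model.predict(X_test)
        return {
            "yes_fraction": yes_fraction,
            "overall_accuracy": accuracy_score(y_test, pred),
            # Per-class accuracy is simply the recall of that class.
            "no_accuracy": recall_score(y_test, pred, pos_label="no"),
            "yes_accuracy": recall_score(y_test, pred, pos_label="yes"),
        }

    # One run per class mix; plotting the three measures against yes_fraction
    # gives the kind of chart described in the post.
    results = [run_iteration(base_data, f) for f in np.arange(0.1, 1.0, 0.1)]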

The results are shown in this chart. The X-axis shows the percentage of "Yes" records in the data for that iteration; the three lines show the accuracy levels being measured.

Comment by leonardo auslender on March 24, 2015 at 3:36am

Precision is the probability of an event among those that you forecast as events, and similarly for non-events. It's the Bayesian posterior of accuracy, and in common usage it's the foundation of lift tables. Accuracy is important for models where deployment is not terribly important, such as initial research in clinical or epidemiological settings.
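A small made-up illustration of the distinction between accuracy and precision for the "yes" (buy) class, using scikit-learn metrics; the labels below are invented for illustration only.

    # Accuracy vs. precision on ten made-up records.
    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = ["no", "no", "no", "no", "no", "no", "no", "no", "yes", "yes"]
    y_pred = ["no", "no", "no", "no", "no", "no", "no", "yes", "yes", "no"]

    accuracy_score(y_true, y_pred)                    # 0.8 (8 of 10 records correct)
    precision_score(y_true, y_pred, pos_label="yes")  # 0.5 (1 of 2 predicted "yes" is a real "yes")
    recall_score(y_true, y_pred, pos_label="yes")     # 0.5 (1 of 2 real "yes" records is found)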

I won't comment on clustering, I don't know where that goes.

Comment by Brendan White on March 23, 2015 at 12:24pm

Leonardo,

Precision can tell you that your model needs an adjustment, but the goal is still accuracy, right?

Also, I would be very interested in an approach to measure a model's precision. All I can think of would be Ward clustering a distance matrix of the accuracy scores to look for population groups, then figuring out why the model does not work on a particular group.

I was just thinking...

Would this be a good modelling approach: cluster out the population groups based on their precision with respect to the original model, and then run separate models on those clusters?

For future predictions you would have to classify the new data into those clusters. Would this be a modelling disaster or something worth experimenting with on Kaggle? 
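A rough sketch of what that could look like, purely as a thought experiment: it assumes numeric arrays X_train, X_test and labels y_train (hypothetical names not in the post), and uses KMeans in place of Ward clustering so that new records can be assigned to a cluster at prediction time.

    # Cluster-then-model sketch: one classifier per cluster, then route new rows
    # to the model of the cluster they are assigned to.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    n_clusters = 3
    clusterer = KMeans(n_clusters=n_clusters, random_state=0).fit(X_train)

    # One classifier per cluster, trained only on that cluster's rows.
    models = {}
    for c in range(n_clusters):
        rows = clusterer.labels_ == c
        models[c] = LogisticRegression(max_iter=1000).fit(X_train[rows], y_train[rows])

    # New data: assign each row to a cluster, then score it with that cluster's model.
    test_clusters = clusterer.predict(X_test)
    predictions = np.empty(len(X_test), dtype=object)
    for c in range(n_clusters):
        rows = test_clusters == c
        if rows.any():
            predictions[rows] = models[c].predict(X_test[rows])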

Comment by leonardo auslender on March 23, 2015 at 6:47am

Some observations:

1) accuracy is not that interesting in a business setting. Can you give measures of precision?

2) you don't mention which classifier you used, or maybe I skipped that part.

3) in relation to 2), Owen in a 2007 paper shows that logistic regression, at least with a fixed model, does not suffer from imbalance.

Thanks.

Comment by Kumaran Ponnambalam on March 22, 2015 at 6:45pm

Brendan,

Great idea. I will try that.

Comment by Brendan White on March 22, 2015 at 11:48am

"The training and testing sets will retain the same proportion of class values."

Great article here. Wouldn't it be a stronger argument if you kept the testing set the same throughout (e.g., keep it 50% "yes" and 50% "no")? This way overall accuracy would be more effectively measured. Overall accuracy would then (hopefully) peak at the 50% mark and not at the less-than-10% and greater-than-90% marks. It's misleading to see overall accuracy peak at the ends of the spectrum.
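A minimal sketch of that suggestion, reusing the hypothetical base_data DataFrame from the sketch in the article body: carve out one fixed 50-50 test set up front, then vary the class mix only in the remaining training pool.

    # Hold out a fixed, balanced test set once; all later iterations draw their
    # training samples from the remaining pool only.
    import pandas as pd

    def make_balanced_test_set(base_data, n_test=500, seed=0):
        """Hold out a fixed test set with equal numbers of "yes" and "no" records."""
        yes_test = base_data[base_data["buy"] == "yes"].sample(n_test // 2, random_state=seed)
        no_test = base_data[base_data["buy"] == "no"].sample(n_test // 2, random_state=seed)
        test_set = pd.concat([yes_test, no_test])
        training_pool = base_data.drop(test_set.index)  # everything else feeds the training samples
        return test_set, training_pool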
