When we build classification models from training data, the proportion of target classes affects the accuracy of the predictions. This is an experiment to measure how large that effect is.
Let us say you are trying to predict which visitors to your website will buy a product. You collect historical data about each visitor's characteristics and actions, and also whether they bought something or not. This is the model-building data set. The "Buy Decision" variable becomes the target variable we are trying to predict. It has two possible values: "yes" and "no". If 70% of the records in the training data set have "no" in them, then the proportion of classes is 70-30 between "no" and "yes".
If we build a model using this data set, what is the impact of this proportion on the overall accuracy of predictions made with the model? Will the accuracy be higher if the ratio is 50-50 than if it is 90-10? To test this, we performed multiple iterations of classification using this base data set. For each iteration, we drew a random sample from the base data set with a different proportion between "no" and "yes". The total number of records remained the same for all iterations. Then we split the sample into training and testing sets, both of which retained the same proportion of class values. We then built a classification model on the training data set and used it to predict the test data set. For each iteration, we measured the following
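The sampling and splitting steps described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used for the experiment: the function names `sample_with_proportion` and `stratified_split`, the toy base data set, the sample size of 1000 and the 30% test fraction are all assumptions made for the example. The article does not name the classifier, so the model-fitting step is left as a comment.

```python
import random

def sample_with_proportion(records, label_of, yes_fraction, total, seed=0):
    """Draw `total` records so that `yes_fraction` of them are labelled "yes"."""
    rng = random.Random(seed)
    yes = [r for r in records if label_of(r) == "yes"]
    no = [r for r in records if label_of(r) == "no"]
    n_yes = round(total * yes_fraction)
    n_no = total - n_yes
    if n_yes > len(yes) or n_no > len(no):
        raise ValueError("base data set too small for this proportion")
    return rng.sample(yes, n_yes) + rng.sample(no, n_no)

def stratified_split(records, label_of, test_fraction=0.3, seed=0):
    """Split into train/test sets that keep the same class proportion."""
    rng = random.Random(seed)
    train, test = [], []
    for label in ("yes", "no"):
        group = [r for r in records if label_of(r) == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Toy base data set of (visitor_id, buy_decision) pairs, purely illustrative.
base = [(i, "yes" if i % 2 == 0 else "no") for i in range(2000)]
label_of = lambda record: record[1]

for yes_fraction in (0.1, 0.3, 0.5, 0.7, 0.9):
    subset = sample_with_proportion(base, label_of, yes_fraction, total=1000)
    train, test = stratified_split(subset, label_of)
    # A classifier would be fitted on `train` and scored on `test` here.
    test_share = sum(1 for r in test if label_of(r) == "yes") / len(test)
    print(yes_fraction, len(train), len(test), round(test_share, 2))
```

Splitting each class separately is what keeps the "yes" share identical in the training and testing sets, which is the property the experiment relies on.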
The results are shown in this chart. The X-axis shows the % of "Yes" in the data for that iteration. The 3 lines show the various accuracy levels being measured
Comment
Precision is the probability of an event among the cases that you forecast as events, and similarly for non-events. It's the Bayesian posterior of accuracy, and in common usage, it's the foundation of lift tables. Accuracy is important in models in which deployment is not terribly important, such as initial research in clinical, epidemiological, etc.
I won't comment on clustering, I don't know where that goes.
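To make the distinction between accuracy and precision in the comment above concrete, here is a small sketch in plain Python. The labels are made up for illustration and are not results from the experiment; the helper names `confusion_counts`, `accuracy` and `precision` are hypothetical.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Return (tp, fp, tn, fn), treating `positive` as the event class."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    """Share of all records that were predicted correctly."""
    return (tp + tn) / (tp + fp + tn + fn)

def precision(tp, fp):
    """Share of forecast events that really were events."""
    return tp / (tp + fp) if (tp + fp) else float("nan")

# Made-up 90-10 data set: 90 "no" records followed by 10 "yes" records.
actual = ["no"] * 90 + ["yes"] * 10
predicted = ["no"] * 85 + ["yes"] * 5 + ["yes"] * 4 + ["no"] * 6

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="yes")
print("accuracy:", accuracy(tp, fp, tn, fn))
print("precision for 'yes':", round(precision(tp, fp), 3))

# The same idea for non-events: treat "no" as the forecast class.
tp_no, fp_no, _, _ = confusion_counts(actual, predicted, positive="no")
print("precision for 'no':", round(precision(tp_no, fp_no), 3))
```

In this made-up case the overall accuracy is high while fewer than half of the forecast "yes" records are real buyers, which is the kind of gap a precision measure exposes.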
Leonardo,
Precision can tell you that your model needs an adjustment, but the goal is still accuracy, right?
Also, I would be very interested in an approach to measure a model's precision. All I can think of would be Ward clustering a distance matrix of the accuracy scores to look for population groups, then figuring out why the model does not work on that particular group.
I was just thinking...
Would this be a good modelling approach: cluster out the population groups based on their precision under the original model, and then run separate models on those clusters?
For future predictions you would have to classify the new data into those clusters. Would this be a modelling disaster or something worth experimenting with on Kaggle?
Some observations:
1) accuracy is not that interesting in a business setting. Can you give measures of precision?
2) you don't mention which classifier you used, or maybe I skipped that part.
3) in relation to 2), Owen in a 2007 paper shows that logistic regression, at least with a fixed model, does not suffer from imbalance.
Thanks.
Brendan,
Great idea. I will try that.
"The training and testing sets will retain the same proportion of class values."
Great article here. Wouldn't it be a stronger argument if you kept the testing set the same throughout (e.g. keep it 50% "yes" and 50% "no")? This way overall accuracy would be measured more effectively. Overall accuracy would then (hopefully) peak at the 50% mark and not at the less-than-10% and greater-than-90% marks. It's misleading to see overall accuracy peak at the ends of the spectrum.
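The point about the test set can be illustrated with a trivial baseline. The sketch below is an assumption-laden illustration, not the article's model: it considers a hypothetical classifier that always predicts the majority class of its training data, and compares its accuracy on a test set that mirrors the training proportion with its accuracy on a fixed 50-50 test set. The function name `majority_baseline_accuracy` is made up for the example.

```python
def majority_baseline_accuracy(train_yes_fraction, test_yes_fraction):
    """Accuracy of a model that always predicts the training majority class."""
    predicts_yes = train_yes_fraction > 0.5
    # If it always says "yes", it is right on every "yes" test record;
    # otherwise it is right on every "no" test record.
    return test_yes_fraction if predicts_yes else 1.0 - test_yes_fraction

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    mirrored = majority_baseline_accuracy(p, p)    # test set keeps proportion p
    balanced = majority_baseline_accuracy(p, 0.5)  # test set fixed at 50-50
    print(f"yes={p:.0%}  mirrored test: {mirrored:.2f}  balanced test: {balanced:.2f}")
```

With a mirrored test set, even this uninformative baseline scores highest at the extremes, whereas on a fixed balanced test set it stays at 0.5 regardless of the training proportion. That is one reason a peak in overall accuracy at the ends of the spectrum is hard to interpret on its own.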