When we build classification models from training data, the proportion of target classes does impact the accuracy of predictions. This is an experiment to measure how large that impact is.
Let us say you are trying to predict which visitors to your website will buy a product. You collect historical data about each visitor's characteristics and actions, along with whether they bought something or not. This is the model-building data set. The "Buy Decision" variable becomes the target variable we are trying to predict. It has two possible values: "yes" and "no". If 70% of the records in the training data set have "no" in them, then the proportion of classes is 70-30 between "no" and "yes".
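As a minimal sketch of what is meant by class proportions, the snippet below computes them for a small, made-up list of "Buy Decision" labels (the data values are illustrative, not from the actual experiment):

```python
from collections import Counter

# Hypothetical "Buy Decision" labels for ten site visitors (illustrative only)
buy_decision = ["no", "no", "no", "no", "no", "no", "no", "yes", "yes", "yes"]

counts = Counter(buy_decision)
proportions = {label: count / len(buy_decision) for label, count in counts.items()}
print(proportions)  # {'no': 0.7, 'yes': 0.3} — a 70-30 split
```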
If we build a model using this data set, what is the impact of this proportion on the overall accuracy of its predictions? Will accuracy be higher with a 50-50 ratio than with 90-10? To test this, we performed multiple iterations of classification using the base data set. For each iteration, we drew a random sample from the base data set with a different proportion of "no" to "yes"; the total number of records remained the same across iterations. We then split the sample into training and testing sets, with both sets retaining the same proportion of class values. Finally, we built a classification model on the training set and predicted on the test set. For each iteration, we measured several accuracy levels.
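The procedure above can be sketched as follows. This is only an illustration under stated assumptions: the original base data set is not available, so a synthetic two-feature stand-in is generated, and a decision tree is used as a placeholder classifier (the article does not name the model type). The structure of the loop, however, mirrors the experiment: resample with a chosen "yes" proportion, split with stratification so train and test keep that proportion, fit, predict, and record accuracy.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)

def make_base_data(n=5000):
    """Synthetic stand-in for the historical visitor data (assumption:
    two numeric features weakly separate buyers from non-buyers)."""
    y = rng.integers(0, 2, size=n)                      # 0 = "no", 1 = "yes"
    X = rng.normal(loc=y[:, None], scale=2.0, size=(n, 2))
    return X, y

def sample_with_proportion(X, y, pct_yes, n_total=1000):
    """Draw a fixed-size subset with the requested 'yes' proportion,
    keeping the total record count constant across iterations."""
    n_yes = int(n_total * pct_yes)
    yes_idx = rng.choice(np.where(y == 1)[0], n_yes, replace=False)
    no_idx = rng.choice(np.where(y == 0)[0], n_total - n_yes, replace=False)
    idx = np.concatenate([yes_idx, no_idx])
    return X[idx], y[idx]

X_base, y_base = make_base_data()
results = {}
for pct_yes in [0.1, 0.3, 0.5]:
    X, y = sample_with_proportion(X_base, y_base, pct_yes)
    # stratify=y makes train and test retain the same class proportions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    results[pct_yes] = accuracy_score(y_te, model.predict(X_te))

print(results)  # accuracy per "yes" proportion
```

Per-class accuracy (e.g. via `sklearn.metrics.recall_score` for each class) could be recorded in the same loop to reproduce the additional accuracy lines in the chart.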
The results are shown in this chart. The X-axis shows the percentage of "yes" records in the data for each iteration, and the three lines show the accuracy levels being measured.