Optimal Binning for Scoring Modeling (R Package)

What is Binning?

Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and of its relationship with a binary target variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

Why Binning?

Though there is some reticence toward it [1], the benefits of binning are fairly straightforward:

  • It allows missing data and other special cases (e.g., division by zero) to be included in the model.
  • It controls or mitigates the impact of outliers on the model.
  • It solves the issue of characteristics having different scales, making the weights of the coefficients in the final model comparable.
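The first two benefits can be sketched with a minimal example. The function below is purely illustrative (its name and bin labels are not from any package): missing values get their own "Missing" bin, and any value beyond the last edge lands in the same open-ended top bin, so an extreme outlier carries no more weight than any other high value.

```python
# Hypothetical sketch: binning a feature with missing values and an outlier.
def bin_feature(values, edges):
    """Assign each value a bin label; None goes to a dedicated 'Missing' bin.
    Values beyond the last edge fall into the open top bin, capping outliers."""
    labels = []
    for v in values:
        if v is None:
            labels.append("Missing")
            continue
        for e in edges:
            if v <= e:
                labels.append(f"<= {e}")
                break
        else:
            labels.append(f"> {edges[-1]}")  # outliers share the top bin
    return labels

income = [1200, None, 3400, 250000, 800]  # 250000 is an outlier
print(bin_feature(income, edges=[1000, 5000]))
# -> ['<= 5000', 'Missing', '<= 5000', '> 5000', '<= 1000']
```

Once each record carries a bin label instead of a raw value, the missing records and the outlier contribute to the model through their bins like everyone else.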

Unsupervised Discretization
Unsupervised Discretization divides a continuous feature into groups (bins) without taking any other information into account. It is essentially a partition with two options: equal-length intervals and equal-frequency intervals.
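The two options can be contrasted with a short sketch (function names are illustrative, not from any package): equal-length cuts the range into intervals of identical width, while equal-frequency places the edges at quantiles so each bin holds roughly the same number of records.

```python
def equal_length_edges(values, k):
    """k intervals of identical width spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def equal_frequency_edges(values, k):
    """Edges at the empirical quantiles, so each bin holds ~n/k records."""
    s = sorted(values)
    n = len(s)
    return [s[0]] + [s[(n * i) // k - 1] for i in range(1, k)] + [s[-1]]

x = [1, 2, 2, 3, 4, 8, 9, 20]
print(equal_length_edges(x, 4))     # -> [1.0, 5.75, 10.5, 15.25, 20.0]
print(equal_frequency_edges(x, 4))  # -> [1, 2, 3, 8, 20]
```

Note how the equal-length edges leave the upper intervals nearly empty on this skewed sample, while the equal-frequency edges follow the data's density.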

Equal length intervals

  • Objective: Understand the distribution of a variable. 
  • Example: The classic histogram, whose equal-length bins can be sized using different rules (Sturges, Rice, and others).
  • Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.
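As one example of these sizing rules, Sturges' rule sets the number of equal-length bins from the sample size alone, as ceil(log2(n)) + 1:

```python
import math

# Sturges' rule for the number of equal-length histogram bins.
# Rice and Freedman-Diaconis are common alternatives.
def sturges_bins(n):
    return math.ceil(math.log2(n)) + 1

print(sturges_bins(1000))  # -> 11 bins for 1,000 records
```

Because such rules ignore the target variable entirely, nothing prevents one of the resulting bins from holding too few goods or bads for a valid calculation, which is exactly the disadvantage noted above.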


Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.

Read the full article here. For more about optimal binning, read my new article here.




© 2021   TechTarget, Inc.
