
Tutorial: How to determine the quality and correctness of classification models? Introduction

What is classification?

Classification is the process of assigning every object from a collection to exactly one class from a known set of classes.
Examples of classification tasks are:

  • assigning a patient (the object) to the group of healthy or ill people (the classes) on the basis of his or her medical record,
  • determining a customer’s (the object) credibility during a credit application using, for example, demographic and financial data; in this case the classes are "credible" and "not credible",
  • determining whether a customer (the object) is likely to stop using the company’s services or products on the basis of behavioral and demographic data; in this case the classes are "disloyal customers" and "loyal customers".

How are classification models created?

1. Data preparation (importing, processing, exploration and statistical analysis)

    This stage divides the data into two or three parts:

  • training data – will be used to build the model
  • validation data (in more complex cases) – will be used for evaluation of model quality during its creation
  • testing data – will be used to establish the final quality of the model

2. Model creation (using training and, optionally, validation data)

3. Model quality assessment (testing the created model on testing data)

4. Model application and subsequent monitoring (periodic checks that the quality of predictions does not deteriorate over time, for instance due to demographic or market changes)
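The data split from step 1 could be sketched as follows. This is a minimal illustration in plain Python, not the authors' procedure: the function name split_dataset, the 60/20/20 proportions and the fixed random seed are all assumptions chosen for the example.

```python
import random


def split_dataset(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and cut them into training, validation and testing parts.

    The fractions and the seed are illustrative assumptions, not values
    prescribed by the tutorial.
    """
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave a non-empty testing part")

    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)

    train = shuffled[:n_train]                   # used to build the model
    valid = shuffled[n_train:n_train + n_valid]  # used during model creation
    test = shuffled[n_train + n_valid:]          # used for the final assessment
    return train, valid, test


# Hypothetical data set: 100 customer identifiers.
customers = list(range(100))
train, valid, test = split_dataset(customers)
print(len(train), len(valid), len(test))
```

Setting valid_frac=0 gives the simpler two-part split (training and testing only). Every observation lands in exactly one part, so the testing data never takes part in building the model.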

What indicators can be used to determine the quality of classification models?

There are two kinds of indicators that can be used to estimate the quality of classification models:

  • Quantitative quality indicators – statistics which express the quality of classification using numerical values.
  • Graphical indicators – the quality of classification is represented on a graph which combines selected quantitative indicators. Graphical methods simplify model quality assessment and visualize classification results. Such indicators include:
    • Confusion matrix
    • ROC curve
    • LIFT chart

Basic notions used in the assessment of the quality of classification models

Binary and multiclass classification

Binary classification:
  • one class is defined as positive (also known as the target class, rare class or minority class)
  • the other class is defined as negative (also known as the normal class)
Multiclass classification:
  • one class is defined as positive
  • all the other classes combined are defined as negative

The positive class should contain the objects that the model is meant to identify: for example, in churn modeling the positive class consists of resigning customers; in credit scoring projects the positive class consists of customers who defaulted on their debts. (In both cases the negative class consists of the remaining customers.)
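For the multiclass case, combining all non-positive classes into one negative class amounts to relabeling the data. The sketch below illustrates this; the helper name to_binary and the segment labels are hypothetical and serve only as an example.

```python
def to_binary(labels, positive_class):
    """Map each label to 1 (positive class) or 0 (all other classes combined)."""
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical three-class customer segments; "churned" is the class
# we want the model to identify, so it becomes the positive class.
segments = ["churned", "active", "paused", "churned", "active"]
binary = to_binary(segments, positive_class="churned")
print(binary)  # [1, 0, 0, 1, 0]
```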

TP, TN, FP, FN

  • TP – True Positive – the number of observations correctly assigned to the positive class.
    Example: the model’s predictions are correct and resigning customers have been assigned to the class of "disloyal" customers.
  • TN – True Negative – the number of observations correctly assigned to the negative class.
    Example: the model’s predictions are correct and customers who continue using the service have been assigned to the class of "loyal" customers.
  • FP – False Positive – the number of observations assigned by the model to the positive class which in reality belong to the negative class.
    Example: the model made a mistake and some customers who continue using the service have been assigned to the class of "disloyal" customers.
  • FN – False Negative – the number of observations assigned by the model to the negative class which in reality belong to the positive class.
    Example: the model made a mistake and some churning customers have been assigned to the class of "loyal" customers.
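The four counts can be obtained by comparing the actual and predicted class of every observation. The following is a minimal sketch under the assumption that the positive class is encoded as 1 and the negative class as 0; the function name confusion_counts and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP and FN for a binary classification result."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, really positive
        elif p != positive and a != positive:
            tn += 1  # predicted negative, really negative
        elif p == positive:
            fp += 1  # predicted positive, really negative
        else:
            fn += 1  # predicted negative, really positive
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical churn example: 1 = "disloyal" (positive), 0 = "loyal" (negative).
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
```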

For a perfect classifier (i.e. every observation has been correctly classified) we would have:
FP = 0
FN = 0
TP = number of all observations from the positive class
TN = number of all observations from the negative class

Pos = TP + FN – number of all observations which in reality belong to the positive class
Neg = FP + TN – number of all observations which in reality belong to the negative class
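The relations above can be checked with a few lines of code. This is only an illustrative sketch: the function names class_totals and is_perfect and the example counts are assumptions, not part of the original tutorial.

```python
def class_totals(tp, tn, fp, fn):
    """Return (Pos, Neg): the real sizes of the positive and negative class."""
    pos = tp + fn  # all observations that really are positive
    neg = fp + tn  # all observations that really are negative
    return pos, neg


def is_perfect(tp, tn, fp, fn):
    """A perfect classifier makes no mistakes: FP = 0 and FN = 0."""
    return fp == 0 and fn == 0


# Hypothetical imperfect model: one false positive and one false negative.
pos, neg = class_totals(tp=2, tn=4, fp=1, fn=1)
print(pos, neg)                             # 3 5
print(is_perfect(tp=2, tn=4, fp=1, fn=1))   # False

# For a perfect model on the same data, TP equals Pos and TN equals Neg.
print(class_totals(tp=3, tn=5, fp=0, fn=0) == (3, 5))  # True
print(is_perfect(tp=3, tn=5, fp=0, fn=0))              # True
```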

You now have basic knowledge about classification models.


You can follow us at @Algolytics


Comment


Comment by Sione Palu on May 19, 2015 at 8:53am

Classification schemes keep evolving and improving with recent publications. These recent techniques include multi-output classification, i.e., there are two or more response variables, in contrast to standard classification with just a single variable, say Y. Multi-class MIMO SVR (multi-input multi-output support vector regression) is one of those new techniques; e.g., the multi-output could be 3 variables (Gender, Age-bracket, Earning-bracket), denoted as [G, A, E], where gender is 2-class (male, female), age-bracket is multiclass (student, young-adult, adult, retired) and earning-bracket is also multiclass. MIMO SVR can predict the class labels of all 3 output variables at once. Other multiclass MIMO schemes include CANFIS (Co-Active Neuro-Fuzzy Inference System) and its variants. CANFIS is still popular in engineering (control systems design, signal processing), but recent publications show applications outside that domain. Classification (single output only) using multi-mode/multi-dimensional tensor data has started to appear in the literature in recent years, such as "Nonparametric Bayes tensor factorizations". IMO, the sophistication of classification schemes will keep evolving.

Comment by Richard Ordowich on May 18, 2015 at 7:08am

Many classification schemes were not designed using any rigorous technique such as taxonomy, and many have evolved over time without anyone examining the cohesiveness of the terms and their meanings. Some classifications are ill-defined and ambiguous. Before using classifications it is important to examine their quality.

Reverse engineering classifications in a taxonomy and relating the various taxonomies in an ontology helps to expose the anomalies, inconsistencies, duplications, various interpretations and errors in the data. This process helps to ensure that the classification data as it exists is better understood and errors are exposed. Only then can the classification data be used.

Comment by Nagaraj Kulkarni on May 14, 2015 at 8:19pm

Thanks for a short and sweet insight into data classification.
