
Among the many decisions you’ll have to make when building a predictive model is whether your business problem is a classification or an approximation task. It’s an important decision because it determines which group of methods you choose from to create the model: classification methods (decision trees, Naive Bayes) or approximation methods (regression trees, linear regression).

This short tutorial will help you make the right decision.

Classification looks for patterns in similar observations from the past and tries to find the ones that consistently correspond to membership in a certain category. Typical questions it can answer include:

- Is a particular email spam? Example categories: “SPAM” & “NOT SPAM”
- Will a particular client buy a product if offered? Example categories: “YES” & “NO”
- What range of success will a particular investment have? Example categories: “Less than 10%”, “10%-20%”, “Over 20%”
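To make the spam example concrete, here is a minimal sketch of a Naive Bayes classifier, one of the classification methods mentioned above. The training emails and word lists are invented purely for illustration:

```python
from collections import Counter
import math

# Tiny made-up training set: (words in the email, label).
train = [
    (["win", "cash", "now"], "SPAM"),
    (["cheap", "cash", "offer"], "SPAM"),
    (["meeting", "schedule", "update"], "NOT SPAM"),
    (["project", "schedule", "notes"], "NOT SPAM"),
]

# Count word frequencies per class, and how often each class occurs.
word_counts = {"SPAM": Counter(), "NOT SPAM": Counter()}
class_counts = Counter()
for words, label in train:
    word_counts[label].update(words)
    class_counts[label] += 1

vocab = {w for words, _ in train for w in words}

def predict(words):
    """Return the class with the highest log-posterior, using Laplace smoothing."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))
        for w in words:
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict(["cash", "offer"]))     # classified as SPAM
print(predict(["meeting", "notes"]))  # classified as NOT SPAM
```

The same pattern-counting idea scales to real spam filters, though production systems use far larger vocabularies and more careful feature engineering.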

To see how this works, consider a simple example with two predictor variables:

- The researched variable y has two categorical values, coded blue and red. Empty white dots are unknown: they could be either red or blue.
- Two numeric variables, x1 and x2, are represented on the horizontal and vertical axes. An algorithm calculated a function, represented by the black line, such that most of the blue dots lie under the line and most of the red dots lie over it. This “guess” is not always correct, but the error is minimized: only 11 dots are misclassified.
- We can therefore predict that empty white dots over the black line are really red and those under it are blue. When new dots (for example, future observations) appear, we will be able to guess their color as well.

Of course, this is a very simple example; real problems can involve more complicated patterns spread across hundreds of variables, which cannot be represented graphically.
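The line-fitting step described above can be sketched with a perceptron, a classic algorithm for learning a linear boundary. The points below are made up for illustration (the article does not specify which algorithm produced its plot), with blue dots below the diagonal and red dots above it:

```python
# Made-up, linearly separable 2D points mirroring the plot described in the text.
blue = [(2, 1), (4, 2), (6, 3), (8, 5), (5, 2), (7, 4), (9, 6), (3, 1)]
red = [(1, 3), (2, 5), (4, 7), (3, 6), (5, 9), (6, 8), (2, 4), (7, 9)]
data = [(p, -1) for p in blue] + [(p, +1) for p in red]

# Perceptron: learn weights (w1, w2) and bias b so that
# sign(w1*x1 + w2*x2 + b) predicts the class (+1 red, -1 blue).
w1 = w2 = b = 0.0
for _ in range(300):  # enough passes for separable data to converge
    for (x1, x2), y in data:
        if y * (w1 * x1 + w2 * x2 + b) <= 0:  # misclassified: nudge the line
            w1 += y * x1
            w2 += y * x2
            b += y

def predict(x1, x2):
    return "red" if w1 * x1 + w2 * x2 + b > 0 else "blue"

errors = sum(predict(x1, x2) != ("red" if y > 0 else "blue")
             for (x1, x2), y in data)
print(f"misclassified: {errors} of {len(data)}")  # 0 of 16 after training
```

Because this toy data is perfectly separable, the learned line misclassifies nothing; on real data, as in the article's plot, some residual error is expected and the algorithm settles for minimizing it.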

Approximation is used when we want to predict the probable value of a numeric variable for a particular observation. Examples include:

- How much money will my customer spend on a given product in a year?
- What will the market price of apartments be?
- How often will production machines malfunction each month?

Approximation looks for patterns in similar observations from the past and tries to find how they affect the value of the researched variable. Consider a simple example with the following setup:

- A numeric variable y that we want to predict.
- A numeric variable x1 whose value we want to use to predict y.
- A categorical variable x2 with two categories, left and right, that we also want to use to predict y.
- Blue circles represent known observations with known y, x1, and x2.
- Since we can’t plot all three variables on a single 2D plot, we split them into two 2D plots. The left plot shows how the combination of x1 and x2=left relates to y; the right plot shows how the combination of x1 and x2=right relates to y.
- The black line represents how our model predicts the relationship between y and x1 for each variant of x2. Orange circles represent new predictions of y for observations where we only know x1 and x2: we place them at the proper spot on the black line to read off the predicted values. Their distribution is similar to that of the blue circles.
- As can clearly be seen, the distribution, and the pattern connecting y and x1, differs between the two categories of x2.
- When a new observation arrives with known x1 and x2, we will be able to make a new prediction.
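This setup can be sketched with ordinary least squares, dummy-encoding x2 and adding an interaction term so that each category gets its own line, just as the two panels each show their own black line. All the numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical observations: y depends linearly on x1, with a different
# slope and intercept for each category of x2 ("left" vs "right").
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array(["left"] * 5 + ["right"] * 5)
y = np.array([3.1, 5.0, 6.9, 9.2, 11.0,   # left:  roughly y = 2*x1 + 1
              1.9, 2.5, 2.9, 3.6, 4.0])   # right: roughly y = 0.5*x1 + 1.5

# Dummy-encode x2 (1 for "right", 0 for "left"); the interaction column
# x1 * dummy lets the slope differ between the two categories.
d = (x2 == "right").astype(float)
X = np.column_stack([np.ones_like(x1), x1, d, x1 * d])

# Ordinary least squares: coefficients minimizing the squared error.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x1_new, x2_new):
    d_new = 1.0 if x2_new == "right" else 0.0
    return coef[0] + coef[1] * x1_new + coef[2] * d_new + coef[3] * x1_new * d_new

print(round(predict(3.5, "left"), 2))   # about 8.0, on the "left" line
print(round(predict(3.5, "right"), 2))  # about 3.2, on the "right" line
```

With the interaction term included, this single model is equivalent to fitting a separate line for each panel, which is exactly what the plots in the text depict.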

Even if your target variable is numeric, it is sometimes better to use classification methods instead of approximation, for instance when most of your target values are zero and only a few are non-zero. Recode the non-zero values as 1, and you have two categories: 1 (positive target value) and 0. You can also split a numeric variable into several subgroups, for example apartment prices into low, medium, and high bins of equal width, and predict those with classification algorithms. This process is called discretization.
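The equal-width discretization described above can be done in a few lines. The apartment prices here (in thousands) are invented for illustration:

```python
import numpy as np

# Hypothetical apartment prices, in thousands.
prices = np.array([120, 150, 180, 210, 260, 300, 340, 410, 450, 480])

# Three equal-width bins between the minimum and maximum price.
edges = np.linspace(prices.min(), prices.max(), 4)  # 3 bins -> 4 edges
labels = np.array(["low", "medium", "high"])

# np.digitize against the two inner edges gives each price a bin index 0..2.
bins = np.digitize(prices, edges[1:-1])
categories = labels[bins]

print(list(zip(prices.tolist(), categories.tolist())))
```

The resulting "low"/"medium"/"high" labels can then serve as the target for any classification algorithm. Note that equal-width binning is only one option; equal-frequency bins or domain-driven cut points are common alternatives.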


Posted 24 June 2021
