There are numerous aspects of data science that determine a project’s success: from posing the right questions, through identifying and preparing relevant data, to applying suitable analytical techniques and, finally, validating the results. This article focuses on the importance of selecting the appropriate analytical technique by demonstrating how different binary classification algorithms can fail to detect data patterns, using the freely available simulation package *mlsim*, written in R.

One of the most used, and abused, methods of binary classification is logistic regression. Binary logistic regression is frequently applied to classification problems in areas such as biology, medicine, engineering, finance and insurance, in the belief that it can discern between a wide variety of data patterns. Users of binary logistic regression who are not trained in statistics or machine learning are often unaware that the class boundary obtained by estimating its parameters is a hyperplane. Unfortunately, a hyperplane will in many cases poorly delineate the classes of interest for non-linear problems and result in a high rate of classification errors.
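As a minimal sketch outside *mlsim* (the data here are simulated directly, with a hypothetical circular class boundary of radius 2, not *mlsim*'s own generator), logistic regression fitted to such a sample can do little better than always predicting the majority class, because its decision boundary is a straight line:

```r
# Simulated two-class sample separated by a circular boundary (radius 2);
# an illustrative sketch, not mlsim's own data generator.
set.seed(1)
n <- 400
x <- runif(n, -3, 3)
y <- runif(n, -3, 3)
cls <- factor(ifelse(x^2 + y^2 < 4, "T", "C"))   # T inside the circle

# Logistic regression: the fitted class boundary is a straight line,
# so it cannot wrap around the circle.
fit  <- glm(cls ~ x + y, family = binomial)
p    <- predict(fit, type = "response")
pred <- factor(ifelse(p > 0.5, "T", "C"), levels = levels(cls))
acc  <- mean(pred == cls)   # close to the majority-class rate
```

With roughly 35% of points inside the circle, the fitted accuracy hovers around the share of the majority class — the linear boundary adds almost nothing.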

My initial code, which eventually became *mlsim*, was just a few examples showing how poorly logistic regression and other linear methods (e.g. SVM with a linear kernel) fit training samples drawn from data with certain non-linear class boundaries.

It is often difficult to explain in simple terms how different machine learning algorithms work, especially to a non-technical audience. To address this, I created the R package *mlsim* as a teaching tool: it first simulates different data patterns and then demonstrates the ability of various machine learning algorithms to fit them. The *mlsim* package is also a useful tool for data scientists trying to build a spatial intuition for different machine learning algorithms.

The *mlsim* program randomly generates data points marked as little triangles (*T*) and circles (*C*), representing two slightly overlapping classes mostly separated by a shaped boundary such as a circle, square or sinusoid (a circle or square may not render exactly as such, depending on the aspect ratio of your plotting area).

*mlsim* can be downloaded from: https://github.com/jacekko/mlsim_zip/blob/master/mlsim_0.0-1.zip.

To install *mlsim*, place the R binary build file *mlsim_0.0-1.zip* in your R working directory and run *install.packages("mlsim_0.0-1.zip", repos=NULL)*. After loading the package with *library(mlsim)*, it can be run by typing *mlsim()* and following the prompts. Instead of automatically downloading all of its dependencies, *mlsim* lets the user skip some of the suggested libraries, in which case the corresponding options will not appear in the list of algorithms available for demonstration.

The program flow is shown in the figure below.

After the selected algorithm is executed, the diagnostic information generated by *confusionMatrix* from the *caret* package is printed, along with a graphical output consisting of two plots, as in the second figure below. The left-hand plot presents the randomly generated data, while the right-hand plot presents the result of the selected fit. In this example, logistic regression completely fails to separate the classes and instead classifies everything as *C*, i.e. circles.

Different results are achieved when other algorithms are used, such as a recursive partitioning tree (*rpart*); see below.
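For comparison, here is the same circular toy pattern fitted with a recursive partitioning tree (assuming the *rpart* package is installed; again an illustrative sketch, not *mlsim* itself). The tree's axis-aligned splits approximate the circle far better than a single straight line:

```r
# Simulated circular pattern fitted with a recursive partitioning tree.
library(rpart)
set.seed(1)
n <- 400
x <- runif(n, -3, 3)
y <- runif(n, -3, 3)
cls <- factor(ifelse(x^2 + y^2 < 4, "T", "C"))   # T inside the circle

fit  <- rpart(cls ~ x + y, method = "class")
pred <- predict(fit, type = "class")
acc  <- mean(pred == cls)   # training accuracy well above the linear fit
```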

Custom class boundaries can be created by defining a function of two variables *x* and *y* whose name ends with the suffix “_sh”; e.g. the circle class boundary was set by:

```r
circle_sh <- function(x, y, r = 3)
{
  return(x^2 + y^2 - r^2)
}
```

where the parameter *r* is the radius of the circle, selected so that the circle fits in the predefined plotting area.
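The sign of the returned value tells the two sides of the boundary apart — presumably *mlsim* assigns one class where the function is negative (inside the circle) and the other where it is positive. A quick check of that convention:

```r
# The circle boundary function from above; points fall on one side or the
# other according to the sign of the value it returns (assumed convention).
circle_sh <- function(x, y, r = 3)
{
  return(x^2 + y^2 - r^2)
}

circle_sh(0, 0)   # negative: inside the circle
circle_sh(3, 0)   # zero: exactly on the boundary
circle_sh(4, 0)   # positive: outside the circle
```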

The reader may try the example below.

```r
myshape_sh <- function(x, y)
{
  # vectorised so it works element-wise over whole grids of points;
  # returns 0 wherever y is 0 to avoid division by zero
  ifelse(y != 0, x / y, 0)
}
```

The function *myshape_sh* will be found by *mlsim* when you restart it and will appear in the list of shapes.

It should be highlighted that although selecting an algorithm that fits the training data well is an important part of a data science project, this does not guarantee it will work well on a new data set. Even if an algorithm is flexible enough to capture complex patterns in one data set, this may be due to over-fitting and may result in poor predictions on new data. Regardless of prior performance, it is critical to split the available data into training and validation subsets and to perform hyper-parameter tuning and cross-validation to minimise the likelihood of over-fitting.
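A minimal sketch of such a hold-out split, reusing the circular toy data and *rpart* from the earlier sketches (assumed installed): the model is fitted on 70% of the data and judged only on the 30% it has not seen:

```r
# Hold-out validation sketch: fit on 70% of the data, report accuracy
# on the remaining 30% (illustrative; mlsim itself fits training data only).
library(rpart)
set.seed(1)
n <- 600
d <- data.frame(x = runif(n, -3, 3), y = runif(n, -3, 3))
d$cls <- factor(ifelse(d$x^2 + d$y^2 < 4, "T", "C"))

idx  <- sample(n, 0.7 * n)                        # 70% training indices
fit  <- rpart(cls ~ x + y, data = d[idx, ], method = "class")
pred <- predict(fit, newdata = d[-idx, ], type = "class")
acc  <- mean(pred == d$cls[-idx])                 # validation accuracy
```

The validation accuracy, not the training accuracy, is the honest estimate of how the model will behave on new data.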

© 2020 Data Science Central