Many aspects of data science determine a project’s success: posing the right questions, identifying and preparing relevant data, applying suitable analytical techniques, and finally validating the results. This article focuses on the importance of selecting an appropriate analytical technique. It demonstrates how different binary classification algorithms can fail to detect data patterns, using mlsim, a freely available simulation package written in R.
One of the most used, and abused, methods of binary classification is logistic regression. Binary logistic regression is frequently applied to classification problems in areas such as biology, medicine, engineering, finance and insurance, in the belief that it can discern a wide variety of data patterns. Users of binary logistic regression who are not trained in Statistics or Machine Learning are often unaware that the class boundary obtained by estimating the parameters is a hyper-plane. Unfortunately, a hyper-plane will, in many cases, poorly delineate the classes of interest in non-linear problems and result in a high rate of classification errors.
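The hyper-plane claim follows directly from the form of the model. As a brief sketch, assuming the usual 0.5 probability threshold and a linear predictor with no interaction or polynomial terms (the symbols below are introduced here for illustration and do not come from mlsim):

```latex
P(Y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x})}},
\qquad
P(Y = 1 \mid \mathbf{x}) = \tfrac{1}{2}
\iff
\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x} = 0
```

The right-hand condition is linear in x, so in two dimensions the boundary between the predicted classes is a straight line, whatever values the estimated coefficients take.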
My initial code, which eventually became mlsim, was just a few examples showing how poorly logistic regression and other linear methods (e.g. SVM with linear kernel) were able to fit training samples when applied to data with certain non-linear boundaries.
It is often difficult to explain in simple terms how different machine learning algorithms work, especially to a non-technical audience. To address this issue, I have created the R package mlsim as a teaching tool, which first simulates different data patterns and then demonstrates the ability of various Machine Learning algorithms to fit the data. The mlsim package is also a useful tool for data scientists trying to build a spatial intuition of different Machine Learning algorithms.
The mlsim program randomly generates data points marked as little triangles (T) and circles (C), representing two slightly overlapping classes mostly separated by a shaped boundary, such as a circle, square, sinusoid etc. (a circle or square may not be rendered exactly as such, depending on the aspect ratio of your plotting area).
mlsim can be downloaded from: https://github.com/jacekko/mlsim_zip/blob/master/mlsim_0.0-1.zip.
To install mlsim, place the R binary build file mlsim_0.0-1.zip in your R working directory and run install.packages("mlsim_0.0-1.zip", repos=NULL). After loading the package with library(mlsim), run it by typing mlsim() and following the prompts. mlsim does not download all of its dependencies automatically: the user may choose to skip some of the suggested libraries, in which case the corresponding options will not appear in the list of algorithms available for demonstration.
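The installation steps can be collected into a short script. This is a minimal sketch that assumes the zip file has already been downloaded into the current working directory; the file-existence guard and the variable name pkg_file are additions for illustration, not part of mlsim.

```r
# Name of the downloaded binary build, assumed to sit in the working directory
pkg_file <- "mlsim_0.0-1.zip"

if (file.exists(pkg_file)) {
  # repos = NULL tells R to install from the local file, not from a repository
  install.packages(pkg_file, repos = NULL)
  library(mlsim)
  mlsim()  # interactive: follow the prompts
} else {
  message("File not found in ", getwd(), ": ", pkg_file)
}
```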
The program flow is shown in the figure below.
After the selected algorithm is executed, the diagnostic information generated by confusionMatrix from the caret package is printed, along with graphical output consisting of two plots, as per the second figure below. The left-hand plot presents the randomly generated data, while the right-hand plot presents the result of the selected data fit. In this example, logistic regression completely fails to separate the classes, and instead classifies everything as C, i.e. circles.
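A similar diagnostic can be reproduced outside mlsim. The sketch below is not mlsim's own code: it simulates a stand-in data set under assumed settings (a circle of radius 3 inside a 10 x 10 square, 5% label noise, an arbitrary seed), fits a logistic regression with base R's glm, and tabulates predictions against the true classes. The call to caret's confusionMatrix runs only if caret is installed.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Stand-in for mlsim's data (assumed settings): class T inside a circle
# of radius 3, class C outside it
n <- 1000
sim <- data.frame(x = runif(n, -5, 5), y = runif(n, -5, 5))
inside <- sim$x^2 + sim$y^2 < 3^2
flip <- runif(n) < 0.05  # 5% label noise so that the classes overlap slightly
sim$cls <- factor(ifelse(xor(inside, flip), "T", "C"), levels = c("C", "T"))

# Logistic regression with a linear predictor in x and y
fit <- glm(cls ~ x + y, data = sim, family = binomial)
pred <- factor(ifelse(predict(fit, type = "response") > 0.5, "T", "C"),
               levels = c("C", "T"))

# Base-R confusion matrix: rows are predictions, columns are true classes
cm <- table(Predicted = pred, Actual = sim$cls)
print(cm)
accuracy <- sum(diag(cm)) / sum(cm)
cat("Accuracy:", round(accuracy, 3), "\n")

# If caret is available, confusionMatrix() reports further statistics
if (requireNamespace("caret", quietly = TRUE)) {
  print(caret::confusionMatrix(data = pred, reference = sim$cls))
}
```

Because no straight line separates the inside of a circle from the outside, the fitted model can be expected to do little better than always predicting the majority class.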
Different results are achieved when other algorithms are used, such as recursive partitioning tree (rpart); see below.
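For comparison, a recursive partitioning tree can be fitted to the same kind of simulated data. Again this is only a sketch under assumed settings (circle of radius 3, 5% label noise, arbitrary seed) and is not how mlsim generates or fits its data; it uses rpart with its default control parameters.

```r
library(rpart)  # recommended package, shipped with standard R distributions

set.seed(1)  # arbitrary seed, for reproducibility only

# Stand-in data (assumed settings): class T inside a circle of radius 3
n <- 1000
sim <- data.frame(x = runif(n, -5, 5), y = runif(n, -5, 5))
inside <- sim$x^2 + sim$y^2 < 3^2
flip <- runif(n) < 0.05  # 5% label noise
sim$cls <- factor(ifelse(xor(inside, flip), "T", "C"), levels = c("C", "T"))

# Classification tree: axis-parallel splits can box in the circular region
tree <- rpart(cls ~ x + y, data = sim, method = "class")
pred_tree <- predict(tree, newdata = sim, type = "class")

cm_tree <- table(Predicted = pred_tree, Actual = sim$cls)
print(cm_tree)
acc_tree <- sum(diag(cm_tree)) / sum(cm_tree)
cat("Tree accuracy on the training data:", round(acc_tree, 3), "\n")
```

The tree approximates the circle with a set of rectangles, which is why it can recover a boundary that a single straight line cannot.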
Custom class boundaries can be defined by a function of two variables, x and y, whose name must end with the suffix “_sh”. For example, the circle class boundary is set by:
circle_sh <- function(x, y, r = 3) {
  return(x^2 + y^2 - r^2)
}
where the parameter r is the radius of the circle, chosen so that the circle fits in the predefined plotting area.
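To see what such a function encodes, it helps to evaluate it at a few points. The sketch below assumes that the sign of the returned value indicates the side of the boundary on which a point lies; how mlsim maps each side to triangles or circles is not stated here, so the labels "inside" and "outside" are purely illustrative.

```r
# Boundary function for a circle of radius r centred at the origin
circle_sh <- function(x, y, r = 3) {
  x^2 + y^2 - r^2
}

# Probe points: the centre, a point on the circle, and a point outside it
probe <- data.frame(x = c(0, 3, 4), y = c(0, 0, 4))
probe$value <- circle_sh(probe$x, probe$y)

# Assumption: negative values lie inside the boundary, positive values outside
probe$side <- ifelse(probe$value < 0, "inside",
                     ifelse(probe$value > 0, "outside", "on boundary"))
print(probe)
```

The function is zero exactly on the circle, negative within it and positive beyond it, so the curve where it vanishes is the class boundary.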
The reader may try the example below.
myshape_sh <- function(x, y) {
  # Return x/y where y is non-zero and 0 otherwise (works on vectors too)
  ifelse(y != 0, x / y, 0)
}
The function myshape_sh will be found by mlsim when you restart it and will appear on the list of shapes.
It should be highlighted that, although selecting an algorithm that fits the training data well is an important part of a data science project, a good fit does not guarantee that the algorithm will work well with a new data set. Even if an algorithm is flexible enough to capture complex data patterns in one data set, this may be due to over-fitting and may result in poor predictions when applied to a new data set. Regardless of prior performance, it is critical to split the available data into training and validation subsets and to perform hyper-parameter tuning and cross-validation to minimise the likelihood of over-fitting.
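One way to put this advice into practice is k-fold cross-validation over a small grid of hyper-parameter values. The sketch below is an illustration only, not part of mlsim: the simulated data, the choice of 5 folds and the grid of values for rpart's complexity parameter cp are all assumptions.

```r
library(rpart)  # recommended package, shipped with standard R distributions

set.seed(7)  # arbitrary seed, for reproducibility only

# Stand-in data (assumed settings): class T inside a circle of radius 3
n <- 1000
sim <- data.frame(x = runif(n, -5, 5), y = runif(n, -5, 5))
inside <- sim$x^2 + sim$y^2 < 3^2
flip <- runif(n) < 0.05  # 5% label noise
sim$cls <- factor(ifelse(xor(inside, flip), "T", "C"), levels = c("C", "T"))

# Assign each row to one of k folds at random
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# Hypothetical grid of values for rpart's complexity parameter
cp_grid <- c(0.1, 0.01, 0.001)

# For each cp value, train on k - 1 folds and score on the held-out fold
cv_acc <- sapply(cp_grid, function(cp) {
  fold_acc <- sapply(seq_len(k), function(i) {
    train <- sim[fold != i, ]
    valid <- sim[fold == i, ]
    tree <- rpart(cls ~ x + y, data = train, method = "class",
                  control = rpart.control(cp = cp))
    mean(predict(tree, newdata = valid, type = "class") == valid$cls)
  })
  mean(fold_acc)
})

print(data.frame(cp = cp_grid, cv_accuracy = round(cv_acc, 3)))
best_cp <- cp_grid[which.max(cv_acc)]
cat("cp with the highest cross-validated accuracy:", best_cp, "\n")
```

Because every accuracy figure is computed on rows the tree did not see during fitting, a setting that merely memorises the training data gains no advantage in this comparison.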