This article was written by Natasha Latysheva. Here we publish a short version, with references to full source code in the original article.
Our internal data scientist had a few questions and comments about the article:
For other articles about KNN, click here.
Here's the article (short version)
In machine learning, you may often wish to build predictors that allows to classify things into categories based on some set of associated values. For example, it is possible to provide a diagnosis to a patient based on data from previous patients.
Classification can involve constructing highly non-linear boundaries between classes, as in the case of the red, green and blue classes below:
Many algorithms have been developed for automated classification, and common ones include random forests, support vector machines, Naïve Bayes classifiers, and many types of neural networks. To get a feel for how classification works, we take a simple example of a classification algorithm – k-Nearest Neighbours (kNN) – and build it from scratch in Python 2. You can use a mostly imperative style of coding, rather than a declarative/functional one with lambda functions and list comprehensions to keep things simple if you are starting with Python. Here, we will provide an introduction to the latter approach.
kNN classifies new instances by grouping them together with the most similar cases. Here, you will use kNN on the popular (if idealized) iris dataset, which consists of flower measurements for three species of iris flower. Our task is to predict the species labels of a set of flowers based on their flower measurements. Since you’ll be building a predictor based on a set of known correct classifications, kNN is a type of supervised machine learning (though somewhat confusingly, in kNN there is no explicit training phase; see lazy learning).
The kNN task can be broken down into writing 3 primary functions:
The steps in the following diagram provide a high-level overview of the tasks you’ll need to accomplish in your code.
Briefly, you would like to build a script that, for each input that needs classification, searches through the entire training set for the k-most similar instances. The class labels of the most similar instances should then be summarized by majority voting and returned as predictions for the test cases.
The complete code is at the end of the post. Now, let’s go through the different parts separately and explain what they do.
The following sections as well as the source code are available in the original (quite long) article.
About the author (of the original article)
Natasha is a computational biology PhD student at the MRC Laboratory of Molecular Biology. Her research is focused on cancer genomics, statistical network analysis, and protein structure. More generally, her research interests lie in data-intensive molecular biology, machine learning (especially deep learning) and data science.