This article was written by Kasper Fredenslund.
At the end of the post you will know how to:
Once we have downloaded the data, the first thing we want to do is load it and inspect its structure. For this, we will use pandas.
Pandas is a Python library that gives us a common interface for data processing called a DataFrame. DataFrames are essentially Excel spreadsheets with rows and columns, but without the fancy UI Excel offers. Instead, we do all the data manipulation programmatically.
Pandas also has the added benefit of making it very simple to import data, as it supports many different formats including Excel spreadsheets, CSV files, and even HTML documents.
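As a minimal sketch of loading and inspecting the data (the column layout below mirrors the Iris CSV from Kaggle, but the exact rows are illustrative; in practice you would pass the downloaded file's path, e.g. `pd.read_csv("Iris.csv")`):

```python
import pandas as pd
from io import StringIO

# A few rows in the same shape as the Iris CSV (the values here are
# illustrative stand-ins for the downloaded file).
csv_data = StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "3,6.3,3.3,6.0,2.5,Iris-virginica\n"
)
df = pd.read_csv(csv_data)

# Inspect the structure: first rows and column dtypes.
print(df.head())
print(df.dtypes)
```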
Suppose you want to predict a house's price from a set of features. We can ask ourselves whether it's really important to know how many lamps and power outlets there are; is it something people think about when buying a house? Does it add any information, or is it just data for the sake of data?
Adding a lot of features that don't contain any information makes the model needlessly slow, and you risk the model trying to fit patterns in uninformative features. Furthermore, having many features increases the risk of your model overfitting (more on that later).
As a rule of thumb, you want the smallest set of features that gives you as much information about your data as possible.
It's also possible to combine correlated features, such as number of rooms, living area, and number of windows from the example above, into higher-level principal components (for example, a single "size" component) using techniques such as principal component analysis (PCA). Although we won't be using these techniques in this tutorial, you should know that they exist.
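As a hedged sketch of the idea, here is PCA collapsing three correlated "size-like" features into one component (the features and their distributions are made up for illustration; the tutorial itself never runs PCA):

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulate three correlated features driven by one latent "size" factor.
rng = np.random.default_rng(0)
size = rng.normal(100, 20, 200)              # latent "size"
X = np.column_stack([
    size / 20 + rng.normal(0, 0.5, 200),     # number of rooms
    size + rng.normal(0, 5, 200),            # living area
    size / 10 + rng.normal(0, 1, 200),       # number of windows
])

# Project the three correlated columns onto a single principal component.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                 # one column instead of three
print(pca.explained_variance_ratio_)   # how much variance one component keeps
```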
One useful way of determining the relevance of features is to visualize their relationship to other features by plotting them. Below, we plot the relationship between two features using the
plot.scatter() method.
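A minimal sketch of such a plot (the data here is a small illustrative stand-in, and the PetalLengthCm column name is an assumption based on the Iris CSV's naming scheme):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line when working interactively

# Illustrative stand-in rows; in the tutorial, df comes from the Iris CSV.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0, 1.3, 5.1],
    "PetalWidthCm": [0.2, 1.4, 2.5, 0.2, 1.9],
})

# Plot one feature against the other to see how strongly they relate;
# points falling near a line suggest the features are highly correlated.
ax = df.plot.scatter(x="PetalLengthCm", y="PetalWidthCm")
ax.figure.savefig("petal_scatter.png")
```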
Now that we have selected the features we want to use (PetalWidthCm), we need to prepare the data so we can use it with sklearn.
Currently, all the data is encoded in a DataFrame, but sklearn doesn't work with pandas' DataFrames, so we need to extract the features and labels and convert them into numpy arrays instead.
Separating the labels is quite simple, and can be done in one line.
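One way to sketch that separation (the column names besides PetalWidthCm are assumptions based on the Iris CSV's naming; the rows are illustrative):

```python
import pandas as pd

# Illustrative stand-in for the Iris DataFrame used in the tutorial.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# pop() removes the label column in one line; to_numpy() converts
# both labels and remaining features into plain numpy arrays for sklearn.
labels = df.pop("Species").to_numpy()
features = df.to_numpy()

print(features.shape)
print(labels)
```

(Recent sklearn versions do accept DataFrames directly, but the explicit conversion keeps the feature/label split obvious.)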
When considering our data, a Random Forest classifier stands out as a good starting point: Random Forests are simple, flexible in that they work well with a wide variety of data, and rarely overfit.
One notable downside to Random Forests is that they are non-deterministic in nature, so they don't necessarily produce the same results every time you train them (unless you fix the random seed).
While Random Forests are a good starting point, in practice, you will often use multiple classifiers, and see which ones get good results.
You can limit the guesswork over time by developing a sense for which algorithms generally do well on what problems; of course, doing a first principles analysis from the mathematical expression will help with this as well.
Now that we have chosen a classifier, it's time to implement it.
Implementing a classifier in sklearn follows three steps.
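Those steps, instantiate, fit, and predict/score, can be sketched as follows (using sklearn's bundled copy of the Iris data purely for illustration; the train/test split parameters are assumptions, not the article's):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative data: sklearn's built-in Iris dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 1. Instantiate the classifier (random_state pins down the
#    otherwise non-deterministic training).
clf = RandomForestClassifier(random_state=0)

# 2. Fit it to the training data.
clf.fit(X_train, y_train)

# 3. Predict on held-out data and measure accuracy.
print(clf.score(X_test, y_test))
```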
Even though the test accuracy is 98%, it would be interesting to see what kinds of mistakes the model makes.
There are two ways a classification model can fail to predict the correct result: false positives and false negatives.
Since we are not running a binary classifier (one which predicts "yes" or "no"), but instead a classifier that guesses which of a set of labels applies, every mistake will be both a false positive with respect to some labels and a false negative with respect to others.
In machine learning, we often use precision and recall instead of false positives and false negatives.
Optimizing for precision reduces false positives, whereas optimizing for recall reduces false negatives. Both are fractions between 0 and 1, where higher is better.
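A small sketch of computing both metrics on a multi-class problem (the predictions below are toy values, not our model's output; with more than two classes, precision and recall are computed per class and then averaged, and "macro" weights every class equally):

```python
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predictions for illustration only.
y_true = ["setosa", "versicolor", "virginica", "versicolor", "virginica"]
y_pred = ["setosa", "virginica",  "virginica", "versicolor", "virginica"]

# Macro averaging: compute precision/recall per class, then average.
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
```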
Currently, our Random Forest classifier just uses the default parameter values. However, for increased control, we can change some or all of the values.
One interesting parameter is
min_samples_split. This parameter sets the minimum number of samples required to split an internal node of a decision tree.
Generally speaking, the lower it is, the more detail the model captures, but the higher the likelihood of overfitting. A high value, on the other hand, tends to capture the broad trends while ignoring the little details.
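A quick sketch of that trade-off, again on sklearn's bundled Iris data rather than the tutorial's own split (the specific values 2 and 40 are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A low min_samples_split lets trees keep splitting on tiny groups
# (more detail, more overfitting risk); a high value forces the trees
# to stop earlier and capture only broad trends.
detailed = RandomForestClassifier(min_samples_split=2, random_state=0)
coarse = RandomForestClassifier(min_samples_split=40, random_state=0)

for name, clf in [("min_samples_split=2", detailed),
                  ("min_samples_split=40", coarse)]:
    clf.fit(X_train, y_train)
    print(name,
          "train:", clf.score(X_train, y_train),
          "test:", clf.score(X_test, y_test))
```

Comparing train and test accuracy for the two settings is a simple way to see whether the extra detail is helping or just memorizing the training data.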
To read the full article with source code, click here.