
*This article was written by Kasper Fredenslund.*

At the end of the post you will know how to:

- Import and transform data from a .csv file to use with sklearn
- Inspect the dataset and select relevant features
- Train different classifiers on the data using sklearn
- Analyse the results with the intention of improving your model

Once we have downloaded the data, the first thing we want to do is to load it in and inspect its structure. For this we will use **pandas**.

Pandas is a Python library that gives us a common interface for data processing called a **DataFrame**. DataFrames are essentially Excel spreadsheets with rows and columns, but without Excel's fancy UI. Instead, we do all the data manipulation programmatically.

Pandas also has the added benefit of making it super simple to import data, as it supports many different formats including Excel spreadsheets, CSV files, and even HTML documents.
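As a sketch of this step, here is how loading and inspecting the data might look. The column names match the Kaggle version of the Iris dataset; a small inline CSV stands in for the downloaded file so the snippet is self-contained, but in practice you would pass the path to your `.csv` file to `pd.read_csv()`.

```python
import io
import pandas as pd

# Inline stand-in for the downloaded Iris.csv; in practice:
# df = pd.read_csv("Iris.csv")
csv_data = io.StringIO(
    "Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species\n"
    "1,5.1,3.5,1.4,0.2,Iris-setosa\n"
    "2,7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "3,6.3,3.3,6.0,2.5,Iris-virginica\n"
)
df = pd.read_csv(csv_data)

# head() and dtypes are the quickest way to inspect the structure.
print(df.head())
print(df.dtypes)
```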


Suppose you want to predict a house's price from a set of features. We can ask ourselves if it's really important to know how many lamps, and power outlets there are; is it something people think about when buying a house? Does it add any information, or is it just data for the sake of data?

Adding a lot of features that don't contain any information makes the model needlessly slow, and you risk confusing the model into fitting features that carry no signal. Furthermore, having many features increases the risk of your model overfitting (more on that later).

As a rule of thumb, you want the smallest set of features that gives you as much information about your data as possible.

It's also possible to combine correlated features such as number of rooms, living area, and number of windows from the example above into higher level principal components, for example size, using combination techniques such as principal component analysis (PCA). Although we won't be using these techniques in this tutorial, you should know that they exist.
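To make the idea concrete, here is a minimal sketch of PCA on synthetic housing-style data. The "hidden size factor" and the noise levels are invented for illustration; the point is only that one principal component can summarize several correlated columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical correlated features driven by one hidden "size" factor.
rng = np.random.default_rng(0)
size = rng.normal(100, 20, 200)
houses = np.column_stack([
    size / 25 + rng.normal(0, 0.5, 200),  # number of rooms
    size + rng.normal(0, 5, 200),         # living area
    size / 10 + rng.normal(0, 0.8, 200),  # number of windows
])

# A single principal component captures most of the shared variance,
# effectively recovering the underlying "size" feature.
pca = PCA(n_components=1)
component = pca.fit_transform(houses)
print(pca.explained_variance_ratio_[0])
```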

One useful way of determining the relevance of features is to visualize their relationships to each other by plotting them. Below, we plot one feature against another using the DataFrame's `plot.scatter()` method.
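A minimal sketch of that call, using a few stand-in rows in place of the loaded dataset (the headless `Agg` backend is only there so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed interactively
import pandas as pd

# Stand-in rows; in the tutorial, df comes from pd.read_csv.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
})

# plot.scatter draws one column against another and returns the Axes.
ax = df.plot.scatter(x="PetalLengthCm", y="PetalWidthCm")
```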

Now that we have selected the features we want to use (`PetalLengthCm` and `PetalWidthCm`), we need to prepare the data so we can use it with sklearn.

Currently, all the data is encoded in a DataFrame, but sklearn doesn't work with pandas' DataFrames, so we need to extract the features and labels and convert them into numpy arrays instead.

Separating the labels is quite simple and can be done in one line using `np.asarray()`.
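A sketch of the conversion, again with a few stand-in rows where the tutorial's DataFrame would be:

```python
import numpy as np
import pandas as pd

# Stand-in rows; in the tutorial, df comes from pd.read_csv.
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "PetalWidthCm": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Features: the two selected columns as an (n_samples, 2) numpy array.
X = np.asarray(df[["PetalLengthCm", "PetalWidthCm"]])

# Labels: the species column, extracted in one line with np.asarray().
y = np.asarray(df["Species"])
```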

When considering our data, a **Random Forest** classifier stands out as a good starting point: Random Forests are simple, flexible in that they work well with a wide variety of data, and rarely overfit.

One notable downside to Random Forests is that they are non-deterministic in nature, so they don't necessarily produce the same results every time you train them.

While Random Forests are a good starting point, in practice, you will often use multiple classifiers, and see which ones get good results.

You can limit the guesswork over time by developing a sense for which algorithms generally do well on what problems; of course, doing a first principles analysis from the mathematical expression will help with this as well.

Now that we have chosen a classifier, it's time to implement it.

Implementing a classifier in sklearn follows three steps.

- Import (I usually Google this)
- Initialization (usually self-evident from the import statement)
- Training (or fitting)
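The three steps above can be sketched as follows. To keep the snippet self-contained it uses sklearn's built-in copy of the Iris data via `load_iris` rather than the csv-derived arrays, and the `random_state` and `test_size` values are illustrative choices, not the article's:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier  # step 1: import
from sklearn.model_selection import train_test_split

# Built-in Iris data so the sketch runs without the csv file.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = RandomForestClassifier(random_state=0)  # step 2: initialization
clf.fit(X_train, y_train)                     # step 3: training / fitting

print(clf.score(X_test, y_test))  # mean accuracy on held-out data
```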

Even though the test accuracy is 98%, it would be interesting to see what kinds of mistakes the model makes.

There are two ways a classification model can fail to predict the correct result: false positives and false negatives.

- A false positive is where something is guessed to be true when it's really false.
- A false negative is where something is guessed to be false when it's really true.

Since we are not running a binary classifier (one which predicts "yes" or "no"), but instead a classifier that predicts one of several labels, every mistake is simultaneously a false positive with respect to some labels and a false negative with respect to others.
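One common way to see those mistakes at a glance is a confusion matrix. This sketch again uses sklearn's built-in Iris data and illustrative split parameters rather than the article's exact setup:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Rows are true labels, columns are predictions; each off-diagonal
# cell is a false negative for its row's class and, at the same
# time, a false positive for its column's class.
cm = confusion_matrix(y_test, clf.predict(X_test))
print(cm)
```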

In machine learning, we often use precision and recall instead of false positives and false negatives.

Precision attempts to reduce false positives, whereas recall attempts to reduce false negatives. Both are numbers between 0 and 1, where higher is better.
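With more than two classes, precision and recall are computed per class and then averaged; the `average="macro"` choice below (which weights every class equally) is one option among several. The data and split parameters are again illustrative stand-ins using sklearn's built-in Iris copy:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Macro averaging: compute the score per class, then take the mean.
precision = precision_score(y_test, y_pred, average="macro")
recall = recall_score(y_test, y_pred, average="macro")
print(precision, recall)
```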

Currently, our Random Forests classifier just uses the default parameter values. However, for increased control, we can change some or all of the values.

One interesting parameter is `min_samples_split`. This parameter denotes the minimum number of samples a node must contain before it can be split further. Generally speaking, the lower it is, the more detail the model captures, but the likelihood of overfitting also increases. A higher value tends to capture the broad trends better while ignoring the little details.
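A quick sketch of the effect: the two values below (sklearn's default of 2 versus an arbitrarily chosen 40) are illustrative, and tree depth serves as a rough proxy for how much detail each forest captures.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# min_samples_split: minimum samples a node needs before it may split.
detailed = RandomForestClassifier(min_samples_split=2, random_state=0).fit(X, y)
coarse = RandomForestClassifier(min_samples_split=40, random_state=0).fit(X, y)

# Lower values grow deeper trees (more detail, higher overfitting risk);
# higher values keep the trees shallower and smoother.
print(detailed.estimators_[0].get_depth(), coarse.estimators_[0].get_depth())
```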

*To read the full article with source code, click here.*


Posted 29 March 2021

© 2021 TechTarget, Inc.