
In this post, we’ll use an unsupervised machine learning technique called k-means clustering to find natural structures in our data. In earlier blog posts, we used supervised machine learning techniques like logistic regression and linear regression to predict car prices or delayed flights.

Often, we don’t even know what structure our underlying data contains.

Let’s load in our small sample set here and see the first 5 rows of data:

##   Income Lot_Size Ownership
## 1   60.0     18.4     owner
## 2   85.5     16.8     owner
## 3   64.8     21.6     owner
## 4   61.5     20.8     owner
## 5   87.0     23.6     owner
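The post’s data-loading code isn’t shown (the output above appears to come from R). As a language-neutral sketch, here is how the same sample could be loaded with Python’s standard library; the inlined CSV mirrors the five rows above, and any file name you’d use in practice (e.g. `"RidingMowers.csv"`) is hypothetical:

```python
import csv
from io import StringIO

# Inlined sample matching the five rows shown above; in practice this
# would come from a file, e.g. open("RidingMowers.csv") (name hypothetical).
raw = """Income,Lot_Size,Ownership
60.0,18.4,owner
85.5,16.8,owner
64.8,21.6,owner
61.5,20.8,owner
87.0,23.6,owner
"""

# Parse each row, converting the numeric columns to floats.
rows = [
    {"Income": float(r["Income"]),
     "Lot_Size": float(r["Lot_Size"]),
     "Ownership": r["Ownership"]}
    for r in csv.DictReader(StringIO(raw))
]
print(rows[0])  # → {'Income': 60.0, 'Lot_Size': 18.4, 'Ownership': 'owner'}
```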

We see the collected variables are Income, Lot Size, and whether that household owns a riding mower or not.

We see that our data set is evenly split between those who own a riding mower and those who don’t. This is all good, but perhaps there are more subtle natural groupings that we are not aware of.

The next step in predictive analytics is to explore our underlying data. Let’s do a scatter plot of Income vs. Lot Size.

From this it’s hard to see any natural groupings of owners or non-owners of riding mowers by Income and Lot Size.

One of the main steps in predictive analytics is data transformation. Data rarely arrives in the form you want. We often have to transform it before we can use it, because it is dirty, of the wrong type, out of bounds, or for a host of other reasons.

Fortunately, for this data set we will not be doing any data transformation or pre-processing. Note: this is not typical. We usually have to do many transformation and pre-processing steps before the data is usable. This is just a small, clean data set chosen for expository ease.
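One reason pre-processing usually matters for clustering in particular: k-means is distance-based, so variables on very different scales would normally be standardized first. A minimal z-score sketch (not part of this post’s workflow, which skips the step):

```python
def zscore(column):
    """Standardize a numeric column to mean 0 and sample standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    sd = (sum((x - mean) ** 2 for x in column) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in column]

# Example: the Income column from the sample rows above.
print(zscore([60.0, 85.5, 64.8, 61.5, 87.0]))
```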

We will use an algorithm called k-means to find the number of natural clusters in our data set.

Let’s take an initial “guess” of 3 clusters to describe our data set:
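The output below comes from a k-means routine (in R, a call like `kmeans(df, centers = 3)` would print output in this shape). As an illustrative sketch of what such a routine does internally, here is a minimal pure-Python version of Lloyd’s algorithm; the data points are hypothetical, not the post’s actual 24 households:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal sketch of Lloyd's algorithm: repeatedly (1) assign each point
    to its nearest centroid, then (2) move each centroid to the mean of its
    assigned points, until the assignment stops changing."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def nearest(pt):
        # Index of the centroid with the smallest squared distance to pt.
        return min(range(k), key=lambda j: sum(
            (a - b) ** 2 for a, b in zip(pt, centroids[j])))

    assign = None
    for _ in range(iters):
        new_assign = [nearest(pt) for pt in points]
        if new_assign == assign:   # converged: assignments are stable
            break
        assign = new_assign
        for j in range(k):
            members = [pt for pt, a in zip(points, assign) if a == j]
            if members:            # leave an empty cluster's centroid as-is
                centroids[j] = [sum(d) / len(members) for d in zip(*members)]
    return assign, centroids

# Hypothetical (Income, Lot_Size) pairs for illustration only:
data = [(60.0, 18.4), (85.5, 16.8), (64.8, 21.6), (61.5, 20.8),
        (87.0, 23.6), (49.2, 17.6), (110.1, 20.0), (45.0, 16.4)]
labels, centers = kmeans(data, k=3)
```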

## K-means clustering with 3 clusters of sizes 9, 8, 7
##
## Cluster means:
##     Income Lot_Size
## 1 64.83333 18.53333
## 2 91.42500 19.75000
## 3 46.80000 18.57143
##
## Clustering vector:
##  [1] 1 2 1 1 2 2 2 2 1 2 3 2 1 3 1 3 2 3 1 1 3 3 3 1
##
## Within cluster sum of squares by cluster:
## [1] 229.4400 960.1150 327.4743
## (between_SS / total_SS = 83.4 %)
##
## Available components:
##
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"         "iter"
## [9] "ifault"


This gives us a wealth of information. Our clusters have sizes 9, 8, and 7 respectively. We can see that in the clustering vector, which assigns each of the 24 households to a cluster: there are nine 1s, eight 2s, and seven 3s.

## [1] 1 2 1 1 2 2 2 2 1 2 3 2 1 3 1 3 2 3 1 1 3 3 3 1

Additionally, the mean Income for Cluster 1 is about $64.8K, with a mean Lot Size of about 18.5K sq ft. And so forth.

Cluster means:

##     Income Lot_Size
## 1 64.83333 18.53333
## 2 91.42500 19.75000
## 3 46.80000 18.57143

There are a number of other interesting statistics included in that summary.

Rather than guessing the optimal number of clusters, let’s do something more systematic: run through different values of k to see which one fits best. Here we run from k = 1 to k = 6 (1 cluster to 6 clusters) to see which is a good fit for our data.
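The post’s loop isn’t shown, so here is a self-contained sketch of that computation: fit a minimal Lloyd’s-style k-means for each k and record the total within-cluster sum of squares (the quantity an elbow chart plots). The data points are hypothetical stand-ins for the post’s 24 households:

```python
import random

def total_withinss(points, k, iters=100, seed=0):
    """Fit a minimal Lloyd's-style k-means, then return the total
    within-cluster sum of squares for the resulting partition."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: sum(
            (a - b) ** 2 for a, b in zip(pt, centroids[j]))) for pt in points]
        if new_assign == assign:
            break
        assign = new_assign
        for j in range(k):
            members = [pt for pt, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = [sum(d) / len(members) for d in zip(*members)]
    # Sum of squared distances from each point to its cluster's centroid.
    return sum(sum((a - b) ** 2 for a, b in zip(pt, centroids[c]))
               for pt, c in zip(points, assign))

# Hypothetical data for illustration only:
data = [(60.0, 18.4), (85.5, 16.8), (64.8, 21.6), (61.5, 20.8),
        (87.0, 23.6), (49.2, 17.6), (110.1, 20.0), (45.0, 16.4)]
wss = {k: total_withinss(data, k) for k in range(1, 7)}
```

Plotting `wss` against k yields the elbow chart discussed next: the curve drops steeply at first, then flattens once extra clusters stop paying for themselves.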

As visually nice as that was, let’s plot the within-cluster variances (sums of squares) to see how they behave. The way to think about this: the lower a cluster’s within sum of squares, the more homogeneous it is, and we want the elements within our clusters to be similar (homogeneous, with low variance and low sums of squared differences). We see that at 4 clusters the within sum of squares is about as low as it gets; the fifth and sixth clusters add very little incremental value. Diminishing returns.

Now that we’ve determined that 4 clusters is the optimal number, let’s zoom in on the 4-cluster model to see it more clearly. Note that the numbers in the plot are household identifiers.

Here’s an alternative visualization of the 4 clusters above. It shows the 4 distinct clusters nicely, and conveniently these two variables explain all (100%) of the variability in the data set.

What are some applications of this information? In marketing analytics, people often use these clusters as a form of customer segmentation. The number of variables that define a cluster might be more than two, as in this demonstrative case. For example, we might want to segment our customers by Age, Geography, Income, Spending, Ethnicity, Gender, etc. Through clustering analyses, we can segment our customers into groups that we can target for selling, upselling, cross-selling, and so on.

This is just one of many examples of the use of this unsupervised machine learning technique called k-means clustering.

Hope you enjoyed this and are excited to apply data science models to your problem space.

In follow-on blog posts, I’ll explain in further detail the theory behind these methods and the differences and similarities between them.


Posted 10 May 2021
