In this post, we’ll use an unsupervised machine learning technique called k-means clustering to find natural structures in our data. In earlier blog posts, we used supervised machine learning techniques like logistic regression and linear regression to predict car prices or delayed flights.
Oftentimes, we don’t even know what the structure of our underlying data is telling us.
Let’s load in our small sample set here and see the first 5 rows of data:
## Income Lot_Size Ownership
## 1 60 18.4 owner
## 2 85.5 16.8 owner
## 3 64.8 21.6 owner
## 4 61.5 20.8 owner
## 5 87 23.6 owner
The variables collected are Income, Lot Size, and whether the household owns a riding mower (Ownership).
Our data set is evenly split between households that own a riding mower and those that don’t. This is all good, but perhaps there are more subtle natural groupings that we are not aware of.
The next step in predictive analytics is to explore our underlying data. Let’s do a scatter plot of Income vs. Lot Size.
From this, it’s hard to see any natural groupings of owners or non-owners of riding mowers by Income and Lot Size.
One of the main steps in predictive analytics is data transformation. Data is never in the form you want it. We often have to transform it to get it into the shape we need, because the data is dirty, of the wrong type, out of bounds, or for a host of other reasons.
Fortunately, for this data set we will NOT be doing any data transformation or pre-processing. Note: This is not common. We usually have to do a lot of data transformation and pre-processing before we can use a data set. This is just a small, clean data set for expository ease.
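We will use an algorithm called k-means to look for natural clusters in our data set. Note that k-means does not pick the number of clusters for us; we have to supply k and then judge how well it fits.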
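To make the idea concrete, here is a minimal sketch of the k-means loop in plain Python. It is not the code behind the output in this post: it only uses the five rows printed above, and it seeds the centers with the first k rows so the run is repeatable (real implementations usually pick random starting points). The function name `kmeans` and the variable names are my own.

```python
import math

# First five rows printed above (Income, Lot_Size); the full data set is not reproduced here.
rows = [(60.0, 18.4), (85.5, 16.8), (64.8, 21.6), (61.5, 20.8), (87.0, 23.6)]


def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style k-means. Returns (labels, centers)."""
    # Simplification: start from the first k points instead of random ones.
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # leave an empty cluster's center alone
        # Stop once neither the assignments nor the centers change.
        if new_labels == labels and new_centers == centers:
            break
        labels, centers = new_labels, new_centers
    return labels, centers


labels, centers = kmeans(rows, k=2)
print(labels)
print(centers)
```

With k=2 on these five rows, the two higher-income households (rows 2 and 5) end up in one cluster and the other three in the second. Labels here start at 0, while the output in this post numbers clusters from 1.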
Let’s take an initial “guess” of 3 clusters to describe our data set:
## K-means clustering with 3 clusters of sizes 9, 8, 7
## Cluster means:
## Income Lot_Size
## 1 64.83333 18.53333
## 2 91.42500 19.75000
## 3 46.80000 18.57143
## Clustering vector:
##  1 2 1 1 2 2 2 2 1 2 3 2 1 3 1 3 2 3 1 1 3 3 3 1
## Within cluster sum of squares by cluster:
##  229.4400 960.1150 327.4743
## (between_SS / total_SS = 83.4 %)
## Available components:
##  "cluster" "centers" "totss" "withinss"
##  "tot.withinss" "betweenss" "size" "iter"
##  "ifault"
This gives us a wealth of information. Our clusters have sizes 9, 8, and 7 respectively.
We can see that in the clustering vector: there are nine 1s, eight 2s, and seven 3s. Clustering vector:
##  1 2 1 1 2 2 2 2 1 2 3 2 1 3 1 3 2 3 1 1 3 3 3 1
Additionally, the mean Income for Cluster 1 is about 64.8K with a mean Lot Size of about 18.5K sq ft, and so forth. Keep in mind that the cluster numbers are arbitrary labels; rerunning k-means can number the same groups differently.
## Income Lot_Size
## 1 64.83333 18.53333
## 2 91.42500 19.75000
## 3 46.80000 18.57143
There are a number of other interesting statistics included in that summary.
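A few of those statistics can be re-derived by hand. The sketch below is plain Python and only uses numbers copied from the printed 3-cluster summary (cluster numbers are arbitrary labels, so another run may number the same groups differently). It relies on the identity that the total sum of squares equals the within plus the between sums of squares; the implied totals are approximate because the 83.4 % in the printout is already rounded.

```python
from collections import Counter

# Values copied from the printed 3-cluster summary.
clustering_vector = [1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 3, 2,
                     1, 3, 1, 3, 2, 3, 1, 1, 3, 3, 3, 1]
withinss = [229.4400, 960.1150, 327.4743]
between_over_total = 0.834  # printed as 83.4 %, so already rounded

# Cluster sizes are just label counts in the clustering vector.
sizes = Counter(clustering_vector)
print(sorted(sizes.items()))

# tot.withinss is the sum of the per-cluster within sums of squares.
tot_withinss = sum(withinss)
print(round(tot_withinss, 4))

# totss = tot.withinss + betweenss, so betweenss / totss = 1 - tot.withinss / totss.
# Rearranging gives an approximate totss and betweenss.
approx_totss = tot_withinss / (1 - between_over_total)
approx_betweenss = approx_totss - tot_withinss
print(round(approx_totss), round(approx_betweenss))
```

The higher the between share, the more of the overall spread is explained by which cluster a household falls in rather than by variation inside the clusters.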
Rather than guessing the optimal number of clusters, let’s do something more systematic. Let’s run k-means with different values of k to see which one is optimal. Here we run from k=1 to k=6 (1 cluster to 6 clusters) to see which one is a good fit for our data.
As visually nice as that was, let’s plot the within-cluster variances (sums of squares) to see how they behave. The way to think about this is that the lower the within sum of squares of a cluster, the more homogeneous it is. We want the elements within our clusters to be similar (homogeneous, low variance, low sums of squared differences). The within sum of squares generally keeps shrinking as we add clusters, so we look for the point where the curve flattens out: by 4 clusters most of the drop has already happened, and the fifth and sixth clusters add very little incremental value. Diminishing returns.
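Here is a rough sketch of that "try several k and compare within sums of squares" loop in plain Python. It is an illustration under assumptions, not the code behind the plot: it uses only the five rows printed at the top of this post, a simplified k-means seeded with the first k rows, and a helper name (`total_withinss`) of my own, so the numbers will not match the plot for the full data set.

```python
import math

# First five rows printed at the top of the post (Income, Lot_Size).
rows = [(60.0, 18.4), (85.5, 16.8), (64.8, 21.6), (61.5, 20.8), (87.0, 23.6)]


def nearest(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))


def total_withinss(points, k, max_iter=100):
    """Run a simple k-means and return the total within-cluster sum of squares."""
    centers = list(points[:k])  # simplification: seed with the first k points
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    # Squared distance of every point to its own cluster center, summed up.
    labels = [nearest(p, centers) for p in points]
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


# One total within sum of squares per candidate k.
wss = {k: total_withinss(rows, k) for k in range(1, len(rows) + 1)}
for k, value in wss.items():
    print(k, round(value, 2))
```

Even on five rows the pattern shows up: the total drops sharply from k=1 to k=2 and reaches zero once every row is its own cluster, which is why the lowest value alone is not a useful target and we look for the bend in the curve instead.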
Now that we’ve determined that 4 clusters is the optimal number, let’s zoom in on the 4-cluster model to see it more clearly. Note that each number in the plot is a household identifier.
Here’s an alternative visualization of the 4 clusters above. It shows the 4 distinct clusters clearly. The two plotted dimensions account for all (100%) of the variability in the data set, which is expected rather than remarkable: the data set only has two numeric variables to begin with.
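The 100% figure is easier to read with a little arithmetic. Assuming the plot projects the data onto principal components (as R’s `clusplot` does), the variance along the components is given by the eigenvalues of the covariance matrix, and with only two variables there are only two of them. The sketch below checks this in plain Python on the five rows printed at the top of the post; it is an illustration, not the computation behind the plot.

```python
import math

# First five rows printed at the top of the post (Income, Lot_Size).
rows = [(60.0, 18.4), (85.5, 16.8), (64.8, 21.6), (61.5, 20.8), (87.0, 23.6)]

n = len(rows)
mean_x = sum(x for x, _ in rows) / n
mean_y = sum(y for _, y in rows) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
syy = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix:
# these are the variances along the two principal components.
half_trace = (sxx + syy) / 2
spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
eigenvalues = (half_trace + spread, half_trace - spread)

# Share of total variance captured by both components, and by the first alone.
explained_by_both = sum(eigenvalues) / (sxx + syy)
explained_by_first = eigenvalues[0] / (sxx + syy)
print(explained_by_both, explained_by_first)
```

Two components on two variables always add up to the full variance, so the 100% only becomes informative once a data set has more variables than the plot has axes.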
What are some applications of this information? In marketing analytics, people often use these clusters as a form of customer segmentation. The number of variables that define a cluster can be more than the two used in this demonstration. For example, we might want to segment our customers by Age, Geography, Income, Spending, Ethnicity, Gender, etc. Through cluster analysis, we can segment our customers into groups that we can target for selling, upselling, cross-selling, etc.
This is just one of many examples of the use of this unsupervised machine learning technique called k-means clustering.
Hope you enjoyed this and are excited about applying data science models to your problem space.
In follow-on posts, I’ll explain in further detail the theories behind these methods and the differences and similarities between them.