Home » Uncategorized

Machine Learning : Unsupervised – k-means Clustering and Bootstrapping

This article is in continuation to our previous topic ‘Unsupervised Machine Learning’. Today I’m giving you another powerful tool on this topic named ‘k means Clustering‘. The work in this article is on the continuation of the previous WHO data set featured in ‘Machine Learning: Unsupervised – Hierarchical Clustering and Bootstrapping’. This artifact demonstrates implementing k means clustering and bootstrapping to make sure that the algorithm and clusters formed stand true. The bootstrapping will also evaluate whether we have the right amount of clusters or not and if not what should be the ‘k’ i.e. number of clusters based on ‘Calinski-Harabasz Index’ and ‘Average Silhouette Width (ASW)’. Further, this report relates to how closely identical results are given by k-means with respect to the hierarchical clusters which were formed in the previous post. This is important because, no matter what different algorithms I use for computing clusters, the relation between, the data will remain the same. This indicates clusters should be formed more or less the same (with minimal trade-offs) irrespective of the algorithms used by an individual as data and relationship between the data does not change. Excited?

1. Content Briefing

Following are the contents covered in this article for k-means clustering

  1. k-means theoretical implementation
  2. Implementing k-means algorithm in R
  3. Addressing optimal ‘k’ value using Calinski-Harabasz Index and Average Silhouette Width Index
  4. Bootstrapping k-means clusters for its validation and confidence

2. The Model

5.1. Data

I have used R language to code for clustering on the World Health Organization (WHO) data, containing 35 countries limited to the region of ‘America’[3]. Following are the data fields:

  1. $Country: contains names of the countries of region America (datatype: string)
  2. $Population: contains population count of respective countries (datatype: integer)
  3. $Under15: count of population under 15 years of age (datatype: number)
  4. $Over60: count of population over 60 years of age (datatype: number)
  5.  $FertilityRate: fertility rate of the respective countries (datatype: number)
  6. $LifeExpectancy: Life expectancy of people of respective countries (datatype: integer)
  7. $CellularSubscribers: Count of people of respective countries possessing cellular subscriptions (datatype: number)
  8. $LiteracyRate: Rate of Literacy of respective countries (datatype: number)
  9. $GNI: Gross National Income of respective countries (datatype: number)

5.2. K Means Clustering

K means is another popular clustering algorithm where we try to categorize data based on forming clusters. K means to find it large applications in document classifications, delivery store optimizer, identifying crime localities, customer segmentation, etc. as all the clustering algorithms. So, the question arises how is k-means different? and why do we need it? K means is one of the simplest algorithms to implement when you have the data which is all numeric and the distance metric used in squared Euclidean. It is a high ad hoc algorithm to be used in clustering with only one major disadvantage, any guess?

5.3. Theoretical Implementation – K Means Clustering
  1. Select random number of ‘k’ cluster centers for your data. (this is the disadvantage of k-means clustering that we need to pre-define the number of clusters we need, can lead to unwanted results)
  2. For the selected centers we now assign all the data points which are nearest to it forming our clusters
  3. Now, for the clusters formed we compute the centers of the clusters
  4. Further, we start reassigning the data points as per the new cluster centers we just formed in step 3.
  5. Steps 3 and 4 are iterative and should be done until no further possibility of trading data points between clusters or you may have reached to maximum number of iterations that might be set.

All of the above steps for building a k means model will be understood clearly with the following figure illustration below:

Image illustration for k means clustering

5.4. Practical Implementation

It is finally time to implement k means clustering using R language. The function to run k means clustering in R is kmeans(). The function gives the cluster attributes that includes cluster labels, the cluster centers, the total sum of the square, total WSS (within the sum of squares) and total BSS. k-means does not have a stopping point that is unique, thus the possibility of k-means being fairly unstable is high as the fact that the final cluster is dependent on the initial cluster centers. To nullify this effect we can run k-means clusters several times with different random starts and based on results we can select the one with least WSS. And the best part? kmeans() function can manage this on its own unless otherwise stated it defaults to one random start.

My model for k-means clustering is built on the ‘WHO’ Health data limited to the American region containing 35 countries. This is the same data which was used by me for implementing Hierarchical Clustering. This will also help us understand whether using different algorithms on the same set of data produces the same clusters or different? Thinking logically if the data is the same which means the relationship between the data remains the same, indicating that the weights and distances of the data points in a plane also will be the same. So irrespective of what algorithm we use we should get the same set of clusters. Will we? We’ll figure out very soon. So get ready with your R-Studio and make sure you have the data downloaded from my website. Here is the data download link: https://talktomedata.files.wordpress.com/2019/05/who3.xlsx

5.5. R Algorithm for k-means

5.5.1 Reading and Scaling Data

First things first, I am loading the WHO3.csv file in the object ‘data’ using the function read.csv(). The arguments are ‘header = T’ stating to R that the file contains headers and since it is a CSV – comma separated values, meaning data values are separated with comma, I mention in the argument as ‘sep = “,”‘.

Moving ahead I have scaled the data, which is a really important aspect to get good clusters and without any flaws. I have scaled data to set mean = 0 and with a standard deviation of 1 for all the data columns except the first one which is ‘Country’. First, I create an object ‘scaling.variables’ for all the columns except the first as I mentioned. Going ahead, I am using scale() function to scale the columns isolated except the first column. The scale function gives two attributes ‘scaled: center’ and ‘scaled: scale’ which are saved into different objects, these objects can be handy if we need to remove the data scaling.

R listing for reading and scaling the data

Printing clusters is crucial and it is not a good idea to print them with all the columns of data as this may be confusing in finding the cluster relations. You have to try combinations of data columns to make sure you are printing the right amount of columns so that you can find a relation between the clusters. To print them I am forming a function named ‘clusters_print’. This will help me print the clusters formed by kmeans() limiting the columns to “Country”, “CellularSubscribers”, “LiteracyRate” and “GNI”. Below is the listing for the clusters_print function.

Listing for function to print clusters with limited columns

5.5.2 Clustering using kmeans()

In previous hierarchical clusters, the algorithm used a number of clusters k=5 was used. I am using the same k i.e. equal to 5 clusters for k-means as well. This will help me compare whether the clusters formed will be the same or different. Moving ahead I form the object ‘kclust’ in which I am saving the output of kmeans function. The k-means is formed using ‘scaled’ data for k number of clusters along with ‘nstart=100’ and ‘iter.max=100’. As I mentioned earlier that the k-means final output depends upon the initial centers which are selected. Thus, ‘nstart=100’ will start clustering with 100 different initial start points and having 100 iterations for each i.e. ‘iter.max=100’. This is very complex computation which is carried out on ease with R’s kmeans() function. The output of k-means stored in ‘kclust’ is used to extract the clusters as kclust$cluster and this is saved in an object ‘grouped’. This grouped object is used to print our clusters using the print_cluster function which we formed earlier. Below are the listing and the output showing k-means clusters.

Listing for k-means clusters and printing them


Following can be inferred from the clusters formed above:

  1. Here, clusters formed are somewhat in relation to the geographic mapping of the countries
  2. Cluster 1 contains countries with least average GNI all being in 4 figures and literacy rate averaging 80.89
  3. Cluster 2 is a good mix of countries with 4 and 5 digits GNI along with literacy rate averaging 92.18
  4. Cluster 3 has countries with highest literacy rate with minimum being 99%
  5. Cluster 4 too has countries with good literacy rates with minimum being 90% and GNI for all the countries in cluster 4 is in 5 figures

And guess what, these are more or less the same clusters with little trade-offs which were formed in Hierarchical Clusters from my previous blog: Machine Learning: Unsupervised – Hierarchical Clustering and Bootstrapping. Below is the clusters image which was formed in Hierarchical Clustering for comparison purposes.

Clusters formed in Hierarchical clustering (refer Hierarchical clustering blog)

5.5.3. Optimal ‘k’ for Clusters

Again, the question arises; how are we sure about having 5 clusters and not 5 or 3 or may just 2? How do we know we have selected an optimal number of k i.e. clusters for clustering? This can be figured out by ‘Calinski-Harabasz Index’ and ‘Average Silhouette Width’. These both concepts help us figure out what should be our optimal number of clusters i.e. k. I am going to give you a little theoretical idea about both the concepts in depth and detailed explanation is out of scope for this article. Calinski-Harabasz Index (ch)

The Calinski-Harabasz index of a clustering is the ratio of the between-cluster variance
(which is essentially the variance of all the cluster centroids from the dataset’s grand
centroid) to the total within-cluster variance (basically, the average WSS of the clusters
in the clustering). Average Silhouette Width (asw)

Measures how well an individual data point is clustered and further estimates average distance between clusters.  The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters. Computing ch and asw

kmeansruns() another function in ‘fpc’ package of R can be used to compute Calinski-Harabasz Index (ch) and Average Silhouette Width (asw) automatically. This runs k-means over a range of k and then estimates best k. It further returns the best value of k and output of kmeans for the respective k value. Using kmeansruns() on the scaled data for ‘krange=1:10’ i.e. all values of k from 1 to 10 and criterion = ‘ch’ and ‘asw’ respectively. The kmeansruns() gives the best k value as one of its output that can be extracted as $bestk. Below are the R listing and best value of k obtained.

Listing to compute ch and asw using kmeansruns and extracting bestk

Best k values obtained from ch and asw

The Calinski-Harabasz index gives the best k-value as 2 and Average Silhouette Width gives the best k value as 3. Thus, 2 or 3 is our optimal number of ‘k’ for this data set. Now, you can question me why did I use k = 5? This is simply because in my last blog of Hierarchical Clustering k was 5 so I used the same k here too, to illustrate how close we get the same clusters irrespective of the clustering algorithm that we deploy.

5.5.4. Visual Examination of ‘ch’ and ‘asw’ indexes

As per the listing for computing ‘asw’ and ‘ch’, I have used krange = 1:10. I am now going to compute both the indexes for a range of 1 to 10 against the ‘score’ of individual indexes. The index with the highest score is returned as best ‘k’. I have computed a data-frame that combines criterion values one of the outputs of kmeasnruns() function. Further, I reshape this data frame using the melt() function of R in ‘reshape2’ library. Further using ‘ggplot2’ library I have plotted the data frame. The listing and output are shown below.

Listing for forming data-frame of combined asw and ch, later plotting them

The output for range of indexes from 1-10 of ‘ch’ and ‘asw’

The plot for both ch and asw can be seen in the above figure for a range of k value from 1 to 10. ch (red) peaks at maximum score at 2 thus returning us best k value as 2 and asw peaks best score at index value 3 thus returning us with best k value as 3.

5.5.5 Bootstrapping of k-means

Bootstrapping is the litmus test that our model needs to validate positively so that we can rely on our model and can bet upon its findings and maybe if required test them for unknown data points to predict clusters. Clusterboot() function in ‘fpc’ package does the bootstrapping by re-sampling to evaluate how stable our clusters are. It works on Jaccard co-efficient a similarity measure between sets. Jaccard coefficient values should be greater than 0.5 for all our clusters to make sure our clusters are best formed. For an in-depth explanation and understanding of bootstrapping theory refer to my blog on Hierarchical Clusters.

For the test again I am selecting k = 5 as bestk to evaluate previously formed clusters. Further, using cluster boot on scaled data with arguments as clustermethod = kmeansCBI to indicate kmeans cluster-boot, running it i.e. booting it 100 times and maximum iterations for each booting to be 100. All this to be done for krange of my best k selected i.e. 5 clusters. the seed is set equal to random number 9898 for result reproducibility. We can extract clusters formed by bootstrap in $result$partition of the bootstrap model formed. These clusters are taken in to object ‘groups’ and clusters are printed using the cluster_print function that we formed earlier. Below are the listings and the clusters formed over evaluation of 100 runs and 100 iterations respectively.

Listing for cluster-boot of k-means and printing them

Bootstrap clusters

As we can see above the clusters formed initially in k-means are same as the clusters formed here in bootstraps. This shows that we have fairly stable clusters that can be trusted and deployed in practical use as they pass the test of 100 booting and 100 iterations.

Further, our results obtained in bootstrapping can be rectified by seeing each cluster’s maximum Jaccard Co-efficient which should be greater than 0.5 value indicating stability and number of time on 100 did each cluster dissolved proving its stability. Below is the combined listing and output for Jaccard coefficient and the number of times a cluster dissolved.

Maximum Jaccard Co-efficient and each cluster dissolution (on 100)

Here, we got all the Jaccard coefficient of all the clusters above 0.6 and a good number for cluster dissolving too in 100 times. These values just give us another layer of confidence on our clusters via bootstrapping.

3. Conclusion

  1. k-means is one the best and simplest ways to compute unsupervised machine learning with clustering
  2. Always remember to scale your data preferably with mean 0 and stdv. of 1 for all the numeric data columns for getting clean results
  3. If required perform ch and asw indexing before computing k-means if you are not sure about what should be your optimal ‘k’ value for k-means clustering
  4. Always bootstrap your model by trying to produce similar results as original to be confident about your model and findings

7. Clustering Takeaways:

  1. The goal is simply to draw out similarities by forming a subset of your data.
    Good clusters indicate the data points in one cluster are similar and dissimilar to data points in another cluster
  2. Units of data field matter, different units in different respective data fields produce potentially different results, it is always a good idea to have a clear sense of data and data units before clustering
  3. You will always like the scenario of unit change of output change across categories to represent the same unit change (consistency). Thus, it becomes important that you scale your data as I have scaled with mean set 0 and standard deviation of 1.
  4. Clustering is always deployed as a data exploration algorithm and can act as a precursor to supervised learning
  5. Can be considered similar to visualization, with more iterations and interactions you can understand your data better unlike automated machine learning processes like supervised learning
  6. Many theories to attain best k value for your clusters, the best are indicated here and others too can be explored easily, but it is always wise to choose the best as per the situation

To read the original post, click here