Subscribe to DSC Newsletter

How to Automatically Determine the Number of Clusters in your Data - and more

Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well separated clusters, and two human beings asked to visually tell the number of clusters by looking at a chart, are likely to provide two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, making a decision not easy.

For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty, not AI, not a human being, not an algorithm. 

How many clusters here? (source: see here)

In the above picture, the underlying data suggests that there are three main clusters. But an answer such as 6 or 7, seems equally valid. 

A number of empirical approaches have been used to determine the number of clusters in a data set. They usually fit into two categories:

  • Model fitting techniques: an example is using a mixture model to fit with your data, and determine the optimum number of components; or use density estimation techniques, and test for the number of modes (see here.) Sometimes, the fit is compared with that of a model where observations are uniformly distributed on the entire support domain, thus with no cluster; you may have to estimate the support domain in question, and assume that it is not  made of disjoint sub-domains; in many cases, the convex hull of your data set, as an estimate of the support domain, is good enough. 
  • Visual techniques: for instance, the silhouette or elbow rule (very popular.)

In both cases, you need a criterion to determine the optimum number of clusters. In the case of the elbow rule, one typically uses the percentage of unexplained variance. This number is 100% with zero cluster, and it decreases (initially sharply, then more modestly) as you increase the number of clusters in your model. When each point constitutes a cluster, this number drops to 0.  Somewhere in between, the curve that displays your criterion, exhibits an elbow (see picture below), and that elbow determines the number of clusters. For instance, in the chart below, the optimum number of clusters is 4.

The elbow rule tells you that here, your data set has 4 clusters (elbow strength in red)

A good reference on the topic is this article. Some R functions are available too, for instance fviz_nbclust. However, I could not find in the literature, how the elbow point is explicitly computed. Most references mention that it is mostly hand-picked by visual inspection, or based on some predetermined but arbitrary threshold. In the next section, we solve this problem.

1. Automating the elbow rule

This is another example showing how data science can automate some tasks performed by statisticians, in this case in the context of exploratory data analysis. The solution is actually pretty simple, and applies to many problems not even related to clustering, that we will discus in the next section. Also, it is not limited to using the percentage of unexplained variance (Y- axis) to plot the elbow curve, but other criteria such as entropy, or error resulting from model fitting, work equally well. Indeed the solution provided here was designed to be integrated in black-box decision systems. The only requirement is that the elbow curve most be positive (above the X-axis) and decreasing. 

Elbow strength (with spreadsheet illustration)

The formula to compute the elbow strength (the core concept in this article) is illustrated using the table below (corresponding to to the figure in section 1) and available in our interactive spreadsheet (download the spreadsheet here).
The Delta 1 column in the table represents the differences between k and k + 1 clusters, measured on the criterion metric (the second column.) Delta 2 represents the difference computed on Delta 1, that is, the second-order differences. The strength (rightmost column) at line k (k is the number of clusters) is computed as the difference between Delta 2 and Delta 1, at line k +1. It is shown only if it is positive. 

Number of clusters

The optimum number of clusters is the value of k that maximizes the strength. That's it!

The strength tells, for a specific k, if there is a potential elbow at level k (corresponding to k clusters), and how strong the elbow signal is at that level. Sometimes the strongest signal is not the first one, though this is usually the case in many instances. Below is an example where this is not the case.

The above picture exhibits a situation where the data could conceivably have 2 or 3 clusters. However, the assumption of 3 clusters (instead of 2) is much more plausible, based on the height of the red bars. Rather than using the strength of the elbow, I actually used the relative strength: it is the strength, divided by k (the number of clusters). The relative strength dampens the strength of the elbow for large values of k, as these are usually less meaningful. 

Testing 

Three types of tests are worth doing to further assess the value of this method.

  • Test with various elbow curves: we created curves, with multiple elbows or barely any elbow, to check the results produced by our procedure. We did not find counter-examples. Some of the test curves are included in our spreadsheet. Interestingly, if the shape of the elbow curve is like 1 / k, then two clusters are detected, which conforms to intuition. If is is decreasing at a much smaller pace, then the curve is too smooth to produce red bars, and no elbow is detected. This also conforms to intuition.
  • Test on real data: these tests can be more difficult to interpret, since in many cases, nobody can tell the number of clusters, unless clusters are well separated, or known a-priori.
  • Test with simulated data: it is easy to generate data with a known number of clusters, see here. Then one can use a criterion, such as percentage of unexplained variance, and look at the elbow curve, to check when it correctly predicts the number of clusters. Below is an example of simulated clusters.

Simulated cluster structure to test the elbow rule (see here for source code)

Stopping rule for clustering algorithms

One open question is how the methodology performs when the data has more than two dimensions. The issue is not with the elbow curve itself, but with the criterion being used. Finally, when large clusters are found in a data set (especially with hierarchical clustering algorithms) it is a good idea to apply the elbow rule to any big cluster (split the big cluster into smaller clusters), in addition to the whole data set. In practice, once you hit the first red bar (or if there is another red bar just after the first one, and bigger than the first one), you can stop refining and splitting your clusters: you have reached an empirical optimum. 

2. Patenting the concept, other applications

The elbow rule can be used in various applications, not just to detect the number of clusters. We used it to detect how many decimals are correctly computed when using high precision computing libraries in Perl and Python, for a specific problem. You can check it out in my book on applied stochastic processes (available here) page 48. I also discuss the elbow rule in my optimum data binning procedure, see here. In time series, the elbow is sometimes referred to as a change point, signaling a change in the slope, and the elbow method can be used to identify these change points. 

In fact, the elbow method can be used in any algorithm that has a stopping rule, where the criterion used to measure performance improvement at each new iteration, is a positive decreasing function. In particular, it can be used to detect how deep a decision tree should be (when to stop splitting nodes), or in numerical algorithms to detect when the accuracy level reached is good enough, and no longer steadily improving when adding more iterations.

After searching the literature for the formula used in the elbow rule, and not finding anything, I decided to look whether some companies had filed patents on this topic. Interestingly, I found the following ones, among others:

Yet none of them mention the elbow rule, even though Microsoft's patent has a section on Automatic Determination of the Number of Hidden Clusters. Surprisingly, it has also a section on collaborative filtering. The patent was filed in 1998 and issued in 2004. 

Chances are that you might still be able to file a patent about my elbow rule methodology, though it is now public domain because of this very article (and the formula was mentioned first in one of my previous articles in 2011, here.) If a company is interested in patenting the concept with me, I can be reached at [email protected] I have been granted patents pertaining to data science in the past, see here. I am currently working on a model-free, data-driven methodology to determine the optimum sample size with great accuracy. To not miss my next article on this topic, sign-up here to receive our newsletter. For inquiries about patenting the concept discussed in this article, or any other one, including my upcoming article about sampling, email me.

To not miss this type of content in the future, subscribe to our newsletter. For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on on LinkedIn, or visit my old web page here.

DSC Resources

Follow us: Twitter | Facebook

Views: 19471

Comment

You need to be a member of Data Science Central to add comments!

Join Data Science Central

Comment by Jnanendra Sarkar on March 29, 2019 at 11:34am

Nice article. I have studied a lot of research papers regarding clustering. There is no magic algorithm found yet to get the actual number of clusters as it depends on many factors as below (not limited).

1) In any unsupervised method, there is no thumb rule to start with. Hence, we start with a random selection of cluster centres. However, there are many strategies, but no fixed strategy. Therefore, the initial selection of the centre is an important factor.

2) An important selection of features is also an important factor.  Not all available features within datasets contribute to having good clustering. Moreover, there can be some features missing - which are important contributors. 

3) Most importantly the nature of different data set in the universe is not the same.  Most of the cases, they are also not having well-separated clusters, because of uncertainty, overlapping within the dataset. To handle such issue, there are many types of research and a lot of techniques in the area of Fuzzy, Rough, Possibilistic etc. Fuzzy also has a various level (type 1, 2....n) to handle various nature of data. Each of them has some limitation too. This means, there is no fixed strategy to handle all kinds of data as there nature and hidden uncertainty, overlapping properties are also different.

4) Selection of distance metric is another important factor due to the different nature of the data set. On this regard, a couple of years ago, I published a journal research paper where I experimented with various distance metrics and the impact on the outcome. Due to that research, I went through many already published literature (referred in the bibliography) too where similar research was conducted. My focus was on the categorical dataset.

Therefore, as of today still, people are doing research. However, as a practice, ensemble techniques and evolutionary methods are used. However, those are time and resource consuming.

Comment by Klaus Wassermann on March 14, 2019 at 9:32pm

Hi

It is amazing that still today there is this basic existential and Laplacian misconception around that there is sth like an appriori optimal number of clusters. Yeah, well in text book examples or in a highly restricted scientific setup it seems to be applicable. Yet, in real world examples there is always a purpose, and any given data set can be "analyzed" serving different purposes. To neglect this and instead inisisting on "optimal number of clusters" is deeply reductionist in the wrong way. Once you set the purpose you also implicitly have set your risk preferences, i.e. the error tolerance, which in turn is given by the implied costs. And from here it is easy to find the "optimal number of clusters": It simply does not matter. It depends on the tool and on the clustering algorithm. Ultimately however, only 2 questions and their answers are relevant: What is the predictive power of the model, and how robust is it?

I readily admit that there in the context of basic research, where there seems to be no "purpose" other than understanding, one may feel inclined to search for the number of clusters in the data. For instance to get an impression of "what is in the data" But this is a fallacy. Even in this case there is no Leibnizian pre-established harmony that would justify the search of an "optimal number of clusters", i.e. determining the "objective" classification that would exist independent of the observer-analyst. Unfortunately there is not enough space here to discuss this in detail. But you may check the decades old discussion in the philosophy of science.

Comment by Eric A. King on March 14, 2019 at 8:19am

Correct # of Clusters ≠ Maximum elbow strength or any other artificial technical metric.

Correct # of Clusters = The (functionally reasonable) number that the client / user / process can manage. 

Correct # of Clusters = As defined in the project definition after interviewing those who contribute to- or benefit from- the analytic function.

Correct # of Clusters = Maximum benefit when technical metrics are translated to organizational impact metrics.

Comment by Jeff Altman on March 14, 2019 at 8:12am

Comment by Jeff Altman just nowDelete Comment

Hello - I found the following and have used it with excellent results: https://github.com/arvkevi/kneed and https://pypi.org/project/kneed/

Python Code example:

cluster_range = range(1, 10) #test for cluster sizes 1 to 10
cluster_errors = [] #create array to hold errors

for num_clusters in cluster_range:
clusters = KMeans( num_clusters )
clusters.fit(X_scaled)
cluster_errors.append( clusters.inertia_ )

clusters_df = pd.DataFrame({"cluster_errors": cluster_errors, "num_clusters": cluster_range})
clusters_df[0:10]


#programmatically determine "elbow" (optimum number of clusters) using Kevin Arvai's optimal number of clusters algorithm

from kneed import KneeLocator
elbow = KneeLocator(clusters_df.num_clusters.values, clusters_df.cluster_errors.values, S=1.0, curve='convex', direction='decreasing')
print('create a K-means cluster with ' + str(elbow.knee) + ' clusters')

create a K-means cluster with 2 clusters

Comment by Nikhil Chandra Sarkar on March 14, 2019 at 2:58am
Epsilon disk with minimum points can extract the potential clusters in 3 dimensional data.

Videos

  • Add Videos
  • View All

© 2019   Data Science Central ®   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service