K-means algorithm is a popular and efficient approach for clustering and classification of data. My first introduction to K-means algorithm was when I was conducting research on image compression. In this applications, the purpose of clustering was to provide the ability to represent a group of objects or vectors by only one object/vector with an acceptable loss of information. More specifically, a clustering process in which the centroid of the cluster was optimum for the cluster and the clusters were optimum with respect to the centroids.
The dimensionality of the vectors ranged from 4 to 70 and even higher in which each cluster of N-dimensional vectors was to be represented by a single vector while minimizing a certain fidelity criterion or loss of information. The design process consisted of using a training set of vectors and to apply the results for outside the training set vectors.
Here, naturally, a couple of questions arises. First, what kind of cost function, fidelity criterion or distortion measure should be used to represent the penalty that is to be paid by our clustering process and its representation by only a single vector.
Second, how many clusters should there be and how to assign the samples to each cluster. The number of clusters to be chosen is a difficult task. Of course, if you have as many clusters as there are samples or vectors, then you have achieved this minimization. But, then, that is not considered clustering of the data. In any practical application, you have to resort to representing all the samples/vectors with a few samples/vectors and therefore create a set of cluster of vectors.
Let's discuss each of these questions.
Which Cost function, Distortion Measure or Fidelity Criterion?
The answer to the question would depend on your application. A widely used measure is the mean square error also referred to as L2-norm. The L2-norm is commonly used most applications including signal processing, image compression, and video compression. There are applications that use absolute error difference, the L1-norm, also known as Manhattan distance. There are many other error measures such as the one used for speech that uses a weighted distortion measure called the Itakura-Saito measure. The distortion measure indicates the penalty to be paid for representing the members of the clusters by a single vector. As a result, the centroid of the cluster, that represents the cluster itself, is the generalized center of gravity of the cluster. The center of gravity of the cluster refers to the mean value of all the vectors in the cluster when mean-square-error is used as the distortion measure. To generalize this to other distortion measures, we use the term generalized-center-of-gravity to represent the centroid of the cluster for other distortion measures.
The second question relates to how many clusters and how to assign samples to clusters? The answer to the second part of the question is easy since the assignment of vectors to clusters follows directly from the cost function used in your system. The number of clusters, however, is an important one and would be discussed later.
Now, how do we decide on the clusters and their representation? We may answer this question by using an optimization by iteration technique.
More specifically, in this approach, we continuously optimize
a) the clusters for the given centroids, and
b) the centroids for the clusters,
until an optimum partitioning of clusters and centroid assignment is reached.
We can represent this approach by the following algorithm.
Initialization: Given the number of centroids N and their values A(m), a distortion threshold, e>0, and the points for all clusters, set iteration number m=0 and distortion value
Given the set of centroids, find the minimum distortion partition S(m) for the centroid set A(m) and compute the resulting distortion, D(m) ,based on distortion criterion.
If (D(m-1)-D(m))/D(m) < e, halt with the A(m) and S(m) representing the final centroids and partitions. Otherwise continue.
Replace m by m+1 and find the optimum centroids A(m) for the partition S(m).
Given the current partitioning of clusters, find the optimum centroid of the clusters and go to step 1.
The value of the threshold, e, depends on designer's choice, and 0.05 may be considered as a commonly used value. The resulting partitioning and centroid would provide at least a local minimum, if not a global optimum.
Given the above algorithm, we will be able to find the best partitioning and the best representation for the partition. The important part that is missing in the above algorithm are the initial centroids and initial clusters, without which we cannot execute the algorithm. In this case, we let the cluster samples/vectors guide us to the partitioning and centroid selection using the following algorithm which is known as splitting algorithm to obtain N partitions.
Initialization: Set the number of centroids M=1, and define A(1) to be the centroid of the entire available points/vectors.
Given A(M) containing M points, perturb each a(M) point to (1+d)a(M), where d is a small number. Given the resulting 2M vectors replace M by 2M.
Run “Iterative Algorithm” for M points to obtain new A(M).
Is M=N? If so, halt with the result of the previous step as the final clustering and its representation. Otherwise, go to step 1.
The preceding algorithm creates the clustering along with the optimum centroid for each cluster. You observed that the size of the clusters using the above algorithm will be a power of 2 due to the splitting process. Of course, we can select any size for the resulting number of centroids by changing the splitting process in step 1 to perturb only one the centroids at a time, possibly the one that exhibits the maximum distortion. Using this approach the number of clusters is increased by one unit in each iteration until the desired number of clusters is achieved.
As discussed before, the final number of clusters is hard to determine. The user needs to decide how many clusters would suit the best for the application in consideration. As an alternative, we can consider growing the number of clusters until the desired threshold on total distortion for the clusters is reached.
We discussed two algorithms that provide an optimal partitioning of clusters and assignment of centroids for each cluster based on some predefined distortion measure.
I hope this discussion was helpful. Best Regards, Faramarz Azadegan, Ph.D., [email protected]