
When the k-means clustering algorithm runs, it picks the starting centroids of the clusters at random, so the outcome depends on the seed of the random number generator.

If the feature variables exhibit patterns that naturally group the samples into visible clusters, then the starting seed usually has little impact on the final cluster memberships. However, if the data is evenly distributed, we might end up with different cluster members depending on the random starting centroids. An example of such behavior is shown below.
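To make the contrast concrete, here is a minimal sketch in R on synthetic data (not the dataset used in this post). The helper names `pair_agreement` and `compare_runs` are made up for illustration, and the blob and cloud parameters are arbitrary assumptions. It runs k-means twice from different random starts and measures how often the two runs agree on whether a pair of points belongs together.

```r
# Synthetic data only; none of these numbers come from the post's dataset.
set.seed(1)  # fixes the synthetic data so the sketch is repeatable

# Two well-separated blobs versus one evenly spread cloud (100 and 200 points).
blobs <- rbind(
  matrix(rnorm(200, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(200, mean = 10, sd = 0.5), ncol = 2)
)
cloud <- matrix(runif(400, min = 0, max = 10), ncol = 2)

# Share of point pairs on which two clusterings agree.
# Comparing pairs avoids problems with arbitrary cluster label numbering.
pair_agreement <- function(a, b) {
  same_a <- outer(a, a, "==")
  same_b <- outer(b, b, "==")
  mean(same_a[upper.tri(same_a)] == same_b[upper.tri(same_b)])
}

# Run k-means twice, each time from fresh random starting centroids.
compare_runs <- function(x, k) {
  first  <- kmeans(x, centers = k)$cluster
  second <- kmeans(x, centers = k)$cluster
  pair_agreement(first, second)
}

blob_agreement  <- compare_runs(blobs, 2)
cloud_agreement <- compare_runs(cloud, 4)
print(c(blobs = blob_agreement, cloud = cloud_agreement))
```

An agreement of 1 means both runs produced the same grouping. For clearly separated blobs the two runs would be expected to match, while the evenly spread cloud may or may not match depending on the random starts.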

R is used for the experiment. The structure and summary of the loaded data are as follows. We try to group the samples based on two feature variables, age and bmi.

## 'data.frame':    1338 obs. of  2 variables:
##  $ age: int  19 18 28 33 32 31 46 37 37 60 ...
##  $ bmi: num  27.9 33.8 33 22.7 28.9 ...

##       age            bmi
##  Min.   :18.0   Min.   :16.0
##  1st Qu.:27.0   1st Qu.:26.3
##  Median :39.0   Median :30.4
##  Mean   :39.2   Mean   :30.7
##  3rd Qu.:51.0   3rd Qu.:34.7
##  Max.   :64.0   Max.   :53.1

plot(data$age, data$bmi)

As we can see from the above plot, the data points are distributed almost evenly all over the scatter plot. The initial cluster center positions would therefore affect the final cluster shapes and memberships. We run the clustering 4 times to group this data into 4 clusters and plot the cluster outputs here.

par(mfrow = c(2, 2))
for (i in 1:4) {
  clusters <- kmeans(data, 4)
  plot(data$age, data$bmi, col = clusters$cluster)
}

Each time the clustering algorithm runs, it picks new random starting centroids, and that seems to affect the shapes and memberships of the clusters. The first two runs generate the same groups, but the next two give different groupings of the data. Setting the seed explicitly to a specific value (in R, by calling set.seed() before kmeans()) is required to generate the same results every time.
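As a rough sketch of what fixing the seed looks like in practice, the snippet below uses a synthetic stand-in for the age/bmi data (the `demo` data frame and the seed values 123 and 42 are arbitrary assumptions, not part of the original experiment). It also shows the `nstart` argument of `kmeans()`, which tries several random starts and keeps the one with the lowest total within-cluster sum of squares.

```r
# Synthetic stand-in for the age/bmi data; these values are made up.
set.seed(123)
demo <- data.frame(
  age = sample(18:64, 300, replace = TRUE),
  bmi = runif(300, min = 16, max = 53)
)

# Resetting the same seed before each call gives the same starting
# centroids, and therefore the same cluster assignments.
set.seed(42)
run_a <- kmeans(demo, centers = 4)
set.seed(42)
run_b <- kmeans(demo, centers = 4)
same_result <- identical(run_a$cluster, run_b$cluster)
print(same_result)

# nstart = 25 runs 25 random starts and keeps the best one,
# which tends to make the result less sensitive to any single start.
set.seed(42)
best_of_25 <- kmeans(demo, centers = 4, nstart = 25)
print(c(single = run_a$tot.withinss, best_of_25 = best_of_25$tot.withinss))
```

Fixing the seed makes a run repeatable, but it does not make that particular grouping any more meaningful than the others. Using a larger `nstart` is one common way to reduce the dependence on a single random start.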


Replies to This Discussion

I agree with Adam. It looks like the data is essentially random but uniformly so. How was this data generated? I thought with K-Means clustering you choose the first center point based on your knowledge of the data/problem space?

Adam Alloul said:

I would have thought that if your results are dependent on the seed, then you have to add/remove features. Clearly in the dataset you used, there are no visible clusters.
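On the point about choosing the first center points yourself: `kmeans()` in R accepts a matrix of starting centroids in its `centers` argument instead of a cluster count. The sketch below assumes a synthetic stand-in for the age/bmi data, and the values in `start_centers` are hypothetical picks for illustration, not recommendations.

```r
# Synthetic stand-in for the age/bmi data; these values are made up.
set.seed(7)
demo <- data.frame(
  age = sample(18:64, 300, replace = TRUE),
  bmi = runif(300, min = 16, max = 53)
)

# Hypothetical starting centroids: one row per cluster,
# columns in the same order as the data (age, bmi).
start_centers <- matrix(
  c(25, 25,
    25, 40,
    55, 25,
    55, 40),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("age", "bmi"))
)

# With explicit centers there is no random draw, so repeated runs match
# without any call to set.seed().
fit_1 <- kmeans(demo, centers = start_centers)
fit_2 <- kmeans(demo, centers = start_centers)
print(identical(fit_1$cluster, fit_2$cluster))
```

This removes the randomness, but the final clusters still depend on the centers you picked, so it only helps when you have a real reason to prefer those starting points.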

