Here, I’ve used the famous Iris Flower dataset to demonstrate clustering in Power BI using R. I’ve used the K-means clustering method to group the flowers into clusters that correspond to the different species of Iris.
About the dataset: The Iris dataset has 5 attributes (Sepal Length, Sepal Width, Petal Length, Petal Width, Species). The 3 species are Setosa, Versicolor and Virginica. Within each species the Petal Length and Petal Width values are similar, so these two attributes separate the species well. Hence I have used Petal Length for the x axis and Petal Width for the y axis of the plot.
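To check the observation about the petal measurements, here is a small sketch that averages them per species. It assumes R's built-in `iris` data frame (with its column names `Petal.Length` and `Petal.Width`) rather than the Power BI `dataset`; `petal_means` is just an illustrative name.

```r
# Mean petal length and width per species, using R's built-in iris data
# (assumption: the Power BI dataset holds the same values under other column names).
petal_means <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
                         data = iris, FUN = mean)
print(petal_means)
```

If the per-species means are well apart, the two petal columns are a reasonable choice for the axes.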
K-means Clustering: K-means is a non-hierarchical, iterative clustering technique. In this technique we start by randomly assigning the data points to clusters. We know that there are 3 different species in our dataset, so I have taken 3 clusters. The algorithm calculates the centroid (mean) of each cluster, then measures the Euclidean distance from each data point to every centroid and reassigns the point to the cluster whose centroid is nearest. The centroids are then recalculated from the new assignments. This process is repeated until the clusters become stable and no data points need to be moved.
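The steps above can be sketched in a few lines of R. This is only an illustration of the assign-and-update loop under my own assumptions (built-in `iris` data, a random initial assignment, a cap of 100 iterations, and the names `x`, `k`, `centroids` and `assignment`); R's own `kmeans()` function uses a different default algorithm internally.

```r
# Minimal K-means sketch on the two petal columns of R's built-in iris data.
set.seed(42)                                   # reproducible random start
x <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
k <- 3

# Randomly assign every point to one of k clusters (each cluster starts non-empty).
assignment <- sample(rep_len(seq_len(k), nrow(x)))
centroids  <- matrix(0, nrow = k, ncol = ncol(x))

for (iter in 1:100) {                          # iteration cap is an assumption
  # Update step: move each centroid to the mean of its current points.
  for (j in seq_len(k)) {
    if (any(assignment == j)) {                # keep the old centroid if a cluster is empty
      centroids[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }

  # Assignment step: squared Euclidean distance from every point to every centroid.
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
  new_assignment <- max.col(-d, ties.method = "first")  # index of the nearest centroid

  if (all(new_assignment == assignment)) break  # stable: no point changed cluster
  assignment <- new_assignment
}

print(table(assignment))                       # cluster sizes
```

A single random start like this can settle on a poor grouping, which is why the `kmeans()` call in the visual passes `nstart = 20` to try several starts and keep the best one.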
R visual: In the visual we can see how the species are separated after clustering. Here cluster 1 is Setosa, cluster 2 is Versicolor and cluster 3 is Virginica (the cluster numbers themselves are arbitrary and can change between runs). We can also see that the algorithm wrongly assigned a few data points between Versicolor and Virginica.
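Drawback: We see that after clustering a few data points belonging to Versicolor end up in the Virginica cluster and vice versa, because the petal measurements of these two species overlap. K-means is an unsupervised learning technique, so it is better suited to cases where the labels are not known in advance and where we have a large dataset.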
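library(ggplot2)
fit <- kmeans(dataset[, 3:4], centers = 3, nstart = 20)  # columns 3:4 are the two petal columns; best of 20 random starts
dataset$Clusters <- factor(fit$cluster)  # store the cluster labels as a discrete variable for colouring
ggplot(dataset, aes(PetalLength, PetalWidth, color = Clusters)) + geom_point(shape = 17, size = 4)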
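To see how many points are misassigned, as mentioned under the R visual and the drawback, the clusters can be cross-tabulated against the known species. The sketch below is a rough check outside Power BI: it assumes R's built-in `iris` data frame (column names `Petal.Length`, `Petal.Width`) instead of `dataset`, and `petals`, `fit` and `confusion` are illustrative names.

```r
# Cross-tabulate K-means clusters against the true species (built-in iris data).
set.seed(42)                                   # reproducible cluster numbering
petals <- iris[, c("Petal.Length", "Petal.Width")]
fit <- kmeans(petals, centers = 3, nstart = 20)

# Rows are the true species, columns are the cluster numbers assigned by kmeans().
confusion <- table(Species = iris$Species, Cluster = fit$cluster)
print(confusion)
```

Each row shows how one species is spread across the clusters, so off-diagonal style counts in the Versicolor and Virginica rows are the misassigned points. The cluster numbers are arbitrary, so the columns may appear in a different order from the species.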