Python: Implementing a k-means algorithm with sklearn

Below is an example of how sklearn in Python can be used to build a k-means clustering model.

The purpose of k-means clustering is to partition the observations in a dataset into a specified number of clusters in order to aid analysis of the data. It is particularly valuable for data visualisation.

This post explains how to:

1. Import KMeans and PCA from the sklearn library
2. Devise an elbow curve to select the optimal number of clusters (k)
3. Generate and visualise a k-means clustering algorithm

The particular example used here is that of stock returns. Specifically, the k-means scatter plot will illustrate the clustering of specific stock returns according to their dividend yield.

1. Firstly, we import the pandas, pylab and sklearn libraries. Pandas is for the purpose of importing the dataset in csv format, pylab is the graphing library used in this example, and sklearn is used to devise the clustering algorithm.

```
import pandas
import pylab as pl
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
```

2. Then, the 'sample_stocks.csv' dataset is imported, with our Y variable defined as 'returns' and X variable defined as 'dividendyield'.

```
variables = pandas.read_csv('sample_stocks.csv')
Y = variables[['returns']]
X = variables[['dividendyield']]
```
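If the csv file is not to hand, the remaining steps can still be tried on a stand-in dataset. The sketch below is only an assumption for experimentation: the values are synthetic and randomly generated, and only the column names 'returns' and 'dividendyield' are taken from this post.

```python
import numpy as np
import pandas

# Hypothetical stand-in for sample_stocks.csv: synthetic values only,
# with the same column names as the dataset used in this post.
rng = np.random.default_rng(0)
dividend_yield = rng.uniform(0.0, 8.0, size=200)
returns = 2.0 * dividend_yield + rng.normal(0.0, 3.0, size=200)

variables = pandas.DataFrame({'returns': returns, 'dividendyield': dividend_yield})
Y = variables[['returns']]
X = variables[['dividendyield']]
print(variables.head())
```

Any plots produced from this stand-in will of course differ from those obtained with the real data.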

3. The elbow curve is then graphed using the pylab library. Specifically, we fit k-means for each number of clusters from 1 to 19 (range(1, 20) excludes the upper bound), and our score variable holds the negative of the within-cluster sum of squared distances for each fit, so values closer to zero indicate tighter clusters.

```
# Candidate numbers of clusters: 1 to 19
Nc = range(1, 20)
kmeans = [KMeans(n_clusters=i) for i in Nc]
# score() returns the negative within-cluster sum of squares for each fit
score = [model.fit(Y).score(Y) for model in kmeans]
pl.plot(Nc, score)
pl.xlabel('Number of Clusters')
pl.ylabel('Score')
pl.title('Elbow Curve')
pl.show()
```

When we graph the plot, we see that the curve levels off rapidly after 3 clusters, implying that adding more clusters does not explain much more of the variance in our variable of interest, in this case stock returns.
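To make the quantity on the vertical axis more concrete, here is a minimal sketch on synthetic data (three well-separated groups, an assumption made purely for illustration and not the stock dataset). It compares score() with the fitted model's inertia_ attribute and prints how much the inertia falls with each additional cluster; the n_init and random_state arguments are set here only to keep the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-column data with three well-separated groups
# (illustrative only, not the stock returns used in this post).
rng = np.random.default_rng(0)
data = np.concatenate(
    [rng.normal(m, 0.5, 50) for m in (-10.0, 0.0, 10.0)]
).reshape(-1, 1)

inertias = []
scores = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(model.inertia_)    # within-cluster sum of squares
    scores.append(model.score(data))   # negative of the same quantity

# Reduction in inertia gained by each additional cluster
drops = -np.diff(inertias)
print(inertias)
print(scores)
print(drops)
```

On data like this, the large reductions should stop once k reaches the true number of groups, which is the "elbow" being read off the curve.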

4. Once the appropriate number of clusters has been identified (k=3), the PCA (Principal Component Analysis) and k-means models can be built.

These two algorithms serve different purposes. PCA is used to convert data that might be overly dispersed into a set of linear combinations that can be interpreted more easily.

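```
# .values passes plain arrays, so PCA does not compare the column names of Y and X
pca = PCA(n_components=1).fit(Y.values)
pca_d = pca.transform(Y.values)
pca_c = pca.transform(X.values)
```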
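It is worth keeping in mind what PCA does when it is given a single column, as in this example. The sketch below uses a small hypothetical array rather than the stock data, and checks that the one-component projection is just the mean-centred column, up to a possible sign flip.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical single-column data standing in for the 'returns' column.
values = np.array([[1.0], [2.0], [4.0], [9.0]])

projected = PCA(n_components=1).fit_transform(values)
centred = values - values.mean(axis=0)

# With one input column, the projection equals the mean-centred column,
# up to a possible sign flip, so absolute values are compared.
print(np.allclose(np.abs(projected), np.abs(centred)))
```

In other words, no dimensionality reduction takes place with one input column; PCA becomes more useful once several columns are passed in together.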

From step 3, we know that the elbow curve suggests 3 clusters. We therefore set n_clusters equal to 3 and, after fitting k-means, use the PCA-transformed data to plot the clusters:

```
# Fit k-means with the 3 clusters suggested by the elbow curve
kmeans = KMeans(n_clusters=3)
kmeansoutput = kmeans.fit(Y)
pl.figure('3 Cluster K-Means')
# Colour each point by its assigned cluster label
pl.scatter(pca_c[:, 0], pca_d[:, 0], c=kmeansoutput.labels_)
pl.xlabel('Dividend Yield')
pl.ylabel('Returns')
pl.title('3 Cluster K-Means')
pl.show()
```

From the above, we see that the plot shows an overall positive correlation between stock returns and dividend yields, suggesting that stocks paying higher dividend yields tend to have higher overall returns in this dataset. While this is a simple example that could also be modelled through linear regression, there are many instances where relationships in the data are not linear, and k-means can serve as a valuable tool for understanding the data through clustering.
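Reading k off the elbow curve in step 3 is a manual judgement. One possible way to automate the choice, which is not the method used in this post, is to pick the k with the highest average silhouette score. The sketch below runs on synthetic data with three well-separated groups (an assumption for illustration); the range of candidate k values, n_init and random_state are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic one-column data with three well-separated groups
# (illustrative only, not the stock returns used in this post).
rng = np.random.default_rng(0)
data = np.concatenate(
    [rng.normal(m, 0.5, 50) for m in (-10.0, 0.0, 10.0)]
).reshape(-1, 1)

silhouettes = {}
for k in range(2, 8):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    silhouettes[k] = silhouette_score(data, labels)

# Choose the k with the highest average silhouette score
best_k = max(silhouettes, key=silhouettes.get)
print(silhouettes)
print(best_k)
```

The selected value could then be passed to KMeans(n_clusters=best_k) in place of the hard-coded 3, although on real data the silhouette and the elbow curve will not always agree.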

Tags: clustering, kmeans, pca, pylab, python, sklearn

Comment by Michael Grogan on June 14, 2018 at 8:50am

Hi Bhanu,

You can find the link here with the dataset included: http://www.michaeljgrogan.com/k-means-clustering-python-sklearn/

Comment by BhanuTeja Kasani on June 10, 2018 at 2:23pm

Hi,

Where can I find the "sample_stocks.csv" file?

This link no longer hosts that file: http://www.michaeljgrogan.com/kmeans-wss-clustering/

Regards,

Bhanu.

Comment by Tim hockswender on June 24, 2017 at 6:10am

Thanks for the work and the location for the file.

Tried it and all worked OK.

Comment by Michael Grogan on June 23, 2017 at 10:41am

Hi Tilak,

Very interesting question and admittedly one I’d like to research further myself.

There is quite a bit of debate about how one should go about this, but AK-means and use of the gap statistic appear to be growing in popularity.

Here are a couple of useful links:

http://statweb.stanford.edu/~gwalther/gap

https://www.sciencepubco.com/index.php/JACST/article/view/4749/1860

Comment by Michael Grogan on June 23, 2017 at 10:22am

Hi all,

The sample_stocks.csv file is available at the following link: http://www.michaeljgrogan.com/kmeans-wss-clustering/

The linked post shows how the above example can also be replicated in R, and you'll find the dataset at the end of the page.

Many thanks,

Michael

Comment by Tilak Mitra on June 23, 2017 at 5:22am

Michael,

In addition to locating the sample file, I was wondering whether there is an automated means of choosing the optimal K and feeding the same into the subsequent analysis.

This example assumes a manual inspection of the slope curve to find out the optimum K before feeding the value into the rest of the program.

Comment by Tilak Mitra on June 23, 2017 at 5:15am

Michael,

Comment by Shantanu Karve on June 22, 2017 at 6:01pm

I was just revisiting this well-known technique, but for feature engineering purposes, and it's worth noting that in Python there's a much speedier version that's useful for larger datasets. http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Mi...

Here's the blurb for it:

The `MiniBatchKMeans` is a variant of the `KMeans` algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration. These mini-batches drastically reduce the amount of computation required to converge to a local solution. In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.

Comment by Tim hockswender on June 22, 2017 at 10:44am

Hello, I'm a new member and would like to know the availability of the datafile 'sample_stocks.csv'.

Is it local to your blog or from some data repository?

Thanks.

Tim