In this article, an *R-hadoop* (with *rmr2*) implementation of **Distributed KMeans Clustering** will be described with a small sample dataset.

- First the dataset shown below is *horizontally partitioned* into 4 data subsets and they are copied from *local* to *HDFS*, as shown in the following animation. The dataset chosen is small and serves only as a *POC*, but the same concept can be used to cluster huge datasets.

- The partitioned dataset is to be clustered into *K=3* clusters, and the first **3 initial centroids** are randomly generated.
- Next, the **KMeans clustering** algorithm is *parallelized*. The algorithm consists of two key steps:

Step 1, *Cluster Assignment*:

- In this step, each data point is assigned to the *nearest cluster* *center*.
- This step can be carried out for each data point independently.
- This can be designed using the *Map* function (there are *4 such* **map jobs** created), where the points from each of the 4 data subsets are assigned in parallel to the nearest cluster center (each map job knows the coordinates of the initial cluster centroids created).
- Once each data point is assigned to a **cluster** *centroid*, the map job *emits* each of the data points with the assigned **cluster label** as the *key*.
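The assignment step above can be sketched in a few lines. This is a plain-Python stand-in for the logic of one map job, not the rmr2 code from the article; the function name `kmeans_map`, the toy points and centroids, and the use of Euclidean distance for "nearest" are all assumptions made for illustration.

```python
import math

def kmeans_map(points, centroids):
    """Emit one (cluster_label, point) pair for each point of a data subset."""
    pairs = []
    for p in points:
        # Assumption: "nearest" means smallest Euclidean distance.
        label = min(range(len(centroids)),
                    key=lambda j: math.dist(p, centroids[j]))
        pairs.append((label, p))  # the cluster label is the key
    return pairs

# Hypothetical centroids and one hypothetical data subset.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
subset = [(1.0, 1.0), (4.0, 6.0), (9.0, 1.0)]
emitted = kmeans_map(subset, centroids)
print(emitted)
# [(0, (1.0, 1.0)), (1, (4.0, 6.0)), (2, (9.0, 1.0))]
```

Each point is handled independently of the others, which is why the 4 data subsets can be processed by 4 map jobs at the same time.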

Step 2, *Cluster Centroid (Re-) Computation*:

- In this step, the *centroids* for each of the **clusters** are **recomputed** from the points assigned to the cluster.
- This is done in the *Reduce* function, where each cluster's data points come to the **reducer** as a collection of all the data points assigned to the cluster (corresponding to the **key** emitted by the **map** function).
- The reducer **recomputes** the *centroid* of each **cluster**, corresponding to each key.
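The recomputation step can be illustrated in the same way. The sketch below is again plain Python rather than rmr2: `shuffle` is a hypothetical helper that mimics how the framework groups emitted pairs by key, and `kmeans_reduce` assumes the new centroid is the coordinate-wise mean of the points in the group.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group the emitted points by key, as the framework does before reducing."""
    groups = defaultdict(list)
    for key, point in pairs:
        groups[key].append(point)
    return groups

def kmeans_reduce(key, points):
    """Return (key, centroid), the centroid being the mean of the points."""
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points)
                     for d in range(dims))
    return key, centroid

# Hypothetical (cluster_label, point) pairs, as a map step might emit them.
pairs = [(0, (1.0, 1.0)), (0, (3.0, 1.0)), (1, (4.0, 6.0)), (1, (6.0, 4.0))]
new_centroids = dict(kmeans_reduce(k, pts) for k, pts in shuffle(pairs).items())
print(new_centroids)
# {0: (2.0, 1.0), 1: (5.0, 5.0)}
```

There is one reduce call per key, so each cluster's centroid is recomputed independently of the other clusters.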


The **steps 1-2** above are **repeated till** *convergence*, so this becomes a **chain of map-reduce jobs**. The next figures show the map-reduce steps, first for a single iteration and then for the entire algorithm.
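As a rough illustration of the chain, the sketch below repeats a map round and a reduce round until the centroids stop moving. It is a single-process Python simulation, not the rmr2 implementation: the toy data, the function names, the tolerance `tol`, the cap `max_iter`, and the rule that an empty cluster keeps its old centroid are all assumptions, and the article does not state which convergence test it uses.

```python
import math
import random
from collections import defaultdict

def map_step(subset, centroids):
    """Map: label every point of one data subset with its nearest centroid."""
    return [
        (min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j])), p)
        for p in subset
    ]

def reduce_step(pairs, centroids):
    """Reduce: average the points grouped under each cluster label."""
    groups = defaultdict(list)
    for label, p in pairs:
        groups[label].append(p)
    updated = list(centroids)  # assumption: an empty cluster keeps its old centroid
    for label, pts in groups.items():
        updated[label] = tuple(sum(coord) / len(pts) for coord in zip(*pts))
    return updated

def distributed_kmeans(subsets, centroids, tol=1e-9, max_iter=100):
    """Chain map-reduce rounds until no centroid moves by more than tol."""
    for iteration in range(1, max_iter + 1):
        # The map jobs run one after another here; on a cluster they run in parallel.
        pairs = [kv for subset in subsets for kv in map_step(subset, centroids)]
        updated = reduce_step(pairs, centroids)
        shift = max(math.dist(old, new) for old, new in zip(centroids, updated))
        centroids = updated
        if shift <= tol:
            break
    return centroids, iteration

rng = random.Random(42)
# Hypothetical toy data: 20 points around each of three made-up centers.
points = [
    (rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
    for cx, cy in [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
    for _ in range(20)
]
subsets = [points[i::4] for i in range(4)]  # 4 row-wise data subsets
initial = [(rng.uniform(-2, 12), rng.uniform(-2, 7)) for _ in range(3)]  # K=3
final, n_iter = distributed_kmeans(subsets, initial)
print(n_iter, final)
```

The only state carried from one round to the next is the list of centroids, which is what lets every iteration run as a separate map-reduce job.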

- The next animation shows the *first 5 iterations* of the *map-reduce chain*.

  - In every iteration, the cluster labels assigned to each of the points in each of the data subsets are obtained from the corresponding *map job*.

  - Then the **updated (recomputed) cluster** *centroids* are obtained from the corresponding **reduce job** for each of the clusters, in the same iteration.


© 2019 Data Science Central
