In this blog post I summarize my current findings in the Zipf's law clustering process.

I have done some extensive simulations on clustering of entities, which should lead to Zipf's law..

To summarize the simulated clustering process (which complies to the guidelines dr. Granville provided in his post):

startwith a vector with all 1's with length 1e6in a for loop do:

select to random integer numbers from the set of vector indices

if the corresponding clusters are similar in size do

if a bernouilli sample with probability p=0.8 is equal to one do

combine clusters (sum cluster 1 and 2, put in the vector at position of 1, remove position 2)

else do (what if not similar in size)

if a bernouilli sample with probability q=0.3 is equal to one do

combine clusters (sum cluster 1 and 2, put in the vector at position of 1, remove position 2)

repeatfor loop for 20e6 iterations, or stop if the vector of clusters has length one.

I define the similarity in size of clusters in a very straightforward way:

- divide the cluster sizes: size1 / size2
- they are similar if size1 / size 2 is bigger than 1/X and at the same time size1 / size2 is smaller than X

In the similations I have done X was 1.1 and X was 3. When X is 3, this means a lot more cluster combinations are seen as similar, and thus the combinations of clusters go more rapidly in time (I define time as the iterations of the for loop), as p>q.

This way, we end up with a vector size of 1 after approx. 1.4e6 iterations of the for loop when X was 3 and 2.2e6 when X was 1.1 (approx. because the clustering is a stochastic process). After every 1e4 iterations I save the current state of the vector of clusters and I plot them subsequently in a .GIFs, which you can see below. In these .GIFs we have size as amount of entities in the cluster, frequency the amount of times a cluster with a corresponding size appears in the vector of clusters and time the iteration of the for loop.

We immediatly notice the linear trend between log(frequency) and size, except at the very end of the time scale. In the end the clustering process leads to a cluster that has all entities. I guess this could be compared to a monopoly or a sort of black hole. This is also where I see some problems with the analogy with user of a social network, f.e. users of DSC and their company affiliation, as actual clustering between companies seems more farfetched here. To have more flexibility, I would insert an extra hierarchy in the model: a clustering between companies and a clustering of members in the network within one company. The differences, if any, with the one-level model could be interesting to see.

Moving forward, I believe the crux of this story lies in the definition of "similarity" between two clusters and the values of p and q. At this moment I believe p (0.8) and q (0.3) are defined a little too high, although in an actual clustering process the correctness of the definition of p and q would also depend on the resolution of the "time"-variable. In terms of similarity there are a lot of possibilities and it would be interesting to see if these all lead to distinct clustering phenoma, other "similarities" could be the absolute difference in sizes of two clusters or perhaps adding an extra variable to each cluster that is updated in terms of time (i.e. the cost structure of a company that is reorganised after some time or stars that age) and f.e. only define similar cluster if this extra variable is similar. This extra variable also introduces an extra hierarchy in the model.

As for the parallelisation of these simulations, in essence parallelising the for loop, especially in a MapReduce context, we can look at distributed MCMC: see f.e. here, here and here. It should be noted that this parallelisation is not very straightfoward as each step depends on the previous one (which is in essence inherent to MCMCs).

All these simulations where done in R, I can of course share the code if wanted. Any comments/feedback are more than welcome. In terms of scientific work, this might lead to some sort of publication.

The clustering process for X = 1.1

The clustering process for X = 3:

© 2020 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Statistics -- New Foundations, Toolbox, and Machine Learning Recipes
- Book: Classification and Regression In a Weekend - With Python
- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Upcoming DSC Webinar**

- DataOps: How Bell Canada Powers their Business with Data - July 15

Demand for data outstrips the capacity of IT organizations and data engineering teams to deliver. The enabling technologies exist today and data management practices are moving quickly toward a future of DataOps. DataOps is an automated, process-oriented methodology, used by analytic and data teams, to improve the quality and reduce the cycle time of data analytics. Register today.

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central