Preliminary findings about Zipf's Law (a heavy-tailed distribution)

In this blog post I summarize my current findings on the Zipf's law clustering process.

I have run some extensive simulations of the clustering of entities, which should lead to Zipf's law.

To summarize the simulated clustering process (which complies with the guidelines Dr. Granville provided in his post):

start with a vector of length 1e6, with all entries equal to 1
in a for loop do:
  select two random integers from the set of vector indices
  if the corresponding clusters are similar in size do:
    if a Bernoulli sample with probability p = 0.8 equals one do:
      combine the clusters (sum clusters 1 and 2, store the sum at the position of cluster 1, remove the position of cluster 2)
  else do (i.e. they are not similar in size):
    if a Bernoulli sample with probability q = 0.3 equals one do:
      combine the clusters (sum clusters 1 and 2, store the sum at the position of cluster 1, remove the position of cluster 2)

repeat the for loop for 20e6 iterations, or stop earlier if the vector of clusters has length one.
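The loop above can be sketched in a few lines of Python (the original simulations were done in R; the function name, the small vector length, and the seed here are illustrative only):

```python
import random

def simulate(n=1000, iters=20_000, p=0.8, q=0.3, X=3.0, seed=42):
    """Run the merge process on a vector of n unit-size clusters."""
    rng = random.Random(seed)
    clusters = [1] * n
    for _ in range(iters):
        if len(clusters) == 1:
            break  # everything has merged into a single cluster
        i, j = rng.sample(range(len(clusters)), 2)
        ratio = clusters[i] / clusters[j]
        similar = 1.0 / X < ratio < X
        prob = p if similar else q
        if rng.random() < prob:  # Bernoulli(prob) sample equals one
            clusters[i] += clusters[j]  # sum the two clusters
            clusters.pop(j)             # remove the second position
    return clusters

sizes = simulate()
```

Note that merging only moves mass around, so the total number of entities (the sum of the vector) stays constant throughout the run.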

I define the similarity in size of two clusters in a very straightforward way:

  • divide the cluster sizes: size1 / size2
  • they are similar if size1 / size2 is bigger than 1/X and at the same time smaller than X
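This ratio test can be written as a small predicate (a Python sketch; the function name is my own):

```python
def similar_in_size(size1, size2, X):
    """Clusters are similar if 1/X < size1/size2 < X."""
    ratio = size1 / size2
    return 1.0 / X < ratio < X
```

Note the test is symmetric: swapping size1 and size2 replaces the ratio by its reciprocal, which stays inside the interval (1/X, X).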

In the simulations I ran, X was either 1.1 or 3. When X is 3, many more cluster pairs count as similar, and thus the combinations of clusters happen more rapidly in time (I define time as the iteration count of the for loop), since p > q.

This way, we end up with a vector of size 1 after approx. 1.4e6 iterations of the for loop when X was 3, and approx. 2.2e6 when X was 1.1 (approximately, because the clustering is a stochastic process). Every 1e4 iterations I save the current state of the vector of clusters, and I subsequently plot these snapshots in the GIFs you can see below. In these GIFs, size is the number of entities in a cluster, frequency is the number of times a cluster of that size appears in the vector of clusters, and time is the iteration of the for loop.
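The size–frequency pairs plotted in the GIFs can be computed from a snapshot of the cluster vector with a simple tally (a Python sketch; the example snapshot is made up):

```python
from collections import Counter

def size_frequency(clusters):
    """Map each cluster size to the number of clusters of that size."""
    return Counter(clusters)

snapshot = [1, 1, 1, 2, 2, 4, 8]
freq = size_frequency(snapshot)
# freq[1] == 3, freq[2] == 2, freq[4] == 1, freq[8] == 1
```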

We immediately notice the linear trend between log(frequency) and size, except at the very end of the time scale. In the end the clustering process leads to a single cluster containing all entities. I guess this could be compared to a monopoly or a sort of black hole. This is also where I see some problems with the analogy to users of a social network, e.g. users of DSC and their company affiliation, as actual clustering between companies seems more far-fetched here. To have more flexibility, I would insert an extra hierarchy into the model: a clustering between companies and, within each company, a clustering of members in the network. The differences, if any, with the one-level model could be interesting to see.

Moving forward, I believe the crux of this story lies in the definition of "similarity" between two clusters and in the values of p and q. At this moment I believe p (0.8) and q (0.3) are set a little too high, although in an actual clustering process the correctness of p and q would also depend on the resolution of the "time" variable. In terms of similarity there are many possibilities, and it would be interesting to see whether these all lead to distinct clustering phenomena. Other "similarities" could be the absolute difference in the sizes of two clusters, or perhaps adding an extra variable to each cluster that is updated over time (e.g. the cost structure of a company that is reorganised after some time, or stars that age) and only defining two clusters as similar if this extra variable is similar. Such an extra variable also introduces an extra hierarchy into the model.
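One of the alternatives mentioned above, similarity by absolute size difference, could look like this (a Python sketch; the threshold D is an assumed parameter, not part of the original model):

```python
def similar_abs(size1, size2, D=5):
    """Similar if the absolute size difference is at most D (D is an assumed threshold)."""
    return abs(size1 - size2) <= D
```

Unlike the ratio-based test, this criterion becomes ever harder to satisfy as clusters grow, so large clusters would mostly merge via the lower probability q.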

As for the parallelisation of these simulations (in essence, parallelising the for loop, especially in a MapReduce context), we can look at distributed MCMC: see e.g. here, here and here. It should be noted that this parallelisation is not very straightforward, as each step depends on the previous one (which is inherent to MCMC).

All these simulations were done in R; I can of course share the code if wanted. Any comments/feedback are more than welcome. In terms of scientific work, this might lead to some sort of publication.

The clustering process for X = 1.1:

The clustering process for X = 3:
