.

# Why Zipf's law explains so many big data and physics phenomenons

The Zipf's law states that in many settings (that we are going to explore), the volume or size of entities is inversely proportional to a power s (s > 0) of their ranking. This has important implications in predictive modeling, discussed below. The processes that create this type of dynamic are not well understood. It is the purpose of this article to explain the underlying mechanics. The traditional example for the Zipf distribution is the distribution of Internet domains, ranked by traffic. By its very nature, Zipf's law is also a feature of big data: it can't really be observed in small data sets.

In the following example - based on real data from the DSC member database - we will see that this law is accurate for the 3 metrics that were expected to follow the law: members per company, per city or per country. In particular, it shows that the distribution of DSC members by company, very closely follows a Zipf distribution, where the number of DSC members, for a given company, is proportional to r^{-s}, where r is the company rank, and s = 0.7, see figure 1.

Figure 1: Example of Zipf distribution with parameter s = 0.7.

The Zipf distrubution is characterized by a rank decay well modeled by an inverse power function of parameter s > 0. It applies to the following settings.

1. General Setting

There are k atoms (e.g. DSC members) distributed over m active entities (e.g. companies). By active entities, we mean (in our example) companies with at least one DSC member; those with 0 member are not taken into account. Both k and m are very large, with k much bigger than m. Indeed, generally m tends to infinity as m tends to infinity. A few entities are huge (IBM, Accenture, Oracle in Figure 1), yet there are tons of entities with very few atoms.

Another example where Zipf's law applies is Internet domains: a few domains like Google, Facebook, Twitter, LinkedIn, and Yahoo dominate, with Google being the big king (in terms of monthly users or pageviews) based on Alexa ranks, while billions of small websites barely get any traffic, even when combined together.

Note that if m tends to infinity and s < 1, the Zipf distribution does not make sense from a mathematical point of view, as no moment exist. It must be regarded as a stochastic process in that case, and the explanations regarding how the process is generated still do make sense. It's only a mathematical annoyance, but not causing any problems for modeling or predicting purposes.

We have done simulations to try to understand the mechanism, and our conclusion is as follows.

2. What causes some systems to abide by Zipf's law?

Zipf's law also applies to celestial bodies in the solar system, because the process is very similar to the way companies are created and evolve, involving mergers and acquisitions. Here's how it works, described in algorithmic terms, applied to companies, and celestial bodies alike. But it applies to many other frameworks as well.

1. Birth: A bunch of tiny, one-man companies exist initially, long ago (comparable to hydrogen molecules, or space dust, if we use the celestial body analogy)
2. They cluster due to some attraction forces
3. They cluster even more
4. They cluster and cluster again; big clusters tend to merge with other big clusters, but plenty of tiny companies from step 1 (or remnants of bankrupties/planetary collisions) are still there and still tiny; some tiny ones are absorbed by big ones
5. And they cluster even more
6. At some point, we have a few giants (IBM, Microsoft, Walmart - comparable to the Sun, Jupiter or Saturn in our analogy) and still plenty of tiny ones (DSC, comparable to a 50-feet wide space rock, that is definitely getting bigger)

How these entities may grow over time is illustrated in the following Youtube video, and explained in this article

Evidently, this works only for certain metrics and settings meeting the requirements about k and m, as described in the previous section. That's why it works for

• companies, cities, and states - ranked by number of DSC nembers,
• Internet domains - ranked by traffic,
• celestial bodies - ranked by mass,
• English words - ranked by frequency,

but not for age, salary or gender distributions. Indeed, if you perform simulations with multiple layers of clustering, you won't be able to replicate Zipf laws unless there's a rule in your simulations that favors big clusters to lump together over successive iterations. Instead, rather than a Zipf distribution, you would end up with something that looks like a rotated S curve.

3. Data

Download our spreadsheet. It features the distribution of DSC members (for a specific channel, not disclosed here) per company (see top 10 companies in Figure 1 above), city (worldwide) and country - by itself a very interesting data set.

For each of the three fields (company, city, country), model-fitting with various distributions was attempted, using both Excel trendlines and mean-square errors (which tend to not work great in contexts like these with highly skewed distributions - instead, I recommend to use this L1 metric rather than R-Squared). Nevertheless, it is obvious that the Power trendline in Excel offers the best fit in each instance, far better than linear, exponential, logarithmic or polynomial trendlines.

4. Simulations

The challenge of the week, this week, consists of doing simulations to replicate the special clustering process descr..., resulting in Zipf distributions. More specifically, we ask you to perform the following simulations to assess whether our assumptions are correct. Alternatively, a mathematical proof is OK.

Let's assume that we have k = one million atoms, for instance space dust particles.

Algorithm (write it in Perl, R, Python, C or Java; test it)

Step #1

Each particle is assigned a unique bin ID between 1 and 1,000,000. Each particle represents a cluster with one element (the particle), and the bin ID is its cluster ID.

Step #2

Iteration: repeat 20,000,000 times:

• Randomly select two integers i and j between 1 and current number of clusters.
• If size of cluster i and cluster j are similar, merge these clusters with probability p > 0.8, Otherwise, merge these clusters with probability q < 0.3
• Update cluster list

Once the algorithm stops, the final cluster configuration represents the current solar system, or companies in US - most very small, a few very big.

Question: can this cluster process simulation algorithm be adapted to a distributed environment, say Hadoop?

5. Implications for Predictive Modeling

Structures suspected to be generated by such clustering processes can be modeled using hierarchical Bayes models and simulations (MCMC) to fit model parameters with data, assess goodness-of-fit with Zipf distribution, and then predict the evolution of the system - or at least some of its core properties such as mean. This might require adding a time dimension in the simulations (represented for instance by the iteration, in the above algorithm).

Views: 23105

Tags: predictive modeling

Comment

Join Data Science Central

Comment by Ricardo Perez-Marco on June 5, 2015 at 5:41am

Dear all, I just registered to let you know that I have a simple model that explains Pareto distribution (and Zipf's type of law)

http://arxiv.org/abs/1409.4857

Best regards.

Comment by Vincent Granville on September 22, 2014 at 10:02am

Note that Zipf's law is sometimes referred to as the thick-tail distribution, for instance in the context of keyword distribution, where a few thousands popular keywords dominate, and millions of keywords are relatively rarely used. Yet these millions of low-frequency keywords, when combined together, represent a significant proportion of the volume (keyword usage). This is a big contrast with the Gaussian (short-tail) distribution.

Comment by Alexander Kashko on September 4, 2014 at 8:11am

Michael: I think, but it is only an intuition, that all these power laws result from interactions between agents or particles. Data Scientists do not need to know these concepts but  may  find them useful

Comment by Michael Kremliovsky on September 2, 2014 at 8:49am

A good question is how many "data scientists" know what Ising, percolation, turbulence, and self-organized criticality are... Always wanted to write on the connections between physics and business to make it a little more apparent :)

Comment by Alexander Kashko on September 2, 2014 at 5:42am

This reminded me of my PhD thesis which involved Dielectruc relaxation. Informally consider  a collection of say  interacting magnets on a grid whose alignment is determined by the alignments of their neighbours (The Ising Model for example). If the environment (heat bath) flips magnets at random then at short times the decay towards an equilibrium is a power law and at long times it is a different power law. In between the decay is exponential.

At long times the power law arises from the exchange of  spins and subclusters between clusters.  The exact value of the power law exponent depends on the substance involved.

Hopefully I will  get time to investigate this further

Comment by Michael Kremliovsky on August 25, 2014 at 11:15am

Whenever you see a power law, you should think about Percolations and their different incarnations (self-organized criticality, turbulence, and so forth): http://en.wikipedia.org/wiki/Percolation_theory

Comment by Atabey Kaygun on August 24, 2014 at 10:59pm
Thanks for the great blog entry. It was fun writing the code, which took about an hour anyway. An insomniac has to find a way to keep him busy :)

As for the difference between the extreme case (u,v)=(1.0, 0.0) and what you suggest (u,v)=(0.8, 0.3): I have no idea. I wish I was as smart as you give me credit for :)
Comment by Vincent Granville on August 24, 2014 at 7:23pm

Thanks A.K. for all your efforts on this subject. If you can bring more light to this mysterious but widespread distribution, I'm sure you could write a seminal paper on this.

Comment by Atabey Kaygun on August 24, 2014 at 1:12am

Hi Vincent. Think of the extreme case of u=1.0 v=0.  Then the final distribution of the sizes of clusters from an initial cluster N would give the binary expansion of N. For example, if N was 1023 then there is exactly one cluster of size 512, 256, ... , 2 and 1.  This is definitely a power decay. I think the v parameter (probability that non-similar clusters merge) determines if one gets exponential decay or Zipf distribution. I am curious as to why that is.

Comment by Vincent Granville on August 23, 2014 at 8:02pm

Atabey Kaygun posted a solution to this problem. Very interesting, in the sense that it produced exponential rank decay rather than power decay (Zipf's distribution) as expected. My own simulations (that did not lump big clusters together) resulted in the exact opposite solution, with almost linear decay.

I think Atabey's solution was too agressive in grouping clusters of similar sizes, grouping them only if both sizes were nearly identical, and using a very low v=0.001 (I would think v=0.300 is more realistic, but haven't checked so far). It would be interesting to see an algorithm that groups clusters A and B  even if size of A is 3 or 4 times bigger than size of B. Indeed, that's what happens in practice with planetary collisions in the solar system billions years ago, or with company mergings.