The Zipf's law states that in many settings (that we are going to explore), the volume or size of entities is inversely proportional to a power s (s > 0) of their ranking. This has important implications in predictive modeling, discussed below. The processes that create this type of dynamic are not well understood. It is the purpose of this article to explain the underlying mechanics. The traditional example for the Zipf distribution is the distribution of Internet domains, ranked by traffic. By its very nature, Zipf's law is also a feature of big data: it can't really be observed in small data sets.
In the following example - based on real data from the DSC member database - we will see that this law is accurate for the 3 metrics that were expected to follow the law: members per company, per city or per country. In particular, it shows that the distribution of DSC members by company, very closely follows a Zipf distribution, where the number of DSC members, for a given company, is proportional to r^{-s}, where r is the company rank, and s = 0.7, see figure 1.
Figure 1: Example of Zipf distribution with parameter s = 0.7.
The Zipf distrubution is characterized by a rank decay well modeled by an inverse power function of parameter s > 0. It applies to the following settings.
1. General Setting
There are k atoms (e.g. DSC members) distributed over m active entities (e.g. companies). By active entities, we mean (in our example) companies with at least one DSC member; those with 0 member are not taken into account. Both k and m are very large, with k much bigger than m. Indeed, generally k / m tends to infinity as m tends to infinity. A few entities are huge (IBM, Accenture, Oracle in Figure 1), yet there are tons of entities with very few atoms.
Another example where Zipf's law applies is Internet domains: a few domains like Google, Facebook, Twitter, LinkedIn, and Yahoo dominate, with Google being the big king (in terms of monthly users or pageviews) based on Alexa ranks, while billions of small websites barely get any traffic, even when combined together.
Note that if m tends to infinity and s < 1, the Zipf distribution does not make sense from a mathematical point of view, as no moment exist. It must be regarded as a stochastic process in that case, and the explanations regarding how the process is generated still do make sense. It's only a mathematical annoyance, but not causing any problems for modeling or predicting purposes.
We have done simulations to try to understand the mechanism, and our conclusion is as follows.
2. What causes some systems to abide by Zipf's law?
Zipf's law also applies to celestial bodies in the solar system, because the process is very similar to the way companies are created and evolve, involving mergers and acquisitions. Here's how it works, described in algorithmic terms, applied to companies, and celestial bodies alike. But it applies to many other frameworks as well.
How these entities may grow over time is illustrated in the following Youtube video, and explained in this article.
Evidently, this works only for certain metrics and settings meeting the requirements about k and m, as described in the previous section. That's why it works for
but not for age, salary or gender distributions. Indeed, if you perform simulations with multiple layers of clustering, you won't be able to replicate Zipf laws unless there's a rule in your simulations that favors big clusters to lump together over successive iterations. Instead, rather than a Zipf distribution, you would end up with something that looks like a rotated S curve.
3. Data
Download our spreadsheet. It features the distribution of DSC members (for a specific channel, not disclosed here) per company (see top 10 companies in Figure 1 above), city (worldwide) and country - by itself a very interesting data set.
For each of the three fields (company, city, country), model-fitting with various distributions was attempted, using both Excel trendlines and mean-square errors (which tend to not work great in contexts like these with highly skewed distributions - instead, I recommend to use this L1 metric rather than R-Squared). Nevertheless, it is obvious that the Power trendline in Excel offers the best fit in each instance, far better than linear, exponential, logarithmic or polynomial trendlines.
4. Simulations
The challenge of the week, this week, consists of doing simulations to replicate the special clustering process descr..., resulting in Zipf distributions. More specifically, we ask you to perform the following simulations to assess whether our assumptions are correct. Alternatively, a mathematical proof is OK.
Let's assume that we have k = one million atoms, for instance space dust particles.
Algorithm (write it in Perl, R, Python, C or Java; test it)
Step #1
Each particle is assigned a unique bin ID between 1 and 1,000,000. Each particle represents a cluster with one element (the particle), and the bin ID is its cluster ID.
Step #2
Iteration: repeat 20,000,000 times:
Once the algorithm stops, the final cluster configuration represents the current solar system, or companies in US - most very small, a few very big.
Question: can this cluster process simulation algorithm be adapted to a distributed environment, say Hadoop?
5. Implications for Predictive Modeling
Structures suspected to be generated by such clustering processes can be modeled using hierarchical Bayes models and simulations (MCMC) to fit model parameters with data, assess goodness-of-fit with Zipf distribution, and then predict the evolution of the system - or at least some of its core properties such as mean. This might require adding a time dimension in the simulations (represented for instance by the iteration, in the above algorithm).
Comment
Dear all, I just registered to let you know that I have a simple model that explains Pareto distribution (and Zipf's type of law)
http://arxiv.org/abs/1409.4857
https://twitter.com/rperezmarco/status/524385601957933056
Best regards.
Note that Zipf's law is sometimes referred to as the thick-tail distribution, for instance in the context of keyword distribution, where a few thousands popular keywords dominate, and millions of keywords are relatively rarely used. Yet these millions of low-frequency keywords, when combined together, represent a significant proportion of the volume (keyword usage). This is a big contrast with the Gaussian (short-tail) distribution.
Michael: I think, but it is only an intuition, that all these power laws result from interactions between agents or particles. Data Scientists do not need to know these concepts but may find them useful
A good question is how many "data scientists" know what Ising, percolation, turbulence, and self-organized criticality are... Always wanted to write on the connections between physics and business to make it a little more apparent :)
This reminded me of my PhD thesis which involved Dielectruc relaxation. Informally consider a collection of say interacting magnets on a grid whose alignment is determined by the alignments of their neighbours (The Ising Model for example). If the environment (heat bath) flips magnets at random then at short times the decay towards an equilibrium is a power law and at long times it is a different power law. In between the decay is exponential.
At long times the power law arises from the exchange of spins and subclusters between clusters. The exact value of the power law exponent depends on the substance involved.
Hopefully I will get time to investigate this further
Whenever you see a power law, you should think about Percolations and their different incarnations (self-organized criticality, turbulence, and so forth): http://en.wikipedia.org/wiki/Percolation_theory
Thanks A.K. for all your efforts on this subject. If you can bring more light to this mysterious but widespread distribution, I'm sure you could write a seminal paper on this.
Hi Vincent. Think of the extreme case of u=1.0 v=0. Then the final distribution of the sizes of clusters from an initial cluster N would give the binary expansion of N. For example, if N was 1023 then there is exactly one cluster of size 512, 256, ... , 2 and 1. This is definitely a power decay. I think the v parameter (probability that non-similar clusters merge) determines if one gets exponential decay or Zipf distribution. I am curious as to why that is.
Atabey Kaygun posted a solution to this problem. Very interesting, in the sense that it produced exponential rank decay rather than power decay (Zipf's distribution) as expected. My own simulations (that did not lump big clusters together) resulted in the exact opposite solution, with almost linear decay.
I think Atabey's solution was too agressive in grouping clusters of similar sizes, grouping them only if both sizes were nearly identical, and using a very low v=0.001 (I would think v=0.300 is more realistic, but haven't checked so far). It would be interesting to see an algorithm that groups clusters A and B even if size of A is 3 or 4 times bigger than size of B. Indeed, that's what happens in practice with planetary collisions in the solar system billions years ago, or with company mergings.
© 2018 Data Science Central Powered by
Badges | Report an Issue | Privacy Policy | Terms of Service
You need to be a member of Data Science Central to add comments!
Join Data Science Central