We discussed Zipf's law last year, as well as in one of our recent weekly challenges: this statistical distribution is a good model for many physical and man-made phenomena. In this article, we publish a formula to estimate the traffic volume of any website, including clusters of websites, based on Alexa rankings and leveraging Zipf's distribution. It is a useful resource for digital marketers and advertisers, as well as for data scientists who may want to debate our methodology.
Our formula applies to web domains with Alexa rankings between 5,000 and 20,000,000; it has not been tested on the most popular websites (rank below 5,000). Our analysis is based on the raw data shown in figure 1. Gaps (missing data) were filled using a two-step data science procedure (see the spreadsheet on our members-only page for details). The conclusion is that our adjusted Alexa rank, for US traffic, across all our websites combined, is about 16,000, versus 20,000 for our main website.
More interestingly, the formula for traffic volume (PV, for page views) of a website as a function of Alexa rank (R, denoted as R or Rank in the spreadsheet and figures posted here) over the last 30 days is
PV(last 30 days) = a * R^b,
with a = 1 billion, and b = -0.746.
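The formula is easy to turn into a one-line function. The sketch below evaluates PV = a * R^b with the constants reported above; the example rank of 20,000 is the one mentioned for our main website.

```python
def page_views(rank: float, a: float = 1e9, b: float = -0.746) -> float:
    """Estimated page views over the last 30 days for a given Alexa rank,
    using the power-law model PV = a * R^b with the article's constants."""
    return a * rank ** b

# Example: a site with Alexa rank 20,000 (roughly our main website)
print(round(page_views(20_000)))
```

Since b is negative, estimated traffic decreases as the rank number grows, as expected.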
Attempts to make predictions or inferences about US traffic rank were unsuccessful, as the Alexa data is too noisy at that level of granularity. But at the global level, the model fits very well, with an R-squared above 0.96, both on the raw data and on a log-log chart [using log(y) and log(x) instead of y and x; see figure 3 or the Excel spreadsheet].
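Fitting a power model on a log-log scale reduces to simple linear regression, since log(PV) = log(a) + b*log(R). The sketch below shows the mechanics; the (rank, page views) pairs are made-up illustration data, not the article's dataset, so the fitted constants will differ from those reported above.

```python
import math

# Hypothetical (Alexa rank, monthly page views) pairs for illustration only
ranks = [5_000, 20_000, 100_000, 1_000_000, 5_000_000]
pv    = [2_800_000, 620_000, 190_000, 33_000, 9_500]

# Linear least squares on the log-log scale: log(PV) = log(a) + b*log(R)
xs = [math.log(r) for r in ranks]
ys = [math.log(v) for v in pv]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
log_a = my - b * mx

# R-squared of the fit, computed on the log-log scale
ss_res = sum((y - (log_a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r2 = 1 - ss_res / ss_tot

print(f"a = {math.exp(log_a):.3g}, b = {b:.3f}, R^2 = {r2:.3f}")
```

This is the same computation Excel performs when you add a power trendline to a chart.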
If you download the spreadsheet, you will see that this analysis was done with a quick turnaround (less than one hour): we first filled the gaps (missing data, denoted as n/a in the spreadsheet) that were easy to estimate with robust statistics, then re-calibrated the model to estimate the leftover missing data.
In our analysis, we gathered Alexa rankings (US and global) for several websites that we own, ranging in rank from below 20,000 to above 5,000,000, and matched this data against Google Analytics traffic statistics (unique visitors, page views, and percentage of US traffic based on sessions). While Alexa claims that its rankings are based on both unique visitors and page views, we believe they are mostly based on page views -- an additive metric, unlike uniques. For instance, both Google Analytics and Alexa fail to filter out easy-to-detect non-human traffic from India (especially from the state of Telangana) that has five times more pages per session, and sessions that are ten times shorter, than average. This is compounded by the fact that sessions with just one page view are assigned a duration of zero seconds in Google Analytics. This creates a bias in Alexa's US ranking statistics, visible across many websites, not just those that we own.
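The "easy-to-detect" part can be made concrete. The sketch below flags sessions using the two signals just described (pages per session far above average, duration far below average); the site-wide averages, the thresholds, and the sample sessions are hypothetical, for illustration only.

```python
# Assumed site-wide averages (hypothetical values)
SITE_AVG_PAGES = 2.0
SITE_AVG_SECONDS = 120.0

def looks_like_bot(pages: int, seconds: float) -> bool:
    """Flag a session as bot-like when it has at least 5x the average
    pages per session AND at most 1/10 of the average duration."""
    return pages >= 5 * SITE_AVG_PAGES and seconds <= SITE_AVG_SECONDS / 10

sessions = [
    {"pages": 2, "seconds": 95},   # typical human session
    {"pages": 14, "seconds": 8},   # bot-like: many pages, very short
    {"pages": 1, "seconds": 0},    # single page view: GA reports 0 seconds
]
flagged = [s for s in sessions if looks_like_bot(s["pages"], s["seconds"])]
print(len(flagged))
```

Note that the single-page session is not flagged even though its reported duration is zero; that is exactly the Google Analytics artifact mentioned above, and one reason duration alone is a poor bot signal.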
Note that page views is not a robust metric, as many publishers slice their articles into multiple pages to inflate page view counts.
Next steps: it would be interesting to compare the Zipf model with an exponential distribution. It was obvious in this case that a power model (equivalent to Zipf in Excel) works much better than an exponential model, though we did not perform cross-validation, being convinced of the superiority of the power model based on our experience. Too many data scientists waste time comparing models to squeeze out less than a 0.5% gain over the baseline; we wanted to avoid that waste in this experiment.
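For readers who do want to run the comparison, both models reduce to a line fit after a log transform: the power model PV = a * R^b is linear in log(PV) versus log(R), while the exponential model PV = a * exp(b*R) is linear in log(PV) versus R. The sketch below compares the two by R-squared on hypothetical illustration data (not the article's dataset).

```python
import math

# Hypothetical (Alexa rank, monthly page views) pairs for illustration only
ranks = [5_000, 20_000, 100_000, 1_000_000, 5_000_000]
pv    = [2_800_000, 620_000, 190_000, 33_000, 9_500]

def r_squared(xs, ys):
    """R^2 of a simple least-squares line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

log_pv = [math.log(v) for v in pv]
power_r2 = r_squared([math.log(r) for r in ranks], log_pv)  # PV = a * R^b
exp_r2 = r_squared(list(ranks), log_pv)                     # PV = a * exp(b*R)

print(f"power R^2 = {power_r2:.3f}, exponential R^2 = {exp_r2:.3f}")
```

On data spanning several orders of magnitude in rank, the exponential model cannot keep up with the power model, which is consistent with the claim above.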