Data science versus statistics to solve problems: a case study

In this article, I compare two approaches, with their advantages and drawbacks, to computing a simple metric: the number of unique visitors ("uniques") per year for a website. I use the words user and visitor interchangeably.

The problem seems straightforward at first glance, but it is not. It is a complex big data problem, because the naive approach involves sorting hundreds of billions of observations - called transactions or page views here. It is also complicated because there is no fully reliable way to identify and track a user over long time periods: cookies and IP address / browser combinations both have drawbacks. The task is much easier if the user is identified by an email address. Also, fake traffic needs to be detected and filtered out.

This metric is important for trending purposes and for assessing the value of a company such as Google or Facebook. A computed value that is 50% below the exact number can have terrible consequences for the company's valuation and its stock price. In addition, if numbers reported by competing companies are wrong (inflated), it might be important to correctly guess the inflation rate, so that you can apply it to your own company when reporting quarterly results to Wall Street analysts.

On the plus side, a rough estimate (say, within 10% of the true value) is good enough - except when that small error turns a tiny 2014-versus-2013 growth into a reported tiny decline.

Anyway, here are two approaches to computing the number of unique visitors: data science, and statistics.

1. The data science approach

I provide here two ways to solve the problem using data science. Let's assume that we have one billion transactions per day, corresponding to (say) 0.3 billion users per day. A transaction is defined here as one page view on the target website.

Data engineering

You sort the 366 billion logfile transactions by cookie or user ID. Assume that each day's data has been pre-sorted in the past. This step then consists of merging and de-duplicating the 366 pre-sorted daily data sets, say two at a time, to end up with 183 sorted files, each containing two days' worth of unique users. Repeat this step to end up with 92 sorted data sets, then 46, and so on, until the problem is solved. This can be done with MapReduce (e.g. Hadoop), where the yearly data is initially mapped onto 366 subsets (the *Map* step), and the merging/aggregation corresponds to the *Reduce* step. Note that merging and de-duplicating two pre-sorted data sets is easy and runs in O(n), not O(n log n) as a traditional sort does, where n is the number of transactions. (How to do it is a good job interview question for data scientists; the solution is easy to find on the web, for instance on stackexchange.com, or in my book.) The drawback of this approach is that it is more complicated than it needs to be; it's like killing a fly with a nuke.
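The single-pass merge/de-dupe of two pre-sorted lists mentioned above can be sketched in a few lines (a minimal sketch with toy user IDs; the function name is my own):

```python
def merge_dedupe(a, b):
    """Merge two pre-sorted lists of user IDs into one sorted,
    duplicate-free list in a single O(n) pass."""
    out = []
    i = j = 0
    while i < len(a) or j < len(b):
        # Take the smaller head element (or whatever list remains).
        if j == len(b) or (i < len(a) and a[i] <= b[j]):
            x = a[i]; i += 1
        else:
            x = b[j]; j += 1
        # Skip duplicates: only append if different from the last output.
        if not out or out[-1] != x:
            out.append(x)
    return out

day1 = [1, 3, 3, 7, 9]
day2 = [2, 3, 9, 10]
print(merge_dedupe(day1, day2))  # [1, 2, 3, 7, 9, 10]
```

Each pairwise merge touches every record once, which is why the whole tree of merges stays linear in the total number of transactions.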


Sampling

Sample N = 1,000 users (fewer than one out of every 10 million users) and extract all their transactions. Compute, in your sample, the proportion of users with exactly one page view over the 366-day time period. Denote this proportion as p(1). Likewise compute p(2) - the proportion of users with exactly two page views during the period - then p(3), p(4), and so on. Denote as Q the quantity Q = SUM{ n * p(n) } over all n = 1, 2, etc.; since the p(n) are proportions, Q is the average number of page views per user over the period. Let V be the total number of page views (transactions) over the time period in question; V is very easy to compute because page views - unlike unique users - form an additive metric. The number of uniques over the 366-day time period is then estimated as U = V / Q. Note that Q is an interesting quantity in its own right: it tells you how many page views, on average, a user generates in a given time period. If traffic is flat, it depends only on L, the duration of the time period in question, and it should be plotted for various values of L.

This way of solving the problem requires only a one-time analysis that can be performed in a couple of hours, despite the massive volume of data (petabytes), unlike the previous approach, which is more database-intensive and involves summarizing data every day. Comparing the results obtained with sampling against the previous (exact) solution is likely to show very little discrepancy - less than a 0.5% error.

Note 1: Sampling must be done over users, not page views (the latter results in biased statistics, as heavy users would be over-sampled). The easiest way is to have the user ID field stored as a sequence of successive integers (incremented by one for each new user), and to extract one out of every 10 or 20 million of these integers (representing users) with at least one page view in the time period in question. Then count page views for each of these users over the same period.
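The systematic 1-in-k sampling over sequential user IDs described in Note 1 can be sketched as follows (a minimal sketch; the function name and the random starting offset are my additions, not from the article):

```python
import random

def sample_user_ids(max_user_id, step=10_000_000, start=None):
    """Systematic 1-in-`step` sample over sequential integer user IDs.

    A random start within the first interval avoids always picking
    the same IDs; `step` is the sampling interval (here, one user
    out of every 10 million).
    """
    if start is None:
        start = random.randrange(step)
    return list(range(start, max_user_id, step))

# Toy scale: 1-in-10 sampling over 100 user IDs, fixed start for clarity.
print(sample_user_ids(100, step=10, start=3))
# [3, 13, 23, 33, 43, 53, 63, 73, 83, 93]
```

The sampled IDs would then be filtered to those with at least one page view in the period, and their page-view counts tallied.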

Note 2: Computing Q for various values of L (say 24 hours, one week, and four weeks) and extrapolating to one year provides yet another data science solution to this problem, using roughly 12 times less data than the sampling approach, albeit with less accuracy.

2. The statistical science approach

This approach uses statistical modeling (survival models, churn analysis, birth-and-death processes, Markov processes with states such as "new user", "active user", and "dead user"), as well as sampling, to better understand the mechanisms at play, and particularly the trends. Some advanced statistical models might even include events (event modeling) that significantly impacted user growth, such as a merger, or a change in the definition of a user (to include international users). I believe that this is too much modeling: the statistician will spend many days getting results that are no better than those obtained via the data science approach. It is worth the effort only if

  • there is an incentive to understand deeper characteristics of user growth or decline;
  • there is confidence in the model, in the sense that it will still be applicable in 2015 with only minor changes in parameters, perhaps detected via machine learning (a sub-field of data science, also overlapping with statistics);
  • the model easily adapts to other time frames of different lengths.
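The Markov-process idea above can be sketched in a few lines. Everything here is illustrative: the states follow the article, but the transition probabilities and sign-up counts are made-up placeholders, not estimates from real data.

```python
# States follow the article: "new", "active", "dead" (absorbing).
# Per-period transition probabilities are hypothetical placeholders:
# a new user becomes active with prob 0.6 or churns with 0.4;
# an active user stays active with 0.8 or churns with 0.2.
P = {
    "new":    {"new": 0.0, "active": 0.6, "dead": 0.4},
    "active": {"new": 0.0, "active": 0.8, "dead": 0.2},
    "dead":   {"new": 0.0, "active": 0.0, "dead": 1.0},
}

def step(counts, new_signups):
    """Advance the user base one period, then inject new sign-ups."""
    nxt = {"new": new_signups, "active": 0.0, "dead": 0.0}
    for state, n in counts.items():
        for target, p in P[state].items():
            nxt[target] += n * p
    return nxt

users = {"new": 1000.0, "active": 0.0, "dead": 0.0}
for _ in range(12):  # simulate 12 periods (e.g. months)
    users = step(users, new_signups=1000.0)
```

Fitting the transition probabilities to observed cohort data, and layering in covariates or one-off events, is where the real statistical effort (and the extra days of work the author mentions) would go.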


Tags: predictive modeling



Comment by Alec Zhixiao Lin on September 17, 2017 at 7:36am

I have been a statistician for over 15 years. If asked to approach the problem, I would choose the sampling method suggested in the article. This is not to claim that data scientists know nothing about sampling. I am not sure how survival analysis applies here, but I agree with the author that statistical modeling is meant for knowledge discovery: a statistical model needs to be explainable in a white-box manner.

Comment by Herbert L Roitblat on October 30, 2014 at 11:49am

Am I missing something?  Why would you sort those lists?  Why not use a hash table?  Whatever solution, you need to have a unique identifier for each unique visitor.  After that, you can count the number of days each unique visitor visits (unique visitor days), count the number of unique visitors over a time period, or anything else.
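[Editor's note: the commenter's hash-table idea can be made concrete with a minimal sketch. The trade-off is memory: a set must hold every distinct ID at once, which at the article's scale of billions of uniques is exactly why pre-sorted merges or sketches like HyperLogLog come up.]

```python
def count_uniques(stream):
    """One-pass distinct count via a hash set: O(n) time, but
    O(U) memory, where U is the number of distinct user IDs."""
    seen = set()
    for user_id in stream:
        seen.add(user_id)
    return len(seen)

# Toy page-view log keyed by user ID: 6 distinct visitors.
print(count_uniques([1, 3, 3, 7, 9, 2, 3, 9, 10]))  # 6
```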

Comment by Vincent Granville on October 30, 2014 at 8:02am

Here's one of the comments I posted on the LinkedIn version of this article:

I am an ex-statistician. It's not that my PhD in stats (computational statistics, 1993, image analysis) was revoked; I left the field myself when I found that most people calling themselves statisticians worked in narrowly defined sectors (government, surveys, clinical trials and related areas, banks/insurance), using very specific methods that I no longer use (ANOVA, p-values, GLM, logistic regression, etc.)

Many of the statistical methods that I still use (confidence intervals, random number generation) are so different from what I learned in grad school that it's like comparing apples with oranges. For instance, I use model-free confidence intervals, not underpinned by any statistical model.

When looking for a job, calling myself a statistician (more often than not) creates confusion among recruiters and hiring managers. Data scientist, machine learning guy, or analyst resonates better, as it fits the kind of work I'd be doing if hired. Ironically, I could now call myself a statistician again, as I will almost never look for a job again (being a successful, happy, stress-free entrepreneur). But statistics has evolved in one direction (at least the material you find in academia or at AMSTAT, in both research and training), and I in another, so it does not make sense for me to call myself a statistician.

Quality control, actuarial sciences, and operations research are more closely aligned with statistical science. Not sure why these professionals don't call themselves statisticians anymore, either.

Comment by Richard Meyer on October 30, 2014 at 7:32am

Question on Note 1 under the data science approach: wouldn't it be preferable to sample using a true simple random sampling method rather than the proposed systematic sampling method? Thanks.

Rich M.

Comment by Vincent Granville on October 29, 2014 at 8:28am

I used to do what is called "computational statistics" during my PhD years, back in 1990. But the term is no longer used (data science has replaced it, at least in the US).

Comment by abbas Shojaee on October 29, 2014 at 8:21am

I would like to suggest that "data science" can be considered a wide umbrella under which statistical analysis falls as one of many disciplines. I would prefer to distinguish them as statistical approaches versus computational approaches. This naming denotes that the first group is based on, and limited to, probability theory, while the other groups are not.

In my opinion, the disputes around statistics arise because, until the recent emergence of computational resources, there was a lack of a good substitute for statistics; as a result, it has been overused and applied in several scenarios it was not meant or built for. It is a kind of over-tooling of statistics, in the same way that some people try to use spreadsheets (i.e., Excel) for every data storage and processing scenario, where they are neither sufficient nor efficient.

I'd also like to add that, in the above article, sampling could be similar under the computational and statistical approaches. But I agree that for modeling purposes, using computational approaches rather than statistical ones can be less complex, less limited, faster, and more information-rich.

Comment by Vincent Granville on October 28, 2014 at 8:25am

Some statisticians claim that data scientists know nothing about sampling or experimental design. I wanted to show here that real data scientists do indeed know these techniques. They belong to statistics as well, although the statistical and data science versions can be quite different (model-free confidence intervals and no p-values in data science, versus distribution-based confidence intervals and statistical testing with p-values in statistical science).

Also, someone mentioned using HyperLogLog rather than sort, to solve this problem.
