Most of you will read this article to discover the most popular blogs, but the real purpose here is to show what goes wrong in many data science projects, even ones as simple as this one, and how it can easily be fixed. In the process, we created a new popularity score that is much more robust than the rankings used in similar articles (top bloggers, popular books, best websites, etc.). This score, based on a decay function, could be incorporated into recommendation engines.
Figure 1: Example of star-rating system
1. Introduction
The data covers almost three years' worth of DSC (Data Science Central) traffic: more than 50,000 posts and more than 6 million pageviews (almost half of them in 2014 alone), across our channels: DSC, Hadoop360, BigDataNews, and AnalyticBridge (but not AnalyticTalent).
I actually included 46 articles, but only the top 30 have a highly accurate rank. Some articles were filtered out because they belong to a cluster of similar articles (education, books, etc.). Finally, some very popular pages (in the top 5) are not included, either because the creation date is not available (the Visualization, Top Links, and Jobs tabs at the top of any page) or because they simply should not be listed (my own profile page, the sign-up page, the front page, etc.).
The new scoring model is described in the Scoring Engine section below. You will also find useful comments in the Interesting Insights section.
2. Top 30 DSC blogs
The number in parentheses represents the rank each article would have had under the standard methodology, rather than under our popularity score. The date represents when the blog was created. By looking at these two fields alone, you might be able to guess what our new scoring engine is about. If not, explanations are provided in the section below.
3. Interesting Insights
These top pages represent 21% of our traffic. The front page accounts for another 9%, and the top pages that were filtered out (for various reasons; see the introduction) account for a few more percentage points. Here are some of the highlights:
Figure 2: Pageview decay (or absence of decay!) for 4 top blogs listed above
4. New Scoring Engine
Let's say that you measure the pageview count for a specific article, and your time frame goes from t = t_0 to t = t_1. Models like this typically involve exponential decay of rate r, meaning that at time t, the pageview velocity is f(r, t) = exp(-rt). Under this model, the theoretical number of pageviews between t_0 and t_1 is simply
P = g(r, t_1) - g(r, t_0),
where g(r, t) = {1 - exp(-rt)} / r.
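For completeness, P is simply the integral of the velocity f(r, t) between t_0 and t_1, and the limit r → 0 recovers the no-decay case used below:

```latex
P = \int_{t_0}^{t_1} e^{-rt}\,dt
  = \frac{e^{-r t_0} - e^{-r t_1}}{r}
  = g(r, t_1) - g(r, t_0),
\qquad \lim_{r \to 0} g(r, t) = t.
```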
If t_0 is set to zero, then g(r, t_0) = 0, regardless of r. On a different note, the issue we try to address here (adjusting for time bias) is known as left- and right-censored data in statistical science: right-censored because we don't have data about the future, and left-censored because we don't have data prior to 2012.
To adjust for time bias, define the new popularity score as S = PV / P, where PV is the observed pageview count during the time period in question. When r = 0 (no noticeable decay, which is the case here) and t_0 = 0, then P = t_1, the time elapsed since the creation date. Note that the only two metrics required to compute the popularity score S for a specific article are the time elapsed since the creation date, and the pageview count during the time frame in question according to Google Analytics, after aggregating identical pages with different URL query strings.
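Here is a minimal sketch of the score in Python; the function names, pageview counts, and article ages are made up for illustration, not taken from the DSC data:

```python
from math import exp

def g(r, t):
    """Cumulative pageview curve for exponential decay of rate r.
    For r = 0 the limit is simply t."""
    if r == 0:
        return t
    return (1.0 - exp(-r * t)) / r

def popularity_score(pageviews, t0, t1, r=0.0):
    """Popularity score S = PV / P, where P = g(r, t1) - g(r, t0)
    is the theoretical pageview count between t0 and t1."""
    P = g(r, t1) - g(r, t0)
    return pageviews / P

# Hypothetical example: an article created 90 days ago with 12,000 pageviews
# versus an article created 900 days ago with 60,000 pageviews.
# With r = 0 and t0 = 0, the score is just pageviews per day of existence.
print(popularity_score(12_000, 0, 90))   # about 133 pageviews/day
print(popularity_score(60_000, 0, 900))  # about 67 pageviews/day
```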
Note: To make sure that we were not missing popular articles posted recently, we collected the data using two overlapping time frames: one data set for 2012-2014, and one just for 2014, using CSV exports from Google Analytics. Several articles that did not show up in the 2012-2014 data set (because their raw pageview count was below our threshold of about 10,000 pageviews) actually had top scores S when adjusted for time, and could only be found using the 2014 data. Another way to eliminate this issue is to get statistics for all articles (not just the ones with lots of traffic) for the whole time period. That is the automated approach, and in our case it would have required writing extra pieces of code, and possibly Google API calls, to download time stamps on Ning (via web crawling) and the entire Google Analytics data for the 50,000 articles; not worth the effort, especially since I allowed myself only a couple of hours to complete this project.
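For illustration, here is a rough sketch of how the two overlapping exports could be combined; the file names and column labels are assumptions, not the exact procedure used for this article:

```python
import pandas as pd
from urllib.parse import urlsplit

def load_export(path):
    # Assumed column names ("Page", "Pageviews"); adjust to match the
    # actual Google Analytics CSV export.
    df = pd.read_csv(path)
    # Aggregate identical pages that differ only by URL query string.
    df["Page"] = df["Page"].map(lambda u: urlsplit(u).path)
    return df.groupby("Page", as_index=False)["Pageviews"].sum()

full = load_export("dsc_2012_2014.csv")    # hypothetical file names
recent = load_export("dsc_2014_only.csv")

# Articles below the ~10,000 pageview threshold in the full export may
# still rank highly once adjusted for age, so take the union of both sets.
pages = pd.merge(full, recent, on="Page", how="outer",
                 suffixes=("_2012_2014", "_2014")).fillna(0)
```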
5. Good versus bad data science
Using the basic model with r = 0 (see section 4) makes a big difference compared with traditional rankings, as you can see in the list of top articles in section 2 (sorted according to our popularity score with r = 0). It allows you to detect what is becoming popular over time.
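To make the difference concrete, here is a tiny example with made-up articles showing how the ordering changes once pageviews are divided by article age:

```python
# Hypothetical articles: (title, pageviews in window, days since creation).
articles = [
    ("A", 60_000, 900),
    ("B", 30_000, 700),
    ("C", 12_000, 90),
]

# Traditional ranking: sort by raw pageview count.
by_pageviews = sorted(articles, key=lambda a: a[1], reverse=True)

# Adjusted ranking: sort by score S = pageviews / age (r = 0, t0 = 0).
by_score = sorted(articles, key=lambda a: a[1] / a[2], reverse=True)

print([a[0] for a in by_pageviews])  # ['A', 'B', 'C']
print([a[0] for a in by_score])      # ['C', 'A', 'B']
```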
This is what makes the difference between good and bad data science. Note that refining the model, estimating a different r for each article, testing the exponential decay assumption, and adjusting for growth would also be bad data science: it makes the time spent on this project prohibitive, makes the model prone to over-fitting, and may jeopardize the value (ROI) of the project.
Data science is not about perfectionism, but about delivering on time and within budget. In this case, spending one month on this project (or outsourcing it to people who work with me) would be time taken away from work that could yield far more value than the small incremental gain obtained by seeking perfection. Yet ignoring the decay is equally bad; it would make the whole project worthless. The data scientist must instinctively find the level of perfection needed in his or her models. Data is always imperfect anyway.
6. Next steps
One interesting project would be to group pages by category and aggregate popularity scores, perhaps creating popularity scores for entire categories. Indeed, Nikita Nikitinsky has been working on this problem, indirectly: it was his project during his data science apprenticeship (DSA). We will soon publish the results and announce his successful completion of the DSA (see applied project #3). He is the first candidate to complete the DSA, besides our intern Livan (who worked on a number of projects, including our Twitter app to detect top data scientist profiles) and the winner of the Jackknife competition.
Other potential improvements include:
Another area of research is to understand why webpage pageview counts so closely follow a Zipf distribution.
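A quick way to probe this would be a rank-versus-pageviews fit on a log-log scale; the sketch below uses made-up counts and assumes a power law of the form C / rank^alpha:

```python
import numpy as np

def zipf_exponent(pageviews):
    """Fit pageviews ~ C / rank**alpha by least squares on a log-log scale.
    Returns the estimated exponent alpha; values near 1 suggest a
    Zipf-like distribution."""
    counts = np.sort(np.asarray(pageviews, dtype=float))[::-1]
    ranks = np.arange(1, len(counts) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(counts), 1)
    return -slope

# Hypothetical pageview counts for a handful of articles.
print(zipf_exponent([50_000, 24_000, 16_500, 12_000, 10_200, 8_300]))
```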
Related Links
The articles below come with detailed explanations about the (sound) methodology used to obtain the results.