While looking at the 2000+ most popular live articles on DSC, based on Google Analytics numbers, we found an interesting pattern.

The bottom 50% of these articles, those with pageview counts between 145 and 500 (the least popular of the most popular group), show gaps in the pageview distribution; see below. The gaps are also widespread in the other half, the top 50%.

Our intern Livan is investigating this as part of our new growth hacking strategy, to be deployed soon (details to be published soon; the project involves categorizing popular articles, identifying time-sensitive articles such as events, and much more).

This study covered 2,000+ articles totaling more than 3 million page views, out of more than 40,000 live articles.

If you look at the above figure, 38 articles had 145 pageviews and 71 articles had 161 pageviews, but none had a pageview count strictly between 145 and 161. Note that the gap between consecutive values in the distribution (161 - 145 = 16, 177 - 161 = 16) is always equal to 16. Any idea why this is happening?
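For anyone who wants to check the same pattern on their own export, here is a minimal sketch in Python. It assumes the pageview counts are already loaded into a plain list; the function name `distribution_gaps` is made up for illustration, and the five articles at 177 pageviews are invented, since only the counts at 145 and 161 are quoted above.

```python
from collections import Counter

def distribution_gaps(pageviews):
    """Return (articles per distinct pageview count, gaps between consecutive distinct counts)."""
    counts = Counter(pageviews)
    distinct = sorted(counts)
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return counts, gaps

# Toy data: 38 articles at 145 and 71 at 161, as quoted above.
# The 5 articles at 177 are an assumption made up for this example.
sample = [145] * 38 + [161] * 71 + [177] * 5
counts, gaps = distribution_gaps(sample)
print(sorted(counts.items()))
print(gaps)
```

If the gaps list contains almost nothing but 16 (or 16 and 17), the values are sitting on a regular grid rather than being raw counts.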

We've found other oddities in Google Analytics reporting, such as the fact that all sessions with only one pageview last zero seconds (see section 3 after clicking on the link, where we propose a solution to this issue).


Replies to This Discussion

I apologize in advance if this sounds really naive of me (I have not used Google Analytics before), but this looks to me like a histogram. So wouldn't this be normal? What I find funny, however, is that the bin size is not exactly 16 but more like 16.1 (a couple of bins are apparently 17 wide), with the values then rounded down (for display, I hope).
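To make that guess concrete, here is a small Python sketch of what bins of width 16.1, rounded down, would look like. The starting value of 145 and the width of 16.1 are assumptions taken from the numbers in this thread, not anything Google Analytics documents, and `bin_edges` is a hypothetical helper.

```python
def bin_edges(start, width_tenths, n):
    """Lower edges of n bins of width width_tenths / 10, rounded down to integers."""
    # Work in tenths so the result does not depend on floating-point rounding.
    return [(start * 10 + k * width_tenths) // 10 for k in range(n)]

# Assumed values: first edge at 145, bin width 16.1 (161 tenths).
edges = bin_edges(start=145, width_tenths=161, n=12)
widths = [b - a for a, b in zip(edges, edges[1:])]
print(edges)
print(widths)
```

Under these assumptions the first edges are 145, 161 and 177, and the spacing is 16 except for an occasional 17, which is the kind of pattern described above.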

Anyway, I wish I could be more enlightening here.

This is what the raw data coming from Google Analytics looks like (top 15 entries or so):

Title url Date Pageviews
DSC Digest and Membership - Big Data, Analytics, Visualization and DataScience http://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-... 20-Jun-13 121817
20 short tutorials all data scientists should read (and practice) http://www.datasciencecentral.com/profiles/blogs/17-short-tutorials... 15-Feb-14 88148
66 job interview questions for data scientists http://www.datasciencecentral.com/profiles/blogs/66-job-interview-q... 13-Feb-13 67000
What your state is the worst at - United States of shame | Flowingdata http://www.analyticbridge.com/profiles/blogs/what-your-state-is-the... 31-Jan-11 57150
My Data Science Book - Table of Contents http://www.datasciencecentral.com/profiles/blogs/my-data-science-book 23-Nov-13 44549
Update about our Data Science Apprenticeship - March 10, 2014 http://www.datasciencecentral.com/group/data-science-apprenticeship... 10-Mar-13 41265
Data Scientist Core Skills http://www.datasciencecentral.com/profiles/blogs/data-scientist-cor... 27-Aug-13 36550
Data scientists making $300,000 a year | Wall Street Journal. http://www.datasciencecentral.com/profiles/blogs/data-scientists-ma... 29-Nov-12 28390
One Page R: A Survival Guide to Data Science with R http://www.datasciencecentral.com/profiles/blogs/one-page-r-a-survi... 14-Feb-14 26555
Six categories of Data Scientists http://www.datasciencecentral.com/profiles/blogs/six-categories-of-... 16-Jan-14 26459
Batch vs. Real Time Data Processing http://www.datasciencecentral.com/profiles/blogs/batch-vs-real-time... 13-Aug-13 22902
10 types of regressions. Which one to use? http://www.datasciencecentral.com/profiles/blogs/10-types-of-regres... 21-Jul-14 22709
Data Science Cheat Sheet http://www.datasciencecentral.com/profiles/blogs/data-science-cheat... 1-Aug-14 22451
The 8 worst predictive modeling techniques http://www.analyticbridge.com/profiles/blogs/the-8-worst-predictive... 23-Sep-12 22178
Proposal for an Apprenticeship in Data Science http://www.datasciencecentral.com/profiles/blogs/proposal-for-an-ap... 25-Oct-12 20391
The curse of big data http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data 5-Jan-13 20263

I guess we would need to look at the raw data in the 145 to 500 pageviews range to see whether my guess is anywhere near what's happening.

Here's the bottom:

What Are Analytic Marketplaces? http://www.datasciencecentral.com/profiles/blogs/what-are-analytic-... 16-Sep-14 161
When to hire a data scientist: before or after a big crisis? http://www.datasciencecentral.com/profiles/blogs/when-to-hire-a-dat... 23-Apr-12 161
Will The Buy The Dip Mentality Work In 2014? http://www.analyticbridge.com/profiles/blogs/will-the-buy-the-dip-m... 9-Jan-14 161
DSC Webinar Series: Hadoop Automation: Eliminate Data Bottlenecks http://www.datasciencecentral.com/video/dsc-webinar-series-hadoop-a... 10-Jun-14 161
The Analytic Career Part 1 of 2 http://www.datasciencecentral.com/video/the-analytic-career-part-1-... 18-Aug-14 161
Best Practices in SAS Statistical Programming for Regulatory Submission http://www.analyticbridge.com/forum/topics/2004291:Topic:13277 9-May-08 145
Web mining vs. web analytics / advanced web analytics http://www.analyticbridge.com/forum/topics/2004291:Topic:20302 12-Aug-08 145
eCPM, CPM, CPC http://www.analyticbridge.com/forum/topics/2004291:Topic:21754?comm... 18-Aug-08 145
Alexa versus TrafficEstimate.com to quantify website traffic http://www.analyticbridge.com/forum/topics/alexa-versus-trafficesti... 13-Dec-13 145
Analytics tools in Healthcare Industry http://www.datasciencecentral.com/forum/topics/analytics-tools-in-h... 24-Feb-14 145
Analytics use cases and data sets http://www.analyticbridge.com/forum/topics/analytics-use-cases-and-... 11-Apr-14 145
Best Free Software http://www.analyticbridge.com/forum/topics/best-free-software 27-Nov-08 145
Carnival Corp.: Predictive entropic analysis http://www.datasciencecentral.com/forum/topics/carnival-corp-predic... 16-Jan-12 145
Does BIG DATA give VALUE? especially for HEALTHCARE applications? what isimportant ........ http://www.analyticbridge.com/forum/topics/does-big-data-give-value... 1-Nov-13 145
Financial instruments price prediction http://www.analyticbridge.com/forum/topics/financial-instruments-pr... 14-Jun-13 145
Has link analysis rendered association rules redundant? http://www.analyticbridge.com/forum/topics/has-link-analysis-render... 12-Apr-13 145
How do you measure the popularity of an article or blog post? http://www.datasciencecentral.com/forum/topics/how-do-you-measure-t... 12-Jan-13 145
Money laundering, counterfeit money: possible solution http://www.datasciencecentral.com/forum/topics/money-laundering-cou... 10-May-14 145
Plane crash data http://www.analyticbridge.com/forum/topics/plane-crash-data 24-Mar-09 145
Response rate forecasting http://www.analyticbridge.com/forum/topics/response-rate-forecasting-1 14-Nov-09 145
SAS enterprise miner: forecast suggestions and advice? http://www.analyticbridge.com/forum/topics/sas-enterprise-miner-fro... 15-Nov-13 145
Survey: What are the actual challenges and opportunities data scientists face? http://www.datasciencecentral.com/forum/topics/survey-what-are-the-... 19-Nov-14 145
Using a Hurdle Model in Credit Scoring http://www.analyticbridge.com/forum/topics/using-a-hurdle-model-in-... 20-Jun-13 145
Very geeky stuff... http://www.datasciencecentral.com/forum/topics/very-geeky-stuff 20-May-14 145
What do you think of BigML? http://www.analyticbridge.com/forum/topics/what-do-you-think-of-bigml 22-Nov-14 145
Who pays traffic violations for driverless cars? http://www.datasciencecentral.com/forum/topics/who-pays-traffic-vio... 29-Sep-12 145
Will robots take control of the humans? http://www.datasciencecentral.com/forum/topics/will-robots-take-con... 15-Oct-14 145
An Exclusive Interview with Data Expert, John Bottega http://www.datasciencecentral.com/group/announcements/forum/topics/... 10-Jun-14 145
Big Data: Planning for the Future (Conference) http://www.datasciencecentral.com/group/announcements/forum/topics/... 15-Jul-14 145
e-Book: Machine Learning and Recommendation Engine http://www.datasciencecentral.com/group/announcements/forum/topics/... 2-May-14 145
Hadoop Deployment Best Practices: Scalability, Robustness, Flexibility http://www.datasciencecentral.com/group/announcements/forum/topics/... 14-Aug-14 145
How to Build Dashboards That Persuade, Inform and Engage - Tableau http://www.datasciencecentral.com/group/announcements/forum/topics/... 12-Jun-14 145
Learn How to be Data Driven http://www.datasciencecentral.com/group/announcements/forum/topics/... 12-Nov-14 145
Live webinar: The Art and Science of Data Visualisation http://www.datasciencecentral.com/group/announcements/forum/topics/... 23-Sep-14 145
Selecting the right text analytics tool for the job http://www.datasciencecentral.com/group/announcements/forum/topics/... 14-Oct-14 145
This Month in Data Science: October 2014 http://www.datasciencecentral.com/group/announcements/forum/topics/... 5-Nov-14 145
Uplift Modeling: Super Hot Topic at Predictive Analytics World Boston http://www.datasciencecentral.com/group/announcements/forum/topics/... 12-Sep-14 145
White Paper: Retail Analytics in Finance Industry http://www.datasciencecentral.com/group/announcements/forum/topics/... 22-Oct-14 145
Fast clustering algorithms for massive datasets http://www.datasciencecentral.com/group/research/forum/topics/fast-... 26-May-14 145
Big Data & Haute Cuisine http://www.datasciencecentral.com/profiles/blog/show?id=6448529:Blo... 3-Dec-14 145
Data Science Meets Bubbly: What Data Says About Champagne Buying Patterns http://www.datasciencecentral.com/profiles/blog/show?id=6448529:Blo... 24-Dec-14 145
10 tips for working with Hadoops | Cloudera http://www.analyticbridge.com/profiles/blogs/10-tips-for-working-wi... 9-Dec-11 145
1010data's Unique Big Data Analytics Platform Sees Stunning Growth in 2011 http://www.analyticbridge.com/profiles/blogs/1010data-s-unique-big-... 3-Jan-12 145

It does look like Google Analytics is feeding you histogram bins rather than the actual view counts for the lower pageviews. It could be some type of data reduction that they apply. Maybe it has to do with DB access as well.

As a matter of fact, it could be the same histogram for all pageviews: Google Analytics would just drop the empty bins, and with a small bin size of 16 you are likely to find only one article per bin at very high pageviews.

A search for duplicates in the pageviews column could help confirm this, as the number of duplicates should decrease with increasing pageviews.
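A rough way to run that duplicate check, sketched in Python. The function `duplicate_share` is a hypothetical helper, and the input is just the values visible in the two listings above (six of the top entries, plus the 5 articles at 161 and the 38 at 145), not the full export.

```python
from collections import Counter

def duplicate_share(pageviews, low, high):
    """Fraction of articles in [low, high) whose pageview count is shared with another article."""
    in_range = [v for v in pageviews if low <= v < high]
    if not in_range:
        return 0.0
    counts = Counter(in_range)
    shared = sum(c for c in counts.values() if c > 1)
    return shared / len(in_range)

# Values copied from the listings above; this is only a slice of the real data.
views = [121817, 88148, 67000, 57150, 44549, 41265] + [161] * 5 + [145] * 38
print(duplicate_share(views, 100, 1000))      # 1.0
print(duplicate_share(views, 20000, 200000))  # 0.0
```

On the full export, one would compute this share over several pageview ranges and see whether it falls off as the counts grow.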

As for why GA would do that, my guess is that it is based on the observation that most of the webpages they analyse have only a small number of pageviews. Presenting the data as a histogram with a small bin size would reduce the number of values to store while still keeping high accuracy for the most important results (high pageviews).

Benjamin, there's probably some binning or bucketing involved. You can request the counts for any data range, and it will return the results for the top 5,000 pages in real time, in about one minute.

I wonder if you are bumping into the GA limits for free accounts:

"If you exceed 10 million hits per month per property, there is no assurance that the excess hits will be processed. There are additional limits for specific client libraries."

Hi Greg, and nice to hear back from you! I think you've found the right explanation.

Happy New Year Vincent. Nice to connect again.

Hi Vincent.

Regarding the gaps: it's either the 10 million hits a month problem, or it may be sampling (Google uses this for queries that contain more than ~250k sessions). If it's sampling, you will see a yellow message above the report. Try limiting queries to smaller time frames and then merging the data.
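The "smaller time frames, then merge" approach could look roughly like this in Python. This is only a sketch: `fetch` is a hypothetical callable standing in for whatever reporting query you actually run, the 7-day window is an arbitrary choice, and `fake_fetch` exists only so the example runs.

```python
from collections import Counter
from datetime import date, timedelta

def date_windows(start, end, days):
    """Split [start, end] into consecutive inclusive windows of at most `days` days."""
    windows = []
    cur = start
    while cur <= end:
        last = min(cur + timedelta(days=days - 1), end)
        windows.append((cur, last))
        cur = last + timedelta(days=1)
    return windows

def merged_pageviews(fetch, start, end, days=7):
    """Sum per-URL pageviews over small windows.

    `fetch(first, last)` is a hypothetical callable returning {url: pageviews}
    for one window; it is not a real Google Analytics API call.
    """
    total = Counter()
    for first, last in date_windows(start, end, days):
        total.update(fetch(first, last))  # Counter.update adds counts per key
    return dict(total)

# Stand-in for a real query: one pageview per day for a single made-up URL.
def fake_fetch(first, last):
    return {"/example": (last - first).days + 1}

windows = date_windows(date(2014, 12, 1), date(2014, 12, 17), 7)
merged = merged_pageviews(fake_fetch, date(2014, 12, 1), date(2014, 12, 17))
print(windows)
print(merged)
```

Summing works here because pageviews are additive across days; a metric such as unique visitors could not be merged this way.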

About the 0 seconds VisitDuration for one-page visits: it's default GA behaviour. When a person loads the first page, a hit is sent and the clock starts to tick. When the person goes to a second page, another hit is sent. VisitDuration is calculated as the difference between the timestamps of these two events. If the second hit is never sent, GA can't calculate VisitDuration.
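In code, the calculation described above amounts to something like the following Python sketch. The function name and the timestamps are made up for illustration, and this is a simplified model of the behaviour, not GA's actual implementation.

```python
from datetime import datetime

def visit_duration_seconds(hit_times):
    """Duration as described above: time between the first and last hit of a session."""
    if len(hit_times) < 2:
        return 0  # a lone hit has no later timestamp to subtract from
    return int((max(hit_times) - min(hit_times)).total_seconds())

# Made-up timestamps: one session with a single pageview, one with two.
one_page = [datetime(2015, 1, 5, 9, 0, 0)]
two_pages = [datetime(2015, 1, 5, 9, 0, 0), datetime(2015, 1, 5, 9, 2, 30)]
print(visit_duration_seconds(one_page))   # 0
print(visit_duration_seconds(two_pages))  # 150
```

Under this model, the time spent reading the last page of any session is never counted, because no later hit marks its end.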

Hope this helps.

Looking forward to seeing the results of your new strategy.

Best regards,

Ivan

It really looks like you've got sampled data. In my experience, this happens when your GA request returns too many rows or when it requires data more than 2 years old (smaller requests will not help in that case). But there should be a warning message about it; at least I get one in the RGA package.

Eduard, this was my feeling as well and that is what I tried to convey (probably not that well) in my previous comments.
