While looking at the 2000+ most popular live articles on DSC, based on Google Analytics numbers, we found an interesting pattern.
The bottom 50% of these articles, those with pageview counts between 145 and 500 (the least popular of the most popular group), show gaps in the pageview distribution (see below). The gaps are also widespread in the other half, the top 50%.
Our intern Livan is investigating this as part of our new growth hacking strategy, to be deployed soon (details to be published; the project involves categorizing popular articles, identifying time-sensitive articles such as events, and much more).
This study covered 2,000+ articles totaling more than 3 million page views, out of more than 40,000 live articles.
If you look at the above figure, 38 articles had 145 pageviews, 71 articles had 161 pageviews, but none had a pageview count between 145 and 161. Note that the gap in the distribution (161-145 = 16, 177-161 = 16) is always equal to 16. Any idea why this is happening?
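One way to check the spacing described above is to list the distinct pageview values and take the differences between neighbours. The sketch below is only an illustration: the function name `distinct_gaps` is made up here, and the sample uses the counts quoted in the text (38 articles at 145, 71 at 161) plus a single article at 177, whose frequency is an assumption.

```python
def distinct_gaps(pageviews):
    """Return the differences between consecutive distinct pageview values."""
    values = sorted(set(pageviews))
    return [b - a for a, b in zip(values, values[1:])]


# Counts for 145 and 161 come from the text; the single 177 entry is assumed.
sample = [145] * 38 + [161] * 71 + [177]

print(distinct_gaps(sample))  # [16, 16]
```

Running the same function on the full export would show whether every gap is 16 or whether some are wider.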
We've found other oddities in Google Analytics reporting, such as the fact that all sessions with only one pageview last zero seconds (see section 3 after clicking on the link, where we propose a solution to this issue).
I apologize in advance if this sounds naive (I have not used Google Analytics before), but to me this looks like a histogram. So wouldn't this be normal? What I find funny, however, is that the bin size is not exactly 16 but more like 16.1 (apparently a couple of bins are 17 wide), and then rounded down (for display, I hope).
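The 16.1 guess can be checked with a little arithmetic. The sketch below assumes, purely as a hypothesis, that the reported values are small integer counts multiplied by a constant factor and then rounded to the nearest integer; the function name `scaled_counts` and the underlying counts 9, 10 and 11 are illustrative and not taken from Google Analytics.

```python
def scaled_counts(raw_counts, factor):
    """Multiply each raw count by a constant factor and round to the nearest integer."""
    return [round(k * factor) for k in raw_counts]


# Hypothetical underlying counts 9, 10, 11 with an assumed factor of 16.1.
print(scaled_counts([9, 10, 11], 16.1))  # [145, 161, 177]
```

With a factor of exactly 16.1, rounding to nearest reproduces the observed values; rounding down would need a slightly larger factor to do the same, so this does not settle which rounding is used.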
Anyway, I wish I could be more enlightening here.
This is what the raw data coming from Google Analytics looks like (top 15 entries or so):
|DSC Digest and Membership - Big Data, Analytics, Visualization and DataScience||http://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-...||20-Jun-13||121817|
|20 short tutorials all data scientists should read (and practice)||http://www.datasciencecentral.com/profiles/blogs/17-short-tutorials...||15-Feb-14||88148|
|66 job interview questions for data scientists||http://www.datasciencecentral.com/profiles/blogs/66-job-interview-q...||13-Feb-13||67000|
|What your state is the worst at - United States of shame | Flowingdata||http://www.analyticbridge.com/profiles/blogs/what-your-state-is-the...||31-Jan-11||57150|
|My Data Science Book - Table of Contents||http://www.datasciencecentral.com/profiles/blogs/my-data-science-book||23-Nov-13||44549|
|Update about our Data Science Apprenticeship - March 10, 2014||http://www.datasciencecentral.com/group/data-science-apprenticeship...||10-Mar-13||41265|
|Data Scientist Core Skills||http://www.datasciencecentral.com/profiles/blogs/data-scientist-cor...||27-Aug-13||36550|
|Data scientists making $300,000 a year | Wall Street Journal.||http://www.datasciencecentral.com/profiles/blogs/data-scientists-ma...||29-Nov-12||28390|
|One Page R: A Survival Guide to Data Science with R||http://www.datasciencecentral.com/profiles/blogs/one-page-r-a-survi...||14-Feb-14||26555|
|Six categories of Data Scientists||http://www.datasciencecentral.com/profiles/blogs/six-categories-of-...||16-Jan-14||26459|
|Batch vs. Real Time Data Processing||http://www.datasciencecentral.com/profiles/blogs/batch-vs-real-time...||13-Aug-13||22902|
|10 types of regressions. Which one to use?||http://www.datasciencecentral.com/profiles/blogs/10-types-of-regres...||21-Jul-14||22709|
|Data Science Cheat Sheet||http://www.datasciencecentral.com/profiles/blogs/data-science-cheat...||1-Aug-14||22451|
|The 8 worst predictive modeling techniques||http://www.analyticbridge.com/profiles/blogs/the-8-worst-predictive...||23-Sep-12||22178|
|Proposal for an Apprenticeship in Data Science||http://www.datasciencecentral.com/profiles/blogs/proposal-for-an-ap...||25-Oct-12||20391|
|The curse of big data||http://www.analyticbridge.com/profiles/blogs/the-curse-of-big-data||5-Jan-13||20263|
Here's the bottom:
|What Are Analytic Marketplaces?||http://www.datasciencecentral.com/profiles/blogs/what-are-analytic-...||16-Sep-14||161|
|When to hire a data scientist: before or after a big crisis?||http://www.datasciencecentral.com/profiles/blogs/when-to-hire-a-dat...||23-Apr-12||161|
|Will The Buy The Dip Mentality Work In 2014?||http://www.analyticbridge.com/profiles/blogs/will-the-buy-the-dip-m...||9-Jan-14||161|
|DSC Webinar Series: Hadoop Automation: Eliminate Data Bottlenecks||http://www.datasciencecentral.com/video/dsc-webinar-series-hadoop-a...||10-Jun-14||161|
|The Analytic Career Part 1 of 2||http://www.datasciencecentral.com/video/the-analytic-career-part-1-...||18-Aug-14||161|
|Best Practices in SAS Statistical Programming for Regulatory Submission||http://www.analyticbridge.com/forum/topics/2004291:Topic:13277||9-May-08||145|
|Web mining vs. web analytics / advanced web analytics||http://www.analyticbridge.com/forum/topics/2004291:Topic:20302||12-Aug-08||145|
|eCPM, CPM, CPC||http://www.analyticbridge.com/forum/topics/2004291:Topic:21754?comm...||18-Aug-08||145|
|Alexa versus TrafficEstimate.com to quantify website traffic||http://www.analyticbridge.com/forum/topics/alexa-versus-trafficesti...||13-Dec-13||145|
|Analytics tools in Healthcare Industry||http://www.datasciencecentral.com/forum/topics/analytics-tools-in-h...||24-Feb-14||145|
|Analytics use cases and data sets||http://www.analyticbridge.com/forum/topics/analytics-use-cases-and-...||11-Apr-14||145|
|Best Free Software||http://www.analyticbridge.com/forum/topics/best-free-software||27-Nov-08||145|
|Carnival Corp.: Predictive entropic analysis||http://www.datasciencecentral.com/forum/topics/carnival-corp-predic...||16-Jan-12||145|
|Does BIG DATA give VALUE? especially for HEALTHCARE applications? what isimportant ........||http://www.analyticbridge.com/forum/topics/does-big-data-give-value...||1-Nov-13||145|
|Financial instruments price prediction||http://www.analyticbridge.com/forum/topics/financial-instruments-pr...||14-Jun-13||145|
|Has link analysis rendered association rules redundant?||http://www.analyticbridge.com/forum/topics/has-link-analysis-render...||12-Apr-13||145|
|How do you measure the popularity of an article or blog post?||http://www.datasciencecentral.com/forum/topics/how-do-you-measure-t...||12-Jan-13||145|
|Money laundering, counterfeit money: possible solution||http://www.datasciencecentral.com/forum/topics/money-laundering-cou...||10-May-14||145|
|Plane crash data||http://www.analyticbridge.com/forum/topics/plane-crash-data||24-Mar-09||145|
|Response rate forecasting||http://www.analyticbridge.com/forum/topics/response-rate-forecasting-1||14-Nov-09||145|
|SAS enterprise miner: forecast suggestions and advice?||http://www.analyticbridge.com/forum/topics/sas-enterprise-miner-fro...||15-Nov-13||145|
|Survey: What are the actual challenges and opportunities data scientists face?||http://www.datasciencecentral.com/forum/topics/survey-what-are-the-...||19-Nov-14||145|
|Using a Hurdle Model in Credit Scoring||http://www.analyticbridge.com/forum/topics/using-a-hurdle-model-in-...||20-Jun-13||145|
|Very geeky stuff...||http://www.datasciencecentral.com/forum/topics/very-geeky-stuff||20-May-14||145|
|What do you think of BigML?||http://www.analyticbridge.com/forum/topics/what-do-you-think-of-bigml||22-Nov-14||145|
|Who pays traffic violations for driverless cars?||http://www.datasciencecentral.com/forum/topics/who-pays-traffic-vio...||29-Sep-12||145|
|Will robots take control of the humans?||http://www.datasciencecentral.com/forum/topics/will-robots-take-con...||15-Oct-14||145|
|An Exclusive Interview with Data Expert, John Bottega||http://www.datasciencecentral.com/group/announcements/forum/topics/...||10-Jun-14||145|
|Big Data: Planning for the Future (Conference)||http://www.datasciencecentral.com/group/announcements/forum/topics/...||15-Jul-14||145|
|e-Book: Machine Learning and Recommendation Engine||http://www.datasciencecentral.com/group/announcements/forum/topics/...||2-May-14||145|
|Hadoop Deployment Best Practices: Scalability, Robustness, Flexibility||http://www.datasciencecentral.com/group/announcements/forum/topics/...||14-Aug-14||145|
|How to Build Dashboards That Persuade, Inform and Engage - Tableau||http://www.datasciencecentral.com/group/announcements/forum/topics/...||12-Jun-14||145|
|Learn How to be Data Driven||http://www.datasciencecentral.com/group/announcements/forum/topics/...||12-Nov-14||145|
|Live webinar: The Art and Science of Data Visualisation||http://www.datasciencecentral.com/group/announcements/forum/topics/...||23-Sep-14||145|
|Selecting the right text analytics tool for the job||http://www.datasciencecentral.com/group/announcements/forum/topics/...||14-Oct-14||145|
|This Month in Data Science: October 2014||http://www.datasciencecentral.com/group/announcements/forum/topics/...||5-Nov-14||145|
|Uplift Modeling: Super Hot Topic at Predictive Analytics World Boston||http://www.datasciencecentral.com/group/announcements/forum/topics/...||12-Sep-14||145|
|White Paper: Retail Analytics in Finance Industry||http://www.datasciencecentral.com/group/announcements/forum/topics/...||22-Oct-14||145|
|Fast clustering algorithms for massive datasets||http://www.datasciencecentral.com/group/research/forum/topics/fast-...||26-May-14||145|
|Big Data & Haute Cuisine||http://www.datasciencecentral.com/profiles/blog/show?id=6448529:Blo...||3-Dec-14||145|
|Data Science Meets Bubbly: What Data Says About Champagne Buying Patterns||http://www.datasciencecentral.com/profiles/blog/show?id=6448529:Blo...||24-Dec-14||145|
|10 tips for working with Hadoops | Cloudera||http://www.analyticbridge.com/profiles/blogs/10-tips-for-working-wi...||9-Dec-11||145|
|1010data's Unique Big Data Analytics Platform Sees Stunning Growth in 2011||http://www.analyticbridge.com/profiles/blogs/1010data-s-unique-big-...||3-Jan-12||145|
It does look like Google Analytics is feeding you histogram bins rather than the actual view counts for the lower pageviews, as some type of data reduction. Maybe it has to do with DB access as well.
As a matter of fact, it could be the same histogram for all pageviews: Google Analytics would just drop the empty bins, and with a small bin size of 16 you are likely to find only one article per bin at very high pageviews.
A search for duplicates in the pageviews column could help confirm this, as the number of duplicates should decrease with increasing pageviews.
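A duplicate search of this kind could be sketched as follows. The helper `duplicate_share` is a made-up name, and the sample list holds only a few values copied from the tables above, so it illustrates the idea rather than giving a result for the full data set.

```python
from collections import Counter


def duplicate_share(pageviews, low, high):
    """Fraction of values in [low, high) that are shared with at least one other article."""
    in_range = [v for v in pageviews if low <= v < high]
    if not in_range:
        return 0.0
    counts = Counter(in_range)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(in_range)


# A few values from the top and bottom of the tables above.
sample = [121817, 88148, 67000, 57150, 44549, 161, 161, 161, 145, 145, 145, 145]

print(duplicate_share(sample, 0, 1000))       # share of duplicates at low pageviews
print(duplicate_share(sample, 1000, 10**6))   # share of duplicates at high pageviews
```

If the binning idea holds, the share should fall steadily as the range moves toward higher pageviews.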
As for why GA would do that, I guess it is based on the observation that most web pages they analyse have only a small number of pageviews. Presenting the data as a histogram with a small bin size would reduce the number of values to store while still keeping high accuracy for the most significant results (high pageviews).
Benjamin, there's probably some binning / bucketization involved. You can request the counts for any data range, in real time, and it will return the results for the top 5,000 pages in real time, in about one minute.
I wonder if you are bumping into the GA limits for free accounts:
"If you exceed 10 million hits per month per property, there is no assurance that the excess hits will be processed. There are additional limits for specific client libraries."
Hi Greg, and nice to hear back from you! I think you've found the right explanation.
Happy New Year Vincent. Nice to connect again.
Regarding gaps: it's either the 10 million hits a month problem, or it may be sampling (Google uses this for queries that contain more than ~250k sessions). If it's sampling, you will see a yellow message above the report. Try limiting queries to smaller time frames and then merging the data.
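Splitting a long period into smaller time frames and merging the results could look roughly like this. It is a sketch under assumptions: `fetch` stands for whatever function you use to query pageviews per URL for a date range (hypothetical here, not a Google Analytics API call), and the 30-day chunk size is arbitrary.

```python
from collections import Counter
from datetime import date, timedelta


def split_range(start, end, days):
    """Yield inclusive (chunk_start, chunk_end) pairs covering start..end."""
    current = start
    while current <= end:
        chunk_end = min(current + timedelta(days=days - 1), end)
        yield current, chunk_end
        current = chunk_end + timedelta(days=1)


def merged_pageviews(fetch, start, end, days=30):
    """Query each chunk with the caller-supplied fetch function and sum pageviews per URL."""
    total = Counter()
    for chunk_start, chunk_end in split_range(start, end, days):
        total.update(fetch(chunk_start, chunk_end))
    return total


def fake_fetch(chunk_start, chunk_end):
    """Stand-in for a real query: pretends one URL gets one pageview per day."""
    return {"/example": (chunk_end - chunk_start).days + 1}


print(merged_pageviews(fake_fetch, date(2014, 1, 1), date(2014, 3, 31)))
```

Each smaller query is less likely to cross a sampling threshold, and the merged totals can then be compared with the single large query to see whether the gaps disappear.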
About the 0-second VisitDuration for one-page visits: it's default GA behaviour. When a person loads the first page, a hit is sent and the clock starts ticking. When the person goes to a second page, another hit is sent. VisitDuration is calculated as the difference between the timestamps of these two events. If the second hit is never sent, GA can't calculate VisitDuration.
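The behaviour described here can be mimicked with a toy model. This is a simplification of the comment above, not Google Analytics' actual implementation; the function name `session_duration` and the timestamps (in seconds) are made up for illustration.

```python
def session_duration(hit_timestamps):
    """Duration as the span between the first and last hit of a session, in seconds."""
    if not hit_timestamps:
        return 0
    return max(hit_timestamps) - min(hit_timestamps)


print(session_duration([100]))            # single pageview: 0
print(session_duration([100, 160, 400]))  # three pageviews: 300
```

With only one hit there is no second timestamp to subtract from, so the duration comes out as zero however long the visitor actually stayed on the page.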
Hope this helps.
Looking forward to seeing the results of your new strategy.
It really looks like you've got sampled data. In my experience it happens when your GA request results in too many rows or when it requires data more than 2 years old (smaller requests will not help in this case). But there should be a warning message about it; at least I get one in the RGA package.
Eduard, this was my feeling as well and that is what I tried to convey (probably not that well) in my previous comments.