While looking at the 2000+ most popular live articles on DSC, based on Google Analytics numbers, we found an interesting pattern.
The bottom 50% of these articles, those with pageview counts between 145 and 500 (so the least popular of the most popular group), have gaps in the pageview distribution, see below. The gaps are also widespread in the other half, the top 50%.
Our intern Livan is investigating this as part of our new growth hacking strategy, to be deployed soon (details to be published soon; the project involves categorizing popular articles, identifying time-sensitive articles such as events, and much more).
This study covered 2,000+ articles totaling more than 3 million page views, out of more than 40,000 live articles.
If you look at the above figure, 38 articles had 145 pageviews, 71 articles had 161 pageviews, but none had a pageview count between 145 and 161. Note that the gap in the distribution (161-145 = 16, 177-161 = 16) is always equal to 16. Any idea why this is happening?
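The gap arithmetic above can be checked mechanically. The snippet below is a minimal sketch, not part of the original analysis: it takes a list of pageview counts and returns the differences between consecutive distinct values. The function name `gap_sizes` is hypothetical, and the input is limited to the three values quoted in this post.

```python
def gap_sizes(pageviews):
    """Return the differences between consecutive distinct pageview values."""
    distinct = sorted(set(pageviews))
    return [b - a for a, b in zip(distinct, distinct[1:])]


# The three pageview values quoted in the post (not the full export).
observed = [145, 161, 177]
print(gap_sizes(observed))  # [16, 16]
```

Running the same function on the full pageviews column of the export would show whether the spacing stays at 16 throughout or occasionally drifts.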
We've found other oddities in Google Analytics reporting, such as the fact that all sessions with only one pageview last zero seconds (see section 3 after clicking on the link, where we propose a solution to this issue).
I apologize in advance if this sounds really naive of me (I have not used Google Analytics before), but this looks to me like a histogram. So wouldn't this be normal? However, what I find funny is that the bin size is not exactly 16 but more like 16.1 (apparently you get a couple of bins that are 17 wide), then rounded down (for display, I hope).
Anyway, I wish I could be more enlightening here.
That's what the raw data coming from Google Analytics looks like (top 15 entries or so):
Here's the bottom:
It does look like Google Analytics is feeding you histogram bins rather than the actual view counts for the lower pageviews, presumably some type of data reduction they apply. Maybe it has to do with DB access as well.
As a matter of fact, it could be the same histogram for all pageviews, Google Analytics would just drop the empty bins and you are likely to find only 1 article per bin at very high pageviews and with a small bin size of 16.
A search for duplicates in the pageviews column could help emphasize that as the number of duplicates should decrease with increasing pageviews.
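A minimal sketch of that duplicate search, assuming the pageviews column has been loaded into a plain Python list. The function name `duplicate_counts` is hypothetical. In the sample, the counts for 145 and 161 come from the post (38 and 71 articles), while the three high values are made up purely to illustrate pageview levels that are not shared.

```python
from collections import Counter


def duplicate_counts(pageviews):
    """Map each pageview value shared by two or more articles to its article count."""
    counts = Counter(pageviews)
    return {value: n for value, n in sorted(counts.items()) if n > 1}


# 38 articles at 145 and 71 at 161 are from the post;
# 5200, 9871 and 20340 are invented high-traffic values for illustration.
sample = [145] * 38 + [161] * 71 + [5200, 9871, 20340]
print(duplicate_counts(sample))  # {145: 38, 161: 71}
```

If the binning idea holds, the real export should show many shared values at the low end and almost none at the high end.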
As for why GA would do that, my guess is that since most webpages they analyze have only a small number of pageviews, presenting the data as a histogram with a small bin size reduces the number of values to store while still keeping high accuracy for the most important/significant results (high pageviews).
Benjamin, there's probably some binning / bucketization involved. You can request the counts for any data range, in real time, and it will return the results for the top 5,000 pages in real time, in about one minute.
I wonder if you are bumping into the GA limits for free accounts:
"If you exceed 10 million hits per month per property, there is no assurance that the excess hits will be processed. There are additional limits for specific client libraries."
Hi Greg, and nice to hear back from you! I think you've found the right explanation.
Happy New Year Vincent. Nice to connect again.
Hi Vincent.
Regarding gaps: it's either the 10 million hits a month problem, or it may be sampling (Google uses this for queries which contain more than ~250k sessions). If it's sampling, you will see a yellow message above the report. Try limiting queries to smaller time frames and then merging the data.
About 0 seconds VisitDuration for 1-page visits: it's default GA behaviour. When a person loads the first page, a hit is sent and time starts to tick. When the person goes to a second page, another hit is sent. VisitDuration is calculated as the difference between the timestamps of these two events. If a second hit is never sent, GA can't calculate VisitDuration.
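To make the timestamp logic concrete, here is a minimal sketch of it. This is not GA's actual code: the function name `visit_duration` and the use of plain integer seconds are assumptions, and for sessions with more than two hits it simply takes the span from the first hit to the last.

```python
def visit_duration(hit_timestamps):
    """Seconds between the first and last hit of a session.

    With a single hit there is no second timestamp to subtract from,
    so the duration is reported as 0.
    """
    if len(hit_timestamps) < 2:
        return 0
    return max(hit_timestamps) - min(hit_timestamps)


print(visit_duration([1000]))        # one pageview, no second hit: 0
print(visit_duration([1000, 1060]))  # second page loaded 60 seconds later: 60
```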
Hope this helps.
Looking forward to seeing the results of your new strategy.
Best regards,
Ivan
It really looks like you've got sampled data. In my experience it happens when your GA request returns too many lines or requires data more than 2 years old (smaller requests will not help in that case). But there should be a warning message about it; at least I get one in the RGA package.
Eduard, this was my feeling as well and that is what I tried to convey (probably not that well) in my previous comments.
© 2021 TechTarget, Inc.