Here I describe a case study: a solution based on high-level data science. By high level, I mean data science not done by statisticians, but by decision makers accessing, digging into, and understanding summary (dashboard) data to quickly make a business decision with immediate financial impact. There is also a section on smart imputation techniques, with patentable, open intellectual property that we created after investigating this problem.
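This article is organized in three sections: high-level versus low-level data science; detecting and fixing a performance issue with high-level data science; and fixing a massive bias in Google Analytics via imputation methods.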
Figure 1: Sudden drop in AdWords performance starting on Day 2
Y-axis shows conversion rate, and X-axis shows traffic source
1. High-level versus low-level data science
I have previously discussed various breakdowns or categorizations of data science.
Here I introduce a new type of distinction: high-level versus low-level.
Most people think that data science is low-level data science only, but that's not the case. Note that low-level data science is to low-level programming what high-level data science is to high-level programming. The low-level layer is more technical and more complex: it's the layer on which the high-level rests. But the high-level layer requires different skills, including business acumen, leadership and domain expertise.
About the problem and data set
The problem studied here was solved by the decision maker (a data scientist), using high-level data only, that is, highly summarized data. The data scientist in question monitors carefully selected compound web traffic metrics (KPIs, usually ratios) and noticed a spectacular drop in one of these metrics.
Access to highly granular (low-level) data was not easy to get, and dashboard summaries, carefully selected and crafted, were sufficient to detect and address the issue with a one-week turnaround, doing a number of tests described in the next section.
More specifically, we used the Google Analytics dashboard. We did not access granular metrics such as IP addresses, detailed log-file transactions, or summary statistics broken down by user agent / referral combinations (not available from the dashboard). But we did use session duration, number of pages, and conversions, per day per referral, probing the summary data sometimes 2-3 times per day to check the results of a number of tests and fine-tuning; in short, to check and quantify the impact on performance. Performance here is measured as the number of real (not bogus) conversions per click, that is, the conversion rate. We also looked at the conversion rate per paid keyword per day, available from the dashboard. The statistics per user agent per referral would have been very useful, but were not available. The user agent alone proved very useful in a related problem.
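The KPI described above (conversions per click, per day per referral) can be sketched in a few lines. This is only an illustration: the row layout `(day, referral, clicks, conversions)` and the function name `conversion_rates` are assumptions, not an actual dashboard export format, and the numbers are made up.

```python
from collections import defaultdict

def conversion_rates(rows):
    """Aggregate (day, referral, clicks, conversions) rows into
    a {(day, referral): conversion_rate} dictionary.

    The row layout is hypothetical; adapt it to whatever your
    dashboard export actually provides.
    """
    clicks = defaultdict(int)
    conversions = defaultdict(int)
    for day, referral, n_clicks, n_conversions in rows:
        clicks[(day, referral)] += n_clicks
        conversions[(day, referral)] += n_conversions
    # Skip buckets with no clicks to avoid dividing by zero.
    return {key: conversions[key] / clicks[key]
            for key in clicks if clicks[key] > 0}

# Made-up numbers, for illustration only.
example_rows = [
    (1, "adwords", 1000, 220),
    (2, "adwords", 1000, 90),
    (1, "twitter", 400, 60),
    (2, "twitter", 0, 0),
]
rates = conversion_rates(example_rows)
```

Keeping the key as a `(day, referral)` pair mirrors the granularity mentioned in the text; adding the paid keyword as a third key component would work the same way.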
2. Detecting and fixing a performance issue with high-level data science
The detection of the problem is straightforward if you monitor the right KPIs, as all of us data scientists should do: see Figure 1. It might even be made easier (earlier detection) if there is a system of automated email alerts in place, designed by high-level data scientists. It is always a good idea to have an email alert system in place, for core metrics impacting revenue. Note that in Figure 1, none of the variations between Day 1 and Day 2 are significant, except the one for AdWords - both sharp and statistically significant (because based on a large number of sessions).
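One way to turn "sharp and statistically significant" into an automated alert rule is a two-proportion z-test on the day-over-day conversion rate. The sketch below is an assumption about how such a check could be built, not the alert system used in this case study; the function names, the `alpha` threshold, and the example counts are all hypothetical.

```python
import math

def conversion_drop_test(conv_before, clicks_before, conv_after, clicks_after):
    """Two-proportion z-test for a drop in conversion rate.

    Returns (z, p_value), where p_value is the one-sided probability of
    seeing a drop at least this large if the true rate had not changed.
    """
    p_before = conv_before / clicks_before
    p_after = conv_after / clicks_after
    pooled = (conv_before + conv_after) / (clicks_before + clicks_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / clicks_before + 1 / clicks_after))
    if se == 0:
        return 0.0, 1.0  # no variation at all: nothing to test
    z = (p_before - p_after) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal law
    return z, p_value

def should_alert(conv_before, clicks_before, conv_after, clicks_after, alpha=0.01):
    """Hypothetical alert rule: fire when the drop is significant at level alpha."""
    _, p_value = conversion_drop_test(conv_before, clicks_before,
                                      conv_after, clicks_after)
    return p_value < alpha

# Made-up counts: a large source with a sharp drop, and a small source
# with a similar-looking drop that rests on too few sessions.
large_source_alert = should_alert(220, 1000, 90, 1000)
small_source_alert = should_alert(5, 20, 2, 20)
```

With these made-up counts, the large source triggers an alert and the small one does not, which matches the point made above: the same apparent drop is only meaningful when it is based on a large number of sessions.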
It was initially believed that the performance drop (click conversion falling from above 20% to below 10%) was due to click fraud, though we did not exclude a reporting glitch initially. A reporting glitch was ruled out - as a contributing factor - when actual conversions, measured internally (rather than via Google Analytics), were also down. We also tested our own ads, simulating conversions, to see if they were properly tracked by Google. They were.
Anyway, we eliminated most major sources of click fraud: traffic from affiliates, and mobile traffic. We ended up with clicks from US-based IP addresses only, from Google.com only, and not from a mobile device. The pattern of poor performance continued nevertheless. We assumed it could be an advertiser trying to deplete budgets from its competitors, hitting all data science related keywords and advertisers (we've proved in the past, with a real test, that you can indeed eliminate all your competitors on Google AdWords). Such schemes take place when a domain becomes very popular, and the bidding more aggressive.
So we created another, smaller campaign - let's call it campaign B - with a slightly different mix of keywords, and especially using a redirect URL (www.datascienceworld.com) rather than the actual destination URL www.datasciencecentral.com, just in case the perpetrator was targeting us in particular. We noticed that when we paused the campaign for several hours, performance returned to normal levels upon resuming, but quickly fell again after a day or so. The use of an alternate campaign with a redirect URL did not help much. We tested using campaign A on odd days and campaign B on even days, with limited success.
We decided not to discuss the issue with Google (despite our high ad spend) as we thought - based on experience with other vendors - that it would be a waste of time. What was surprising is the fact that even with the new campaign B with fewer keywords, our daily budget was not depleted (we could not reach our daily ad spend target, we were well below), and the ads were still showing on Google searches as usual, for our target keywords. Only the conversions were missing. This is not consistent with click fraud, and to this day, we still don't know the cause. Maybe we now show up for keywords with poor performance: keywords for which our bid was previously too low, preventing (and protecting) us from winning the auction in the past. Maybe competitors abandoned these keywords last week, and now we "inherit" them. But I've only identified two such high volume, poor-performing keywords, and the issue is definitely broader than that.
So what did we do?
We reduced our ad spend on Google and boosted our very effective advertising on Twitter, as it is not subject to click fraud. Unlike on Google, you can't predict when our ad is going to show up on Twitter, and it shows up only on certain highly relevant profiles - if you are not one of them, you won't see it. This makes it much harder to generate fraudulent clicks.
We also diverted some of our Google ad spend to editorial initiatives: these two compete, and when ROI on Google AdWords drops too much, editorial initiatives win (to drive traffic to our network). We are also confident, based on our observations, that the problem will fix itself on Google AdWords, and that we will be able to return to higher levels of advertising (with Google) after careful testing and permanent monitoring.
Finally, we optimized our bidding to maximize total conversions (on Google AdWords), rather than total clicks. It is still too early to check if this strategy is working. Theoretically, it should automatically optimize our keyword mix, and abandon poor performers.
3. Fixing massive bias in Google Analytics via imputation methods
This bias impacts user engagement statistics. Figure 2 below shows a 40% bias in the way Google Analytics (and many of their competitors) measure session duration, on simulated but very typical web traffic. In short, the last page of a visit is not accounted for in the Google Analytics reports. It would be fine if visits had dozens of page views on average. Unfortunately, one-page visits are by far the most common on many websites, because pages per visit have a Zipf distribution. And in one-page visits, the first page is also the last page (not accounted for).
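The mechanism is easy to reproduce on a toy example. The sketch below uses made-up per-page times (in minutes) for a handful of sessions and compares the true duration with the duration obtained when the time spent on the last page is dropped, as described above. The function names and the numbers are illustrative assumptions; they are not the simulation behind Figure 2.

```python
def true_duration(page_minutes):
    """Total time actually spent in the session, last page included."""
    return sum(page_minutes)

def measured_duration(page_minutes):
    """Duration when the last page is not accounted for.

    For a one-page session this is zero, since the first page
    is also the last page.
    """
    return sum(page_minutes[:-1])

# Made-up sessions: each inner list holds the minutes spent on each page.
# One-page sessions dominate, as they do on many websites.
sessions = [[1.5], [0.2], [2.0, 1.0], [1.0, 1.0, 2.0], [0.8]]

total_true = sum(true_duration(s) for s in sessions)
total_measured = sum(measured_duration(s) for s in sessions)
underestimate = 1 - total_measured / total_true  # share of time that goes missing
```

The more one-page sessions there are in the mix, the larger the share of time that goes missing, which is why the Zipf-like distribution of pages per visit matters so much here.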
This bias is a bigger issue for digital publishers that own multiple channels, each one having its own domain name, rather than a sub-domain, and where visit paths typically span across multiple channels (e.g. from AnalyticBridge.com to DataScienceCentral.com to BigDataNews.com).
Figure 2: Google Analytics reports session duration 40% below real value
These zero-second, one-page visits (as well as 10-minute visits, whether two pages or fifty pages long) scare digital publishers and advertisers, because they look like artificial traffic. Correcting this Google Analytics error would be a win for Google, publishers, and advertisers alike. I have a proposal below on how to fix this. The challenge is to convince a company like Google to embrace statistical science techniques to make the change. These companies prefer exact numbers, even when they are actually very wrong (as previously described, a 40% error in this case; see Figure 2), to approximations that are far more accurate.
Correctly measuring duration of web sessions, via imputation
We focus here on measuring the duration of one-page sessions only (currently measured as zero), as this will solve the whole problem.
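The idea is to split the traffic into buckets and then, within each bucket, extrapolate the duration of one-page sessions from the sessions with two or more pages. These two steps are described next.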
The extrapolation step works as follows (see Figure 2). If, in a particular bucket, 2-page sessions last 1.60 minutes (on average), then these 1.60 minutes are actually spent on page 1. So instead of assigning a session duration of zero minutes to one-page sessions, you now assign a duration of 1.60 minutes. Likewise, the duration of a 2-page session is increased by 1.60 minutes. However, this is a very rough and biased approximation: many one-page sessions are hard bounces or user errors (clicking the wrong link) that last just 1 second. So you need to look at the full distribution of duration - as a function of the number of pages per session - to make a better extrapolation. The end result will likely be a value closer to 0.80 minutes for one-page sessions, for the traffic bucket in question.
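The rough version of the extrapolation step can be sketched as follows. The function names are hypothetical, and the `bounce_share` parameter is only a crude stand-in (an assumption of this sketch) for the full-distribution analysis that the text calls for: it mixes near-instant bounces with "engaged" one-page sessions, and is not the refined method itself.

```python
from statistics import mean

def impute_one_page_duration(two_page_measured_minutes,
                             bounce_share=0.0, bounce_minutes=1 / 60):
    """Estimate the duration of a one-page session within one traffic bucket.

    two_page_measured_minutes: measured durations of 2-page sessions in the
        bucket; since the last page is not counted, this is time on page 1.
    bounce_share: hypothetical fraction of one-page sessions that are hard
        bounces lasting about bounce_minutes (1 second by default).
    """
    first_page_minutes = mean(two_page_measured_minutes)
    return bounce_share * bounce_minutes + (1 - bounce_share) * first_page_minutes

def corrected_duration(measured_minutes, imputed_last_page_minutes):
    """Add the imputed time on the last page to a measured session duration."""
    return measured_minutes + imputed_last_page_minutes

# Made-up bucket: 2-page sessions measured at 1.2 and 2.0 minutes.
naive = impute_one_page_duration([1.2, 2.0])                      # about 1.60
damped = impute_one_page_duration([1.2, 2.0], bounce_share=0.5)   # lower
two_page_fixed = corrected_duration(1.2, naive)
```

Any non-zero `bounce_share` pulls the estimate below the naive value, in the same direction as the correction discussed above; how far it should be pulled is exactly what the full distribution of durations is needed for.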
The bucketization step (sometimes called multivariate binning) consists of identifying metrics (and combinations of 2-3 metrics) with high predictive power, then combining and binning them appropriately, to reduce intra-bucket variance while keeping the buckets big enough. This is complex data science, and you can find details in my article on fast combinatorial feature selection. It certainly helps to have domain expertise, and in this case, a bucket of traffic can simply be defined as traffic occurring on the same day, from the same referral. Small referrals must be categorized (paid, organic, search, social, mobile, syndicated traffic etc.) and then aggregated by category, to have big enough buckets. And as in all inference processes, don't forget the very important cross-validation step.
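The simple, domain-driven version of bucketization (same day, same referral, with small referrals merged by category) might look like the sketch below. The session tuple layout, the `category_of` mapping, the `min_size` threshold, and the `category:` key prefix are all assumptions made for illustration; feature selection and cross-validation are not shown.

```python
from collections import Counter, defaultdict

def bucketize(sessions, category_of, min_size=1000):
    """Group sessions into (day, referral) buckets.

    sessions: iterable of (day, referral, pages, minutes) tuples.
    category_of: dict mapping a referral to its category
        (paid, organic, social, ...); unknown referrals go to "other".
    min_size: buckets with fewer sessions than this are merged into a
        (day, "category:<name>") bucket instead.
    """
    sessions = list(sessions)
    sizes = Counter((day, referral) for day, referral, _, _ in sessions)
    buckets = defaultdict(list)
    for day, referral, pages, minutes in sessions:
        if sizes[(day, referral)] >= min_size:
            key = (day, referral)
        else:
            # The prefix keeps category keys from colliding with referral names.
            key = (day, "category:" + category_of.get(referral, "other"))
        buckets[key].append((pages, minutes))
    return dict(buckets)

# Tiny made-up example, with min_size lowered so the merge is visible.
example = [
    (1, "google.com", 1, 0.0), (1, "google.com", 2, 1.4),
    (1, "smallblog.example", 1, 0.0), (1, "tinyforum.example", 2, 2.1),
]
categories = {"smallblog.example": "syndicated", "tinyforum.example": "syndicated"}
buckets = bucketize(example, categories, min_size=2)
```

In this toy run the two small referrals end up in a single syndicated-traffic bucket for Day 1, while the larger referral keeps its own bucket.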
You can go one step further by defining a similarity metric between data buckets, and extrapolating using multiple, similar buckets, rather than just the target bucket alone. This will automatically give you confidence intervals for the duration of a one-page visit, at the bucket level. You will likely need Hadoop or some Map-Reduce architecture for these computations, if you have more than a few million buckets: you will have to create a giant hash table of all buckets, to store, for each bucket ID, its list of 5-10 most similar bucket IDs.
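On a small scale, the hash table of similar buckets and the resulting interval can be sketched in memory. Everything here is an assumption made for illustration: the feature vectors describing each bucket, the Euclidean distance used as the (dis)similarity metric, and the rough "mean plus or minus two standard errors" interval. The brute-force neighbor search is quadratic in the number of buckets, which is precisely why a distributed architecture is suggested above for millions of buckets.

```python
import math
from statistics import mean, stdev

def nearest_buckets(features, k=5):
    """Build {bucket_id: [k most similar bucket IDs]}.

    features: dict mapping bucket_id -> tuple of numeric features
        (hypothetical; e.g. pages per session, share of paid traffic).
    """
    table = {}
    for bucket_id, vector in features.items():
        ranked = sorted((math.dist(vector, other_vector), other_id)
                        for other_id, other_vector in features.items()
                        if other_id != bucket_id)
        table[bucket_id] = [other_id for _, other_id in ranked[:k]]
    return table

def pooled_estimate(bucket_id, neighbors, first_page_minutes):
    """Estimate one-page duration from a bucket and its similar buckets.

    Returns (center, (low, high)), where the interval is a rough
    mean +/- 2 standard errors across the buckets used.
    """
    ids = [bucket_id] + neighbors[bucket_id]
    values = [first_page_minutes[i] for i in ids]
    center = mean(values)
    if len(values) < 2:
        return center, (float("-inf"), float("inf"))
    half_width = 2 * stdev(values) / math.sqrt(len(values))
    return center, (center - half_width, center + half_width)

# Made-up buckets described by two hypothetical features each.
features = {"a": (0, 0), "b": (0, 1), "c": (5, 5), "d": (0, 2)}
first_page = {"a": 1.0, "b": 2.0, "c": 9.0, "d": 3.0}
neighbors = nearest_buckets(features, k=2)
center, interval = pooled_estimate("a", neighbors, first_page)
```

A plain dictionary plays the role of the hash table here; the same structure (bucket ID as key, short list of neighbor IDs as value) is what would be sharded across machines at larger scale.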