This is data science from the trenches: both a case study and a tutorial for data scientist candidates. Here I illustrate how gut feelings, carefully selected data (rather than granular data), a full understanding of the business (horizontal knowledge), high-level vision, and outsourcing (to make data science almost free) combine to make a data science project successful. I also share with you the data set used in this project: the top 5,000 webpages from our network in the first few weeks of 2014, with rather detailed metrics; the time period is less than 3 months, but more than 1 month.
The goal here is to assess the effectiveness of our Google advertising campaigns, and how to better shift and optimize our traffic sources. This data science analysis is about our own Data Science Central websites and business model, but you will learn trends that apply to many other businesses, regarding Google, LinkedIn, Facebook, Twitter and Google+ traffic.
Example of complex optimization problem, with multiple maxima
1. Fun facts
Here are a few fun facts, as an appetizer:
2. Data science with no data scientist, no data silo
The data scientist in charge of this study is the co-founder of the company. As a lean start-up with zero employees (generating 10 times more revenue, with an 80% margin, than my previous money-losing VC-funded startup that had 20 employees), we simply don't hire a data scientist, though we have the money to do so if we wanted. Instead, an executive familiar with all business aspects (me) spends 5% of his time on this type of investigation, and the money saved on a data scientist goes to profit sharing. Many executives nowadays, especially in technology companies like ours, have strong analytic acumen (and background) and can do just as we do.
All the data collected comes from vendors: reporting is 100% outsourced and costs little to nothing. We get
There's no silo: one guy (me) has access to, understands, and blends all the data, including financial data (revenue, costs) broken down per product.
Here we will focus on Google Analytics data, as well as financial data, at a rather high level. The roles of data scientist and business analyst overlap here: one person wears both (and many more) hats, saving a lot of money in payroll to allow us to better compete with other, over-staffed companies.
3. No need to create your own big data: instead, leverage external big data, at no cost
I've worked only with the top 5,000 webpages, but it would have been very easy to download the Google Analytics data for all 40,009 active pages. The reason to work with 5,000 is to get you familiar with sampling, and to prove that you can get great results with just sample data. Since this tutorial is part of our data science apprenticeship, it is important that you become familiar with sample data.
Regarding our Google advertising campaigns, we selected keywords suggested by Google itself rather than doing our own time-consuming research. This amounts to leveraging Google's massive, multi-billion database of priced keywords, at no cost. A next step would be to customize bids per keyword and ad group, but I believe it won't provide much added value. Finding the top 20 keywords that need customized bids and optimizing them is good enough. More than that, and we might be doing data science that produces negative ROI after factoring in the cost of data science. Of course it's a different story if you are eBay and manage 10 million keywords.
Another weakness is that we don't yet track conversions (new members) in Google Analytics. But we have a pretty good idea about conversions, and we even created a metric called value, attached to each web page in the data set that we share with you in the last section. We will soon track conversions in Google Analytics, as this will help us drop poor-performing keywords: it is worth the effort (it also helps Google optimize our campaigns for conversions rather than for clicks, a much better solution).
I bought a book on Google AdWords (cost $50) and learned one great thing: how to set up display campaigns where your ads show up only on websites that you have selected (such as our competitors that accept Google ads, or other data science websites). This saved me a lot of money compared with attending classes or hiring a Google expert. Also, we might use the services of a SEM/SEO company in the future, but again it will be outsourced (vendor relationship). And for now, since our network is sitting on the Ning platform (saving sys admin and server costs), we automatically benefit from Ning's SEO efforts. The reason I mention this is to show how analytic thinking and gut feeling help decide how deep you want to go with data science. As a small, lean start-up, we don't want to over-spend, and we have a pretty good idea of when we spend too much (for instance, if all this activity eats more than 20% of our budget, unless it boosts total revenue).
One of the nice things is that all reporting activities are automated.
4. Interpreting results, transforming insights into actions
Our problem is complex. We don't have a dollar amount attached to a conversion, and in general, we don't charge clients by number of impressions or clicks: we typically offer fixed fees with guaranteed numbers in terms of leads, impressions or clicks.
We want to keep ad spend below 10% of our budget (our margin is currently 80%). Currently ad spend is about 4% of our gross revenue. We can't easily increase this 4% figure, because to get more traffic we would need to increase our bids, which would eventually generate negative ROI. It is important that you know your break-even point. For us, the maximum cost of acquisition of a conversion (to maximize revenue) has not been fully calculated yet, but it is below $10, which translates into a maximum of $3 per Google paid click.
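The link between the two ceilings quoted above can be sketched in a few lines of Python. This is only an illustration: the function and variable names are made up, and the click-to-conversion rate is not a measured figure, it is simply the rate implied by taking the $10 and $3 ceilings together.

```python
def max_cost_per_click(max_cost_per_conversion, click_to_conversion_rate):
    """Highest bid per click that keeps the cost per conversion at or under the ceiling."""
    if not 0 < click_to_conversion_rate <= 1:
        raise ValueError("click_to_conversion_rate must be in (0, 1]")
    return max_cost_per_conversion * click_to_conversion_rate


# Ceilings quoted in the text.
MAX_COST_PER_CONVERSION = 10.0  # dollars per new member
MAX_CPC_FROM_TEXT = 3.0         # dollars per Google paid click

# Assumption: the two ceilings are tied together by a single conversion rate.
implied_rate = MAX_CPC_FROM_TEXT / MAX_COST_PER_CONVERSION
max_cpc = max_cost_per_click(MAX_COST_PER_CONVERSION, implied_rate)

print(f"Implied click-to-conversion rate: {implied_rate:.0%}")
print(f"Maximum cost per click: ${max_cpc:.2f}")
```

If the real conversion rate on paid clicks turns out to be lower than the implied one, the same function gives a lower bid ceiling, which is one more reason to start tracking conversions.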
The impact of ad spend on our traffic (page views) is small: less than 3%. But it is much bigger on conversions (our main source of revenue), accounting for 25% of all conversions, and it diversifies our conversion sources to minimize risks. An easy way to measure the 25% is to use a different landing page for each source (combined with proper tagging of the conversion URL) so that we can identify the origin of each conversion (Google AdWords, direct traffic, LinkedIn, etc.). Or you can turn Google AdWords on and off and see the impact.
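Once each conversion carries a source tag (from its landing page or tagged URL), computing the share per source is a simple count. Here is a minimal sketch in Python; the tag names and the sample records are placeholders made up for illustration (chosen so that the AdWords share matches the 25% quoted above), not our actual data.

```python
from collections import Counter


def conversion_share_by_source(conversion_sources):
    """Return {source: fraction of all conversions} from a list of source tags."""
    counts = Counter(conversion_sources)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {source: n / total for source, n in counts.items()}


# Placeholder records: one source tag per conversion (hypothetical values).
sample_conversions = (
    ["adwords"] * 5
    + ["direct"] * 8
    + ["google_organic"] * 4
    + ["linkedin"] * 3
)

shares = conversion_share_by_source(sample_conversions)
for source, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{source}: {share:.0%}")
```

The on/off approach mentioned above measures something slightly different: it also captures the members who would have joined through a free channel anyway.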
Note that we purchase mostly US / UK / Canada / Australia traffic and avoid midnight to 4am traffic, in order to increase the quality of the paid traffic that we receive from Google: this is another way to leverage a vendor's (Google) big data capabilities without incurring big data costs. Indeed, now our Google paid traffic is better than our Google organic traffic, as it is well targeted and focused on driving traffic to the conversion page.
For every $100 of revenue that we make, $15 is coming from impressions (page views) and $70 is linked in some ways to the number of active subscribers and members.
Google ad spend eats $4 (from these $100) but produces only 2% of impressions. We haven't done survival analysis to assess how many page views a user generates over his lifetime, broken down by acquisition channel. Plus, attribution modeling would suggest that some of the new users coming from Google ads would still be acquired by a cost-free channel, if we did not use any Google advertising.
Nevertheless, it is clear that Google ad spend has negative ROI with respect to page views, but the total dollar amount is small. Since we don't operate in silos, we also check the impact on conversions (subscribers, members). We estimate that 25% of our new members come from Google ads, that is, 25% of the $70 in revenue can be attributed to Google ads. However, some of these members would join via a free channel if we did not advertise, and users acquired through Google ads may have higher churn (just a wild guess), so I'll reduce the 25% to 15%. In short, $10.50 (15% of the $70 in revenue) costs us $4 in Google ad spend. Thus Google ad spend works for us, and we can even increase our CPC and budget.
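The back-of-the-envelope arithmetic above can be written down as a short Python sketch. All figures are per $100 of revenue and come straight from the text; the variable names are mine, and the 15% share is the guess-based discount described above, not a measured value.

```python
# All amounts are dollars per $100 of total revenue.
IMPRESSION_REVENUE = 15.0   # revenue tied to page views
MEMBER_REVENUE = 70.0       # revenue tied to subscribers and members
AD_SPEND = 4.0              # Google ad spend

ADS_SHARE_OF_IMPRESSIONS = 0.02   # Google ads produce about 2% of impressions
ADS_SHARE_OF_MEMBERS = 0.15       # 25% estimated, discounted to 15% (a guess)

# Page-view side: far below the $4 spent, hence negative ROI on page views alone.
page_view_return = IMPRESSION_REVENUE * ADS_SHARE_OF_IMPRESSIONS

# Conversion side: this is where the ad spend pays for itself.
member_return = MEMBER_REVENUE * ADS_SHARE_OF_MEMBERS

total_return = page_view_return + member_return
net_gain = total_return - AD_SPEND

print(f"Return from page views:  ${page_view_return:.2f}")
print(f"Return from new members: ${member_return:.2f}")
print(f"Net gain after ad spend: ${net_gain:.2f}")
```

Changing ADS_SHARE_OF_MEMBERS shows how sensitive the conclusion is to the discount: in this simple model the spend stops paying for itself once the combined return drops below $4.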
However, the situation is more complicated than it seems at first glance. Getting more traffic makes sense if we get more revenue. We could increase the fee for our services (email blasts) if we deliver to more subscribers, resulting in more clicks and more leads for the clients. But this is not obvious: increasing prices can deter clients, and clients also have fixed budgets. We can easily get more clients, but we cannot send more than one blast per day: at some point, our inventory is fully booked. We could segment our member database and send more blasts to smaller, more targeted groups of people. That's the way to go to grow revenue along with traffic. Another way is to reach an equilibrium, have our company run on auto-pilot, and start another one (maybe a community for astronomers) and then another one. We are definitely contemplating this option.
Note: Google Analytics reports contain a column called Page Value, based on conversion and revenue per page, for each of the 40,009 active pages. We don't track conversions yet in Google Analytics, but we've found a good proxy for page value, using two other columns from Google Analytics reports, Entrances and % Exit. Then Page Value = Entrances * (2 - % Exit). Entrances is the number of times the page in question is an entrance page; % Exit is the proportion of times it's an exit page. If Entrances = 1,000 and % Exit = 30%, you can expect at least 0.7 extra page views (0.7 = 1 - 30%) after the entrance, providing a conservative page value of 1,000 * (2 - 0.30) = 1,700 single page views attributable to the page in question.
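For readers who want to apply the proxy to their own report, here is a minimal Python sketch of the formula. The function name is mine, and I assume % Exit has already been converted from a percentage to a proportion between 0 and 1.

```python
def page_value(entrances, exit_rate):
    """Proxy page value: Entrances * (2 - % Exit), with % Exit as a proportion."""
    if entrances < 0:
        raise ValueError("entrances must be non-negative")
    if not 0.0 <= exit_rate <= 1.0:
        raise ValueError("exit_rate must be a proportion between 0 and 1")
    # One page view for the entrance itself, plus at least (1 - exit_rate) more.
    return entrances * (2.0 - exit_rate)


# Worked example from the text: 1,000 entrances, 30% exit rate.
example_value = page_value(1000, 0.30)
print(f"Page value: {example_value:,.0f} page views")
```

A page on which every visitor exits still scores its own entrances (2 - 1 = 1), while a page that nobody exits from scores twice its entrances, which is why the proxy is conservative.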
Based on this analysis, we decided to:
Future steps will involve automated content syndication and content mix optimization. In particular, detecting how to optimize the following mix:
5. Year-to-year comparison
The year-over-year tab in the spreadsheet (see next section) shows spectacular growth (> 80%) in incoming traffic, for both Google organic traffic and direct traffic (driven by email campaigns). Together, Google organic and direct traffic represent 66% of the visits: 33% for Google organic (a perfectly normal number, especially since we don't do any SEO) and 33% for direct traffic (quite good, with growth driven by membership growth after factoring in churn). LinkedIn, although its traffic is better, with more page views per visit, is barely growing, which is good since our reliance on LinkedIn was too high in the past, representing a risk. Twitter is very promising and will eventually surpass LinkedIn in terms of share of incoming traffic. Our Twitter advertising campaigns contribute to this shift. Facebook and Google+ bring modest contributions, and we don't expect spectacular growth from these traffic sources, though Facebook advertising has gotten better over time (less fake traffic, more reasonable CPC).
Finally, we've noticed that LinkedIn and Google organic traffic sources are negatively correlated. The more we get from LinkedIn (by posting on LinkedIn), the less we get from Google, as the LinkedIn links to our articles show up above our internal links, on Google. This is actually an incentive for us to either do better SEO (to beat LinkedIn and the fact that Google wrongly attributes our articles to LinkedIn), or to post less on LinkedIn. We've chosen the latter. However, our posts on LinkedIn get re-tweeted or re-posted outside LinkedIn, resulting mostly in direct traffic to our website. We haven't quantified the amount of traffic indirectly generated via LinkedIn, but it might represent 10% of our direct traffic (based on some bit.ly statistics). The same applies to Facebook, Google+ and Twitter, but not to Google organic or paid traffic.
6. Get the data set
Click here to download the Google Analytics report, with traffic metrics for the top 5,000 pages and estimates for all 40,009 active pages on our websites, during the time period in question.
7. Other links