Marrying computer science, statistics and domain expertise

I was reading an article by Google scientists about how to predict ad clicks based on the user query and the text ad. The article, written by a team of Google scientists - surely experts from top universities - focuses on the very large number of metrics in the model (billions of features), the use of logistic regression (a statistical technique), and the optimization techniques (numerical analysis, gradient methods) used to solve the logistic regression, that is, to find the optimum regression coefficients. As you would expect, they discuss at length how the feature space is sparse, and how to take advantage of sparsity to design an efficient algorithm.

All of this looks great, and it is certainly a textbook example of a correct, good, interesting application of machine learning. Indeed, in my opinion, this is computer science.

I have two criticisms, and they are the same for pretty much all the "pure science" work that I have read. They boil down to (1) using a nuclear weapon to kill a fly, and not realizing it, and (2) not showing the lift over a basic methodology designed by domain experts - in this case, experts with deep expertise simultaneously in ad technology, business management and statistics. While Google is doing better than many companies with its algorithms, I think it could do even better with less effort, by focusing less on science and more on domain expertise.

This is precisely the gap that data science is trying to bridge.

Let me illustrate my point with this ad prediction technique developed by Google.

1. What are we trying to accomplish?

That's the first question all data scientists should ask. Here, maybe Google is trying to maximize the number of clicks delivered to advertisers, to boost its revenue. Maybe the research paper is to be used by Google internally. Or maybe, the purpose is to help advertisers, who always want more clicks - as long as they are converting to sales.

  • If the paper is for Google's internal use, there should be a discussion about the fact that boosting click-through rate (CTR) to increase Google's revenue only works short term. Over-boosting CTR (by the publisher, in this case Google) eventually results in lower ROI for many advertisers, as we have experienced countless times. At the very least, there should be a discussion about long-term goals (boosting conversions) along with short-term goals (boosting CTR). Both are necessary and cannot be considered separately in any business optimization problem.
  • If the paper is for advertisers, it misses the point: most advertisers (those interested in real traffic by real humans) are interested in conversions. It is very easy for advertisers to change the wording in their ads and add keywords to their campaigns to generate tons of clicks and... negative ROI. The exceptions are advertisers who are publishers themselves, and bill their advertising clients downstream using a per-impression model (where a click from Google is an impression for their clients) - in short, click arbitragers.

2. Do we need a nuclear weapon to kill a fly?

Using billions of features, most of them almost never triggered, makes no sense. How do you handle co-dependencies among these features, and what statistical significance do you get from the 99.9% of features that are triggered no more than 3 times in 500 billion observations (clicks)? Sure, you could do some feature blending and clustering - a rather expensive technique, computationally speaking - but I think this issue of feature aggregation was not even discussed in their paper.
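To make the significance problem concrete, here is a minimal sketch of a confidence interval for the click rate attached to one feature. It uses the standard Wilson score interval, which is my choice for illustration and not something taken from the Google paper; the function name and the example counts are made up.

```python
import math


def wilson_interval(clicks: int, impressions: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a click rate (about 95% coverage when z = 1.96)."""
    if impressions == 0:
        # No data at all: the rate could be anything between 0 and 1.
        return (0.0, 1.0)
    p = clicks / impressions
    denom = 1 + z * z / impressions
    center = (p + z * z / (2 * impressions)) / denom
    half = z * math.sqrt(p * (1 - p) / impressions + z * z / (4 * impressions ** 2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Hypothetical counts: a feature triggered 3 times versus one triggered 3,000 times,
# both with the same observed click rate of one third.
rare = wilson_interval(1, 3)
common = wilson_interval(1000, 3000)
print("rare feature:  ", rare)
print("common feature:", common)
```

With 1 click out of 3 observations the interval spans roughly 6% to 79%, so a feature triggered only 3 times tells you almost nothing about its true effect, however many billions of such features you have.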

Also, the vast majority of these features are probably created automatically, through a feature generation algorithm. This is by far the most intriguing component of their system - but it is not discussed in the paper. It's a combinatorial optimization problem, looking at all relationships (ratios, products, log transformations and other mappings such as IP category) among a set of base metrics such as log-file fields, to discover features with predictive power. Some features are also created in bulk by analysts looking at data. This set of billions of features could very well be missing 2 or 3 core (but non-obvious) features that would make the algorithm far superior. Google does not mention any of the features used in their algorithm in the paper in question.
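As a rough illustration of what such a feature generator might look like, the sketch below builds log transforms, pairwise products and pairwise ratios from a handful of base metrics. This is an assumption about the general idea only: the paper does not describe its generator, and the metric names and helper functions here are hypothetical.

```python
import math
from itertools import combinations


def generate_candidates(row: dict[str, float]) -> dict[str, float]:
    """Derive candidate features (logs, products, ratios) from base metrics."""
    out: dict[str, float] = {}
    for name, value in row.items():
        if value > 0:  # log is only defined for positive values
            out[f"log({name})"] = math.log(value)
    for a, b in combinations(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
        if row[b] != 0:
            out[f"{a}/{b}"] = row[a] / row[b]
        if row[a] != 0:
            out[f"{b}/{a}"] = row[b] / row[a]
    return out


def candidate_count(n_base: int) -> int:
    """Upper bound on candidates: n logs, plus one product and two ratios per pair."""
    pairs = n_base * (n_base - 1) // 2
    return n_base + 3 * pairs


# Hypothetical base metrics taken from one log-file record.
row = {"clicks": 4.0, "impressions": 100.0, "session_seconds": 20.0}
for name, value in generate_candidates(row).items():
    print(name, value)
print("candidates from 50 base metrics:", candidate_count(50))
```

Even this one-level scheme grows quadratically with the number of base metrics, and nesting the transformations grows much faster, which is one plausible way to end up with billions of rarely triggered features.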

I believe that you can solve this ad click prediction problem with just a couple of features (a feature is a variable) carefully selected by a domain expert. Here are the ones that I would choose; I believe they are unlikely to be created by an automated feature generation algorithm.

My recommended features to predict ad clicks

  • Does the keyword category match the category assigned to the text ad? This means that you have an algorithm to assign categories to a user query and to a text ad. It also means that you have another algorithm to standardize user queries, one that can discriminate e.g. between mining data (data about mining) and data mining (Google algorithms can't). And it means that you have a list of 500 categories, 100,000 sub-categories and 3 million sub-sub-categories, enough to cover 99.99% of all commercial user queries (where advertisers are bidding). Note that a keyword can have 2 or 3 terms, as in car insurance Alabama, and two categories, such as insurance and regional.
  • Special keywords found in text ad (e.g. 2013, new, free, best)
  • Both text ad and user query share the same rare sub-sub-category (this increases the odds of clicking)
  • Advertiser type: arbitrager (does not care if a click does not convert - very high CTR) or real advertiser (usually has very low CTR)
  • Is this the first, second or third time the user sees this ad? Good ads work well initially, but if you never change your text ad, it will stop performing and CTR will go down.
  • Ad and user query related to a popular event.
  • Is the domain name listed in the text ad trustworthy and respected? You need an algorithm that scores domains, broken down by category; Google has PageRank, but that is not granular enough.
  • Presence of special characters, capital letters
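A few of the features above can be sketched in code. Everything here is a toy under stated assumptions: the tiny CATEGORY_OF table stands in for the real taxonomy and query standardization algorithms, the advertiser type and domain trust score are assumed to come from other systems, and the rare sub-sub-category and popular-event features are left out because they need external data.

```python
import re

# Hypothetical toy taxonomy (term -> category); a real one would be far larger.
CATEGORY_OF = {"car": "auto", "insurance": "insurance", "alabama": "regional"}
SPECIAL_KEYWORDS = {"2013", "new", "free", "best"}


def tokens(text: str) -> list[str]:
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def categories(text: str) -> set[str]:
    """Categories of all known terms in the text."""
    return {CATEGORY_OF[t] for t in tokens(text) if t in CATEGORY_OF}


def capital_ratio(text: str) -> float:
    """Share of letters that are upper case."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def ad_features(query: str, ad_text: str, times_seen: int,
                is_arbitrager: bool, domain_trust: float) -> dict:
    """Build a small feature dictionary for one (query, ad) pair."""
    return {
        "category_match": bool(categories(query) & categories(ad_text)),
        "special_keyword_count": len(set(tokens(ad_text)) & SPECIAL_KEYWORDS),
        "times_seen": min(times_seen, 3),  # first, second, or third-and-later view
        "is_arbitrager": is_arbitrager,    # assumed to be supplied upstream
        "domain_trust": domain_trust,      # assumed score from a domain-scoring system
        "has_special_chars": bool(re.search(r"[^A-Za-z0-9\s.,]", ad_text)),
        "capital_ratio": capital_ratio(ad_text),
    }


print(ad_features("car insurance Alabama", "Best FREE car insurance quotes!",
                  times_seen=1, is_arbitrager=False, domain_trust=0.8))
```

The point of the sketch is the size of the output: one short dictionary per query and ad pair, each entry with an obvious business meaning.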

Of course, the best solution is to blend features like mine with the top features detected by an automated machine learning algorithm and by analysts.

3. Where smart statistics help

I have developed hidden decision trees to solve this type of problem, precisely after noticing the very high sparsity of the feature space. Do we need logistic regression with a gradient algorithm? Do we really need an exact solution when the data itself is very messy? I bet you can make great predictions using only 20 carefully selected features, and that's where the data scientist can also help: applying statistical knowledge to create a system that runs 1,000 times faster, uses far fewer computer resources, and provides similar or better results. You don't even need to use standard techniques such as (robust) logistic regression. I've been working with model-free statistics for a long time, with great satisfaction, and yes, I also computed model-free confidence intervals.
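To give a flavor of what a model-free predictor on a handful of features can look like, here is a generic look-up table sketch. It is not the hidden decision trees method itself, only a simple stand-in: each combination of feature values gets its own observed click rate, and combinations with too few observations fall back to the global rate. The class name, the min_count threshold and the training rows are all hypothetical.

```python
from collections import defaultdict


class LookupCtrModel:
    """Predict click rate from a table keyed by the combination of feature values."""

    def __init__(self, min_count: int = 30):
        self.min_count = min_count
        self.stats = defaultdict(lambda: [0, 0])  # key -> [clicks, impressions]
        self.total = [0, 0]                       # global [clicks, impressions]

    @staticmethod
    def _key(features: dict) -> tuple:
        return tuple(sorted(features.items()))

    def fit(self, rows):
        """rows: iterable of (feature dict, clicked flag) pairs."""
        for features, clicked in rows:
            cell = self.stats[self._key(features)]
            cell[0] += int(clicked)
            cell[1] += 1
            self.total[0] += int(clicked)
            self.total[1] += 1
        return self

    def predict(self, features: dict) -> float:
        clicks, n = self.stats.get(self._key(features), (0, 0))
        if n >= self.min_count:
            return clicks / n
        # Sparse or unseen combination: fall back to the global click rate.
        return self.total[0] / self.total[1] if self.total[1] else 0.0


# Made-up training data: 83 impressions over three feature combinations.
rows = (
    [({"category_match": True, "first_view": True}, i < 20) for i in range(40)]
    + [({"category_match": False, "first_view": True}, i < 2) for i in range(40)]
    + [({"category_match": True, "first_view": False}, True) for _ in range(3)]
)
model = LookupCtrModel(min_count=30).fit(rows)
print(model.predict({"category_match": True, "first_view": True}))
print(model.predict({"category_match": True, "first_view": False}))
```

Training is a single pass of counting and prediction is one dictionary look-up, with no regression coefficients to optimize. That is the kind of trade-off the paragraph above argues for when the data is messy.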

Another area where statistics can help - if you really like working with billions of features - is in identifying features with predictive power. I'm sure that most of the billions of features used by Google have no predictive power; in fact, predictive power is never discussed in their article. Sometimes two features have no predictive power individually, but they do when combined. For example, country (US vs. UK) and time of day have far greater predictive power when combined. Statistical science can help define predictive power, and assess when it is significant. Finally, if you have billions of features, you will necessarily find features that seem to have predictive power, but actually don't. Worse: these spurious features might overshadow the ones that truly have predictive power, making your system prone to systemic errors, and resulting in chaotic predictions. The reason (and the fix) is explained in detail in my article The curse of big data. It is an issue that should be addressed using ad-hoc statistical analyses, not the kind of stats currently taught in university curricula.
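The country and time-of-day example can be made concrete with a deliberately extreme, made-up table of counts. The numbers below are invented for illustration and are not real traffic data; "daypart" is assumed to be measured on a single shared clock, so that daytime for one country is night for the other.

```python
from collections import defaultdict

# Synthetic counts: (country, daypart, impressions, clicks).
COUNTS = [
    ("US", "day", 1000, 100),
    ("US", "night", 1000, 20),
    ("UK", "day", 1000, 20),
    ("UK", "night", 1000, 100),
]


def ctr_by(key_fn) -> dict:
    """Click rate per group, where key_fn maps (country, daypart) to a group key."""
    agg = defaultdict(lambda: [0, 0])  # group -> [clicks, impressions]
    for country, daypart, impressions, clicks in COUNTS:
        cell = agg[key_fn(country, daypart)]
        cell[0] += clicks
        cell[1] += impressions
    return {k: c / n for k, (c, n) in agg.items()}


def spread(ctrs: dict) -> float:
    """Gap between the best and worst group: a crude measure of predictive power."""
    return max(ctrs.values()) - min(ctrs.values())


by_country = ctr_by(lambda country, daypart: country)
by_daypart = ctr_by(lambda country, daypart: daypart)
by_both = ctr_by(lambda country, daypart: (country, daypart))

print("country only:", by_country, "spread", spread(by_country))
print("daypart only:", by_daypart, "spread", spread(by_daypart))
print("combined:    ", by_both, "spread", spread(by_both))
```

In this toy table each feature taken alone shows the same click rate in every group, so it looks useless, while the combination separates high and low click rates cleanly. Real data will be less clear-cut, and with billions of features the spread would also need a significance test, since some of it will be there by chance alone.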


Comment by Mohammad Yusuf Ghazi on August 30, 2015 at 7:02pm

Great article. Thanks.

Comment by Boris Shmagin on September 10, 2013 at 11:27am

It is a nice proposition, like a new tune on an existing theme. Thanks.

Comment by Nitin Supekar on September 9, 2013 at 8:28pm

It would be interesting to see the F-Score, Precision/Range and other parameters of their algorithm. I agree with your point that beyond a certain point (convergence), more data or more features do not help much. I am sure they must have checked it.

Comment by Vincent Granville on September 7, 2013 at 6:59am

Here's another example. A company has a sophisticated segmentation model, based on statistical profiling of subscribers, and thanks to this model, they get a 15% boost in ROI when sending their newsletter. But because of spam filtering, their bulk email does not go through Gmail. If they fixed this Gmail issue, the boost in ROI would be 100%. Detecting and fixing this issue is the work of a domain expert, not a statistician. It also involves customer profiling via segmentation, but a different type of segmentation, by type of ISP: Gmail, Yahoo mail, Hotmail, corporate email address, other personal email address, etc.

Comment by Vincent Granville on September 6, 2013 at 9:28am

@Randy: One thing that I am concerned about with Google's "nuclear approach" (billions of features) is that they are going to find many correlations that are spurious. The impact of these spurious correlations on the predictive algorithm should be assessed. Or there should be a mechanism to handle and isolate them.

Note that my approach also involves large look-up tables, big keyword lists etc. But these are not features per se.

Comment by Randy Bartlett on September 6, 2013 at 7:26am

Great stuff, as usual! Maybe what Google is doing makes sense for Google. If your company has a dearth of talent with the skills to construct 'fly swatters' and a surplus of people who can build nukes, then nukes are cheaper. Perhaps one risk is that nuke-builders might take their eye off their fly.

Comment by Vincent Granville on September 5, 2013 at 9:55pm

@Kaustubh: Yes a domain expert can easily find patterns and summarize them in a few rules and features:

  • time-to-load
  • patterns found in domain name (xxx-and-yyy.net)
  • size of the landing page in KB (sometimes they all have the same size)
  • recency of domain (when it was created)
  • traffic associated with domain based on 3rd party data providers (e.g. Compete, Quantcast)
  • specific keywords or JS tags found on landing page
  • info about domain owner (does he also own other domains, known to be bad?)
  • is the page different when you crawl it a second time?
  • IP addresses hitting domain clusters / sub-affiliates (are these IP addresses not hitting other domains?)
  • IP address topology

Comment by kaustubh on September 5, 2013 at 5:38am

There was a company that generated landing pages and made money on click-throughs. They did not care (much) about lift or conversions, just about the ability to generate clicks on the landing page you reach when you type a wrong URL (goooogr.com when you wanted to type google.com), etc. There was good money to be made (0.5 million/day) per version of the algorithm, on about 3 million URLs.

There is no domain expert who can make an informed decision on 3 million landing pages in real time, on topics that may be trending in real time, like Miley Cyrus the Disney princess or Miley Cyrus the twerker.

So that's why they need the nuclear option: it's just too much easy money, and it cannot be done in real time by puny humans.

Comment by Vundemodalu Manjush on September 5, 2013 at 12:57am

Wow, nice article. How did you implement this?
