I was reading an article by a team of Google scientists - surely experts from top universities - about how to predict ad clicks based on the user query and the text of the ad. The article focuses on the very large number of features in the model (billions), the use of logistic regression (a statistical technique), and optimization techniques (numerical analysis, gradient methods) to fit the logistic regression (that is, find the optimal regression coefficients). As you would expect, they discuss at length how the feature space is sparse, and how to take advantage of sparsity to design an efficient algorithm.
All of this looks great, and it is certainly a textbook example of a correct, good, interesting application of machine learning. Indeed, in my opinion, this is computer science.
I have two criticisms, and in pretty much all the "pure science" work I have read, the criticism is identical. It boils down to (1) using a nuclear weapon to kill a fly, and not realizing it, and (2) not showing the lift over a basic methodology designed by domain experts - in this case, experts with deep expertise in ad technology, business management and statistics simultaneously. While Google is doing better than many companies with their algorithms, I think they could do even better with less effort, by focusing less on science and more on domain expertise.
This is precisely the gap that data science is trying to bridge.
Let me illustrate my point with this ad prediction technique developed by Google.
1. What are we trying to accomplish?
That's the first question all data scientists should ask. Here, maybe Google is trying to maximize the number of clicks delivered to advertisers, to boost its revenue. Maybe the research paper is meant for internal use at Google. Or maybe the purpose is to help advertisers, who always want more clicks - as long as those clicks convert to sales.
2. Do we need a nuclear weapon to kill a fly?
Using billions of features, most of them almost never triggered, makes no sense. How do you handle co-dependencies among these features, and what statistical significance do you get from the 99.9% of features that are triggered no more than 3 times in 500 billion observations (clicks)? Sure, you could do some feature blending and clustering - a rather expensive technique, computationally speaking - but this issue of feature aggregation was not even discussed in their paper.
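To see why a feature triggered only a handful of times carries no statistical significance, look at the confidence interval around a click-through rate estimated from just 3 impressions. The sketch below uses the standard Wilson score interval; the numbers are illustrative, not taken from the Google paper.

```python
import math

def wilson_interval(clicks, impressions, z=1.96):
    """95% Wilson score interval for an observed click-through rate."""
    if impressions == 0:
        return (0.0, 1.0)
    p = clicks / impressions
    denom = 1 + z**2 / impressions
    center = (p + z**2 / (2 * impressions)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / impressions + z**2 / (4 * impressions**2)
    )
    return (max(0.0, center - half), min(1.0, center + half))

# A rare feature triggered only 3 times, with 1 click observed:
lo, hi = wilson_interval(clicks=1, impressions=3)
print(f"Observed CTR 0.33, but the 95% interval is [{lo:.2f}, {hi:.2f}]")

# The same 33% rate observed over 30,000 impressions:
lo2, hi2 = wilson_interval(clicks=10_000, impressions=30_000)
print(f"Same rate, 95% interval [{lo2:.4f}, {hi2:.4f}]")
```

The rare feature's interval spans most of [0, 1] - in other words, its estimated click rate is pure noise - while the well-supported feature pins the rate down to within a fraction of a percent.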
Also, the vast majority of these features are probably created automatically, through a feature-generation algorithm. This is by far the most intriguing component of their system - but it is not discussed in the paper. It is a combinatorial optimization problem: looking at all relationships (ratios, products, log transformations and other mappings such as IP category) among a set of base metrics, such as log-file fields, to discover features with predictive power. Some features are also created in bulk by analysts looking at data. This set of billions of features could very well be missing the 2 or 3 core (but non-obvious) features that would make the algorithm far superior. Google does not mention any of the features used in their algorithm in the paper in question.
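The combinatorial flavor of such feature generation can be sketched in a few lines: take a handful of base metrics and mechanically derive logs, products and ratios from every pair. The field names below are hypothetical examples, not features from the Google paper.

```python
import math
from itertools import combinations

def generate_features(record):
    """Hypothetical sketch of automated feature generation: derive log
    transforms, products and ratios from a record's numeric base metrics.
    Real systems would also prune, cluster and recurse on the results."""
    base = {k: v for k, v in record.items() if isinstance(v, (int, float))}
    derived = {}
    for name, value in base.items():
        if value > 0:
            derived[f"log({name})"] = math.log(value)
    for (a, va), (b, vb) in combinations(base.items(), 2):
        derived[f"{a}*{b}"] = va * vb
        if vb != 0:
            derived[f"{a}/{b}"] = va / vb
    return derived

# Three illustrative base metrics already yield nine derived features;
# the count explodes combinatorially as base metrics are added.
sample = {"impressions": 1200, "clicks": 30, "dwell_seconds": 45}
features = generate_features(sample)
print(len(features), "derived features from", len(sample), "base metrics")
```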
I believe that you can solve this ad click prediction problem with just a couple of features (a feature is a variable) carefully selected by a domain expert. Here are the ones I would choose; I believe they are unlikely to be created by an automated feature-generation algorithm.
My recommended features to predict ad clicks
Of course, the best solution is to blend features like mine with the top features detected by an automated machine learning algorithm and by analysts.
3. Where smart statistics help
I have developed hidden decision trees to solve this type of problem, precisely after noticing the very high sparsity of the feature space. Do we need logistic regression with a gradient algorithm? Do we really need an exact solution when the data itself is very messy? I bet you can make great predictions using only 20 carefully selected features, and that's where the data scientist can also help: applying statistical knowledge to create a system that runs 1,000 times faster, uses far fewer computing resources, and provides similar or better results. You don't even need standard techniques such as (robust) logistic regression. I have been working with model-free statistics for a long time, with great satisfaction - and yes, I also computed model-free confidence intervals.
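To make the sparsity argument concrete, here is a minimal lookup-style predictor in the spirit of that approach - group observations by a small combination of feature values and predict the empirical click rate per bucket, falling back to the global rate when a bucket is too sparse. This is my own illustration of the general idea, not the actual hidden decision trees methodology; the feature names and support threshold are hypothetical.

```python
from collections import defaultdict

class BucketPredictor:
    """Sketch of a lookup-based click predictor: each distinct combination
    of feature values is a bucket; sparse buckets fall back to the global
    click rate. No gradient descent, no model fitting."""

    def __init__(self, min_support=50):
        self.min_support = min_support  # hypothetical threshold
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)
        self.total_clicks = 0
        self.total_views = 0

    def fit(self, rows):
        # rows: iterable of (feature_dict, clicked) pairs
        for features, clicked in rows:
            key = tuple(sorted(features.items()))
            self.views[key] += 1
            self.clicks[key] += clicked
            self.total_views += 1
            self.total_clicks += clicked

    def predict(self, features):
        key = tuple(sorted(features.items()))
        if self.views[key] < self.min_support:
            # Sparse or unseen bucket: use the global click rate.
            return self.total_clicks / max(self.total_views, 1)
        return self.clicks[key] / self.views[key]

# Illustrative usage with made-up features:
rows = ([({"country": "US", "hour": "evening"}, 1)] * 60
        + [({"country": "US", "hour": "evening"}, 0)] * 40)
model = BucketPredictor(min_support=50)
model.fit(rows)
print(model.predict({"country": "US", "hour": "evening"}))  # 0.6
```

Training and scoring are hash-table lookups, which is where the claimed speed advantage over iterative optimization would come from.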
Another area where statistics can help - if you really like working with billions of features - is in identifying features with predictive power. I'm sure that most of the billions of features used by Google have no predictive power; in fact, predictive power is never discussed in their article. Sometimes two features have no predictive power on their own, but combined they do. For example, country (US vs. UK) and time of day have far greater predictive power when combined. Statistical science can help define predictive power, and assess when it is significant.

Finally, if you have billions of features, you will necessarily find features that seem to have predictive power but actually don't. Worse: these spurious features might overshadow the ones that truly have predictive power, making your system prone to systematic errors, and resulting in chaotic predictions. The reason (and the fix) is explained in detail in my article on the curse of big data. It is an issue that should be addressed using ad-hoc statistical analyses, not the kind of statistics currently taught in university curricula.
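This multiple-testing effect is easy to demonstrate: generate features that are pure noise, and with enough of them, some will look strongly predictive by chance. A minimal simulation (all parameters are illustrative):

```python
import random

random.seed(0)
n_obs, n_features = 2000, 1000
# Random labels: a 10% base click rate, independent of everything.
labels = [random.random() < 0.1 for _ in range(n_obs)]
base_rate = sum(labels) / n_obs

best_lift = 0.0
for _ in range(n_features):
    # Each feature is pure noise, triggered on about 1% of observations.
    feature = [random.random() < 0.01 for _ in range(n_obs)]
    triggered = [lab for lab, f in zip(labels, feature) if f]
    if triggered:
        rate = sum(triggered) / len(triggered)
        best_lift = max(best_lift, rate / base_rate)

print(f"Base click rate: {base_rate:.3f}")
print(f"Best apparent lift among {n_features} noise features: {best_lift:.1f}x")
```

The "best" feature appears to multiply the click rate several times over, yet by construction it carries zero information - exactly the kind of spurious winner that can crowd out genuinely predictive features when you screen billions of candidates.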