I was reading an article by a team of Google scientists - surely experts from top universities - about how to predict ad clicks based on the user query and the text of the ad. The article focuses on the very large number of features in the model (billions), the use of logistic regression (a statistical technique), and optimization techniques (numerical analysis, gradient methods) to fit the logistic regression (that is, find the optimal regression coefficients). As you would expect, they discuss at length how the feature space is sparse, and how to take advantage of sparsity to design an efficient algorithm.
All of this looks great, and it is certainly a textbook example of a correct, good, interesting application of machine learning. Indeed, in my opinion, this is computer science.
I have two criticisms, and in pretty much all the "pure science" work I have read, the criticism is identical. It boils down to (1) using a nuclear weapon to kill a fly, and not realizing it, and (2) not showing the lift over a basic methodology designed by domain experts - in this case, experts with deep expertise in ad technology, business management and statistics simultaneously. While Google is doing better than many companies with their algorithms, I think they could do even better with less effort, by focusing less on science and more on domain expertise.
This is precisely the gap that data science is trying to bridge.
Let me illustrate my point with this ad prediction technique developed by Google.
1. What are we trying to accomplish?
That's the first question all data scientists should ask. Here, maybe Google is trying to maximize the number of clicks delivered to advertisers, to boost its revenue. Maybe the research paper is meant for internal use at Google. Or maybe the purpose is to help advertisers, who always want more clicks - as long as those clicks convert to sales.
2. Do we need a nuclear weapon to kill a fly?
Using billions of features, most of them almost never triggered, makes no sense. How do you handle co-dependencies among these features, and what statistical significance do you get from the 99.9% of features that are triggered no more than 3 times in 500 billion observations (clicks)? Sure, you could do some feature blending and clustering - a rather expensive technique, computationally speaking - but this issue of feature aggregation was not even discussed in their paper.
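To see why a feature triggered only a handful of times carries no statistical significance, look at the confidence interval around a click-through rate estimated from just 3 impressions. The sketch below uses the standard Wilson score interval; the numbers are illustrative, not taken from the Google paper.

```python
import math

def wilson_interval(clicks, impressions, z=1.96):
    """95% Wilson score interval for an observed click-through rate."""
    if impressions == 0:
        return (0.0, 1.0)
    p = clicks / impressions
    denom = 1 + z**2 / impressions
    center = (p + z**2 / (2 * impressions)) / denom
    half = (z / denom) * math.sqrt(
        p * (1 - p) / impressions + z**2 / (4 * impressions**2)
    )
    return (max(0.0, center - half), min(1.0, center + half))

# A rare feature triggered only 3 times, with 1 click observed:
lo, hi = wilson_interval(clicks=1, impressions=3)
print(f"Observed CTR 0.33, but the 95% interval is [{lo:.2f}, {hi:.2f}]")

# The same 33% rate observed over 30,000 impressions:
lo2, hi2 = wilson_interval(clicks=10_000, impressions=30_000)
print(f"Same rate, 95% interval [{lo2:.4f}, {hi2:.4f}]")
```

The rare feature's interval spans most of [0, 1] - in other words, its estimated click rate is pure noise - while the well-supported feature pins the rate down to within a fraction of a percent.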
Also, the vast majority of these features are probably created automatically, through a feature-generation algorithm. This is by far the most intriguing component of their system - but it is not discussed in the paper. It is a combinatorial optimization problem: looking at all relationships (ratios, products, log transformations and other mappings such as IP category) among a set of base metrics, such as log-file fields, to discover features with predictive power. Some features are also created in bulk by analysts looking at data. This set of billions of features could very well be missing the 2 or 3 core (but non-obvious) features that would make the algorithm far superior. Google does not mention any of the features used in their algorithm in the paper in question.
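The combinatorial flavor of such feature generation can be sketched in a few lines: take a handful of base metrics and mechanically derive logs, products and ratios from every pair. The field names below are hypothetical examples, not features from the Google paper.

```python
import math
from itertools import combinations

def generate_features(record):
    """Hypothetical sketch of automated feature generation: derive log
    transforms, products and ratios from a record's numeric base metrics.
    Real systems would also prune, cluster and recurse on the results."""
    base = {k: v for k, v in record.items() if isinstance(v, (int, float))}
    derived = {}
    for name, value in base.items():
        if value > 0:
            derived[f"log({name})"] = math.log(value)
    for (a, va), (b, vb) in combinations(base.items(), 2):
        derived[f"{a}*{b}"] = va * vb
        if vb != 0:
            derived[f"{a}/{b}"] = va / vb
    return derived

# Three illustrative base metrics already yield nine derived features;
# the count explodes combinatorially as base metrics are added.
sample = {"impressions": 1200, "clicks": 30, "dwell_seconds": 45}
features = generate_features(sample)
print(len(features), "derived features from", len(sample), "base metrics")
```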
I believe that you can solve this ad click prediction problem with just a couple of features (a feature is a variable) carefully selected by a domain expert. Here are the ones I would choose; I believe they are unlikely to be created by an automated feature-generation algorithm.
My recommended features to predict ad clicks
Of course, the best solution is to blend features like mine with the top features detected by an automated machine learning algorithm and by analysts.
3. Where smart statistics help
I have developed hidden decision trees to solve this type of problem, precisely after noticing the very high sparsity of the feature space. Do we need logistic regression with a gradient algorithm? Do we really need an exact solution when the data itself is very messy? I bet you can make great predictions using only 20 carefully selected features, and that's where the data scientist can also help: applying statistical knowledge to create a system that runs 1,000 times faster, uses far fewer computing resources, and provides similar or better results. You don't even need standard techniques such as (robust) logistic regression. I have been working with model-free statistics for a long time, with great satisfaction - and yes, I also computed model-free confidence intervals.
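To make the sparsity argument concrete, here is a minimal lookup-style predictor in the spirit of that approach - group observations by a small combination of feature values and predict the empirical click rate per bucket, falling back to the global rate when a bucket is too sparse. This is my own illustration of the general idea, not the actual hidden decision trees methodology; the feature names and support threshold are hypothetical.

```python
from collections import defaultdict

class BucketPredictor:
    """Sketch of a lookup-based click predictor: each distinct combination
    of feature values is a bucket; sparse buckets fall back to the global
    click rate. No gradient descent, no model fitting."""

    def __init__(self, min_support=50):
        self.min_support = min_support  # hypothetical threshold
        self.clicks = defaultdict(int)
        self.views = defaultdict(int)
        self.total_clicks = 0
        self.total_views = 0

    def fit(self, rows):
        # rows: iterable of (feature_dict, clicked) pairs
        for features, clicked in rows:
            key = tuple(sorted(features.items()))
            self.views[key] += 1
            self.clicks[key] += clicked
            self.total_views += 1
            self.total_clicks += clicked

    def predict(self, features):
        key = tuple(sorted(features.items()))
        if self.views[key] < self.min_support:
            # Sparse or unseen bucket: use the global click rate.
            return self.total_clicks / max(self.total_views, 1)
        return self.clicks[key] / self.views[key]

# Illustrative usage with made-up features:
rows = ([({"country": "US", "hour": "evening"}, 1)] * 60
        + [({"country": "US", "hour": "evening"}, 0)] * 40)
model = BucketPredictor(min_support=50)
model.fit(rows)
print(model.predict({"country": "US", "hour": "evening"}))  # 0.6
```

Training and scoring are hash-table lookups, which is where the claimed speed advantage over iterative optimization would come from.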
Another area where statistics can help - if you really like working with billions of features - is in identifying features with predictive power. I'm sure that most of the billions of features used by Google have no predictive power; in fact, predictive power is never discussed in their article. Sometimes two features have no predictive power on their own, but combined they do. For example, country (US vs. UK) and time of day have far greater predictive power when combined. Statistical science can help define predictive power, and assess when it is significant.

Finally, if you have billions of features, you will necessarily find features that seem to have predictive power but actually don't. Worse: these spurious features might overshadow the ones that truly have predictive power, making your system prone to systematic errors, and resulting in chaotic predictions. The reason (and the fix) is explained in detail in my article on the curse of big data. It is an issue that should be addressed using ad-hoc statistical analyses, not the kind of statistics currently taught in university curricula.
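This multiple-testing effect is easy to demonstrate: generate features that are pure noise, and with enough of them, some will look strongly predictive by chance. A minimal simulation (all parameters are illustrative):

```python
import random

random.seed(0)
n_obs, n_features = 2000, 1000
# Random labels: a 10% base click rate, independent of everything.
labels = [random.random() < 0.1 for _ in range(n_obs)]
base_rate = sum(labels) / n_obs

best_lift = 0.0
for _ in range(n_features):
    # Each feature is pure noise, triggered on about 1% of observations.
    feature = [random.random() < 0.01 for _ in range(n_obs)]
    triggered = [lab for lab, f in zip(labels, feature) if f]
    if triggered:
        rate = sum(triggered) / len(triggered)
        best_lift = max(best_lift, rate / base_rate)

print(f"Base click rate: {base_rate:.3f}")
print(f"Best apparent lift among {n_features} noise features: {best_lift:.1f}x")
```

The "best" feature appears to multiply the click rate several times over, yet by construction it carries zero information - exactly the kind of spurious winner that can crowd out genuinely predictive features when you screen billions of candidates.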