Interesting article posted by John Langford. John Langford is a machine learning research scientist, a field which he says "is shifting from an academic discipline to an industrial tool". He is the author of the blog hunch.net. John works at Microsoft Research, and was previously affiliated with Yahoo Research, Toyota Technological Institute, and IBM's Watson Research Center. He studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor's degree in 1997, and received his PhD in Computer Science from Carnegie Mellon University in 2002.
Here I have included excerpts of his article, with a link to the full version at the bottom.
Attempts to abstract and study machine learning are within some given framework or mathematical model. It turns out that all of these models are significantly flawed for the purpose of studying machine learning. I've created a table (below) outlining the major flaws in some common models of machine learning.
The point here is not simply "woe unto us". There are several implications which seem important.
Here is a summary of what is wrong with various frameworks for learning. To avoid being entirely negative, I added a column about what's right as well.
Bayesian Learning
Methodology: You specify a prior probability distribution over data-makers, P(datamaker), then use Bayes' law to find a posterior P(datamaker|x). True Bayesians integrate over the posterior to make predictions, while many simply use the world with the largest posterior directly.
What is wrong:
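As an illustration of the Bayesian methodology described above (not part of the original article), here is a minimal sketch assuming a toy setting where the data-makers are three hypothetical coins with known biases and a uniform prior. It contrasts integrating over the posterior with using only the single most probable data-maker. All names and numbers are illustrative assumptions.

```python
# Hypothetical data-makers: coins with different probabilities of heads.
biases = [0.1, 0.5, 0.9]
prior = [1 / 3, 1 / 3, 1 / 3]  # assumed uniform prior P(datamaker)


def posterior(prior, biases, flips):
    """Bayes' law: P(datamaker | x) is proportional to P(x | datamaker) * P(datamaker)."""
    unnormalized = []
    for p, b in zip(prior, biases):
        likelihood = 1.0
        for flip in flips:  # flips are 1 (heads) or 0 (tails), assumed IID
            likelihood *= b if flip == 1 else (1 - b)
        unnormalized.append(p * likelihood)
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def predict_full_bayes(post, biases):
    """Integrate over the posterior: P(next flip is heads | x)."""
    return sum(w * b for w, b in zip(post, biases))


def predict_map(post, biases):
    """Use only the data-maker with the largest posterior."""
    best = max(range(len(post)), key=lambda i: post[i])
    return biases[best]


flips = [1, 1, 0]
post = posterior(prior, biases, flips)
full_bayes = predict_full_bayes(post, biases)
map_estimate = predict_map(post, biases)
print(post, full_bayes, map_estimate)
```

The two predictions differ here because the posterior still puts noticeable weight on more than one coin; the shortcut of using the largest-posterior world discards that information.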
Graphical/generative Models
Methodology: Sometimes Bayesian and sometimes not. Data-makers are typically assumed to be IID samples of fixed or varying length data. Data-makers are represented graphically with conditional independencies encoded in the graph. For some graphs, fast algorithms for making (or approximately making) predictions exist.
What is wrong:
Convex Loss Optimization
Methodology: Specify a loss function related to the world-imposed loss function which is convex on some parametric predictive system. Optimize the parametric predictive system to find the global optimum.
What is wrong:
Gradient Descent
Methodology: Specify an architecture with free parameters and use gradient descent with respect to data to tune the parameters.
What is wrong:
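To make the gradient descent methodology above concrete, here is a small sketch that is not from the original article. It assumes the simplest possible "architecture", a line w*x + b with two free parameters, and a mean squared error loss. The data, learning rate, and step count are illustrative assumptions.

```python
def predict(params, x):
    """The 'architecture': a line with free parameters (w, b)."""
    w, b = params
    return w * x + b


def gradient(params, data):
    """Gradient of the mean squared error with respect to (w, b)."""
    n = len(data)
    grad_w = sum(2 * (predict(params, x) - y) * x for x, y in data) / n
    grad_b = sum(2 * (predict(params, x) - y) for x, y in data) / n
    return grad_w, grad_b


def gradient_descent(data, learning_rate=0.05, steps=2000):
    """Repeatedly step the parameters against the gradient of the loss."""
    params = (0.0, 0.0)
    for _ in range(steps):
        grad_w, grad_b = gradient(params, data)
        params = (params[0] - learning_rate * grad_w,
                  params[1] - learning_rate * grad_b)
    return params


# Toy data generated from y = 2x + 1 (an assumption for the example).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = gradient_descent(data)
print(w, b)
```

This particular loss happens to be convex, so the descent reaches the global optimum; with a richer architecture the same loop runs unchanged but offers no such guarantee.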
Kernel-based learning
Methodology: You choose a kernel K(x,x') between datapoints that satisfies certain conditions, and then use it as a measure of similarity when learning.
What is wrong: Specification of the kernel is not easy for some applications (this is another example of prior elicitation). O(n^2) is not efficient enough when there is much data.
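The O(n^2) cost mentioned above comes from evaluating the kernel on every pair of datapoints. The sketch below is an illustration rather than anything from the original article. It assumes a Gaussian (RBF) kernel, one common choice that is symmetric and positive semidefinite, and a hypothetical `gamma` width parameter.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def gram_matrix(points, kernel):
    """All pairwise similarities: n * n kernel evaluations for n points."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # toy datapoints
K = gram_matrix(points, rbf_kernel)
for row in K:
    print(row)
```

Doubling the number of datapoints quadruples the number of entries in `K`, which is where the scaling complaint comes from.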
Boosting
Methodology: You create a learning algorithm that may be imperfect but which has some predictive edge, then apply it repeatedly in various ways to make a final predictor.
What is wrong: The boosting framework tells you nothing about how to build that initial algorithm. The weak learning assumption becomes violated at some point in the iterative process.
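One well-known instance of the boosting idea is AdaBoost. The sketch below is an illustration, not something taken from the original article. It assumes one-dimensional inputs, labels in {+1, -1}, and decision stumps as the imperfect learner; the toy dataset is chosen so that no single stump fits it.

```python
import math


def stump_predict(stump, x):
    """A decision stump: predict `sign` when x < threshold, else -sign."""
    threshold, sign = stump
    return sign if x < threshold else -sign


def best_stump(xs, ys, weights):
    """The weak learner: pick the stump with the lowest weighted error."""
    best, best_error = None, float("inf")
    for threshold in sorted(set(xs)) + [max(xs) + 1]:
        for sign in (1, -1):
            stump = (threshold, sign)
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(stump, x) != y)
            if error < best_error:
                best, best_error = stump, error
    return best, best_error


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, stump) pairs
    for _ in range(rounds):
        stump, error = best_stump(xs, ys, weights)
        error = min(max(error, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, stump))
        # Up-weight the examples this stump got wrong, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def ensemble_predict(ensemble, x):
    """Final predictor: sign of the alpha-weighted vote of the stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, -1, -1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [ensemble_predict(ensemble, x) for x in xs]
print(predictions)
```

Note that `best_stump` is supplied from outside the boosting loop: the framework combines weak predictors but, as the entry above says, gives no advice on how to design them.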
Online Learning with Experts
Methodology: You make many base predictors and then a master algorithm automatically switches between the use of these predictors so as to minimize regret.
What is wrong: Computational intractability can be a problem. This approach lives and dies on the effectiveness of the experts, but it provides little or no guidance in their construction.
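A standard master algorithm of this kind is the exponentially weighted average forecaster (often called Hedge). The sketch below is an illustration, not from the original article. It assumes predictions and outcomes in [0, 1], absolute loss, and a hypothetical learning rate `eta`; the three experts are deliberately crude.

```python
import math


def hedge(expert_predictions, outcomes, eta=0.5):
    """Exponential-weights master algorithm.

    expert_predictions[t][i] is expert i's prediction in [0, 1] at round t.
    Returns the master's cumulative loss and each expert's cumulative loss.
    """
    num_experts = len(expert_predictions[0])
    weights = [1.0] * num_experts
    master_loss = 0.0
    expert_losses = [0.0] * num_experts
    for preds, y in zip(expert_predictions, outcomes):
        total = sum(weights)
        master_pred = sum(w * p for w, p in zip(weights, preds)) / total
        master_loss += abs(master_pred - y)
        for i, p in enumerate(preds):
            loss = abs(p - y)
            expert_losses[i] += loss
            weights[i] *= math.exp(-eta * loss)  # shrink weight of bad experts
    return master_loss, expert_losses


outcomes = [1, 1, 0, 1] * 5
expert_predictions = [
    [1.0, 0.0, 1.0 if t % 2 == 0 else 0.0]  # always 1, always 0, alternating
    for t in range(len(outcomes))
]
master_loss, expert_losses = hedge(expert_predictions, outcomes)
regret = master_loss - min(expert_losses)
print(master_loss, expert_losses, regret)
```

The regret is measured against the best expert in hindsight, so a small regret only means the master did about as well as the experts allowed. If every expert is poor, so is the master.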
Learning Reductions
Methodology: You solve complex machine learning problems by reducing them to well-studied base problems in a robust manner.
What is wrong: The existence of an algorithm satisfying reduction guarantees is not sufficient to guarantee success. Reductions tell you little or nothing about the design of the base learning algorithm.
PAC Learning
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as finding a near-best hypothesis amongst a given set of hypotheses in a computationally tractable manner.
What is right: The focus on computation is pretty right-headed, because we are ultimately limited by what we can compute.
What is wrong: There are not many substantial positive results, particularly when D is noisy. Data isn’t IID in practice anyways.
Statistical Learning Theory
Methodology: You assume that samples are drawn IID from an unknown distribution D. You think of learning as figuring out the number of samples required to distinguish a near-best hypothesis from a set of hypotheses.
What is wrong: The data is not IID. Ignorance of computational difficulties often results in difficulty of application. More importantly, the bounds are often loose (sometimes to the point of vacuousness).
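To give a feel for the sample counts and the looseness mentioned above, here is a sketch that is not from the original article. It uses the textbook bound for a finite hypothesis set H, obtained from Hoeffding's inequality plus a union bound: with probability at least 1 - delta, every hypothesis has empirical error within sqrt(ln(2|H|/delta) / (2m)) of its true error on m IID samples. The function names and example numbers are illustrative assumptions.

```python
import math


def uniform_deviation(log_num_hypotheses, num_samples, delta):
    """Bound on |empirical error - true error| holding for all hypotheses.

    Takes ln|H| rather than |H| so that very large sets do not overflow.
    """
    return math.sqrt(
        (log_num_hypotheses + math.log(2 / delta)) / (2 * num_samples)
    )


def samples_needed(log_num_hypotheses, epsilon, delta):
    """Smallest m for which the bound above is at most epsilon."""
    return math.ceil(
        (log_num_hypotheses + math.log(2 / delta)) / (2 * epsilon ** 2)
    )


# A small set: 1000 hypotheses, accuracy 0.1, confidence 95%.
small = samples_needed(math.log(1000), epsilon=0.1, delta=0.05)

# A large set: every setting of one million binary parameters,
# evaluated on 10,000 samples.
huge_class = uniform_deviation(1_000_000 * math.log(2), 10_000, delta=0.05)

print(small, huge_class)
```

Error rates lie between 0 and 1, so a deviation bound greater than 1 says nothing at all. That is the sense in which bounds for rich hypothesis sets can be vacuous.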
Decision tree learning
Methodology: Learning is a process of cutting up the input space and assigning predictions to pieces of the space.
What is wrong: There are learning problems which cannot be solved by decision trees, but which are solvable. It’s common to find that other approaches give you a bit more performance. A theoretical grounding for many choices in these algorithms is lacking.
Algorithmic complexity
Methodology: Learning is about finding a program which correctly predicts the outputs given the inputs.
What is wrong: The theory literally suggests solving halting problems to solve machine learning.
RL, MDP learning
Methodology: Learning is about finding and acting according to a near optimal policy in an unknown Markov Decision Process.
What is wrong: Has anyone counted the number of states in real world problems? We can’t afford to wait that long. Discretizing the states creates a POMDP (see below). In the real world, we often have to deal with a POMDP anyways.
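A rough count shows why the number of states is a problem. The sketch below is an illustration with made-up numbers, not from the original article. It assumes each observed variable is discretized into a fixed number of bins, so the state count is bins raised to the number of variables.

```python
def num_discrete_states(bins_per_dimension, num_dimensions):
    """Number of distinct states after discretizing every dimension."""
    return bins_per_dimension ** num_dimensions


# Hypothetical example: 7 joint angles, 10 bins each.
robot_arm = num_discrete_states(10, 7)

# Hypothetical example: an 84 x 84 image with 2 values per pixel.
binary_image = num_discrete_states(2, 84 * 84)

print(robot_arm)
print(binary_image.bit_length())  # size of the count itself, in bits
```

Any algorithm whose running time or sample requirement grows with the number of states is out of reach long before problems get this large.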
RL, POMDP learning
Methodology: Learning is about finding and acting according to a near optimal policy in a Partially Observed Markov Decision Process.
What is wrong: All known algorithms scale badly with the number of hidden states.
This set is incomplete of course, but it forms a starting point for understanding what’s out there. (Please fill in the what/pro/con of anything I missed.)