- What is Kaggle and what new ideas does it bring to the predictive analytics arena?
Kaggle aims to help companies and researchers make predictions more precise by providing a platform for data prediction competitions. Competitions turn out to be a great way to get the most out of a dataset. This is because there are infinitely many approaches to any data modeling problem. By opening up a data prediction problem to a wide audience, a competition makes it possible to get to the frontier of what is possible given a dataset's inherent noise and richness.
- Can you tell us more about "real-time science" and how it could help research globally?
Data modeling competitions can facilitate real-time science. Consider the recent announcement about the discovery of genetic markers that correlate with extreme longevity. Work on the study began in 1995, with results published in 2010. Had the study been run as a data modeling competition, the results would have been generated in real time and insights available much sooner (and with a higher level of precision).
Data modeling competitions also benchmark, in real time, new techniques against old. A technique that performs well in competitions can prove its mettle long before any paper can be published, helping the science to progress more quickly.
Competitions also help to avoid situations where valuable techniques are overlooked by the scientific establishment. This aspect of the case for competitions is neatly illustrated by Ruslan Salakhutdinov, now a postdoctoral fellow at the Massachusetts Institute of Technology, who had a new algorithm rejected by the NIPS conference. According to Ruslan, the reviewer ‘basically said “it’s junk and I am very confident it’s junk”’. It later turned out that his algorithm was good enough to make him an early leader in the Netflix Prize (he called his Netflix Prize team NIPS_reject).
- How can companies benefit through Kaggle?
Companies can use Kaggle to gain an advantage over their competitors. Consider a bank that wants to improve the algorithms that vet loan applicants. If the bank can develop a more effective algorithm, it will have fewer defaults and can charge lower interest rates than its competitors. Kaggle has proven to be an effective way to improve existing models very quickly.
Competitions are also really useful to companies that want to develop new products and capabilities. Consider a hedge fund that wants to be able to generate long-range weather forecasts in key agricultural regions. It can attempt to hire a weather forecasting expert, or it can use Kaggle to throw the problem open to a wide audience. Using Kaggle, it can be sure it will get great results very quickly.
- How is the best model selected?
The competition host will typically split their dataset into two parts - a training dataset and a test dataset. The training dataset includes all explanatory variables as well as the dependent variable (or the answer). The test dataset also includes all the explanatory variables but the dependent variable (or answer) is withheld.
Participants train their models on the training dataset. They then apply their models to generate predictions on the test dataset. Those predictions are then scored on the fly against the withheld answers (using one of several evaluation methods). Once the competition deadline passes, the team that generated the most accurate predictions gives the winning methodology to the competition host in exchange for the prize money.
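To make the mechanics concrete, here is a minimal Python sketch of the process described above. It is an illustration only, not Kaggle's actual scoring code: the function names (`split_dataset`, `rmse`, `leaderboard`), the 30% test fraction, the toy one-variable dataset, and the choice of root-mean-squared error as the evaluation method are all assumptions made for the example.

```python
import random


def split_dataset(rows, test_fraction=0.3, seed=0):
    """Host side: shuffle (x, y) rows and split them into train and test parts."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Participants receive the test features only; the answers are withheld.
    test_features = [x for x, _ in test]
    test_answers = [y for _, y in test]
    return train, test_features, test_answers


def rmse(predictions, answers):
    """One possible evaluation method (assumed here): root-mean-squared error."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    squared = sum((p - a) ** 2 for p, a in zip(predictions, answers))
    return (squared / len(answers)) ** 0.5


def fit_mean_model(train):
    """Hypothetical entry 1: always predict the mean of the training answers."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_linear_model(train):
    """Hypothetical entry 2: one-variable least-squares line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def leaderboard(submissions, answers):
    """Score every submission against the withheld answers; lowest error first."""
    scored = [(team, rmse(preds, answers)) for team, preds in submissions.items()]
    return sorted(scored, key=lambda entry: entry[1])


# Toy data (made up for the example): y depends linearly on x.
rows = [(float(x), 2.0 * x + 1.0) for x in range(20)]
train, test_features, test_answers = split_dataset(rows)

# Each team trains on the training set and predicts on the test features.
submissions = {}
for team, fit in [("mean_team", fit_mean_model), ("linear_team", fit_linear_model)]:
    model = fit(train)
    submissions[team] = [model(x) for x in test_features]

for team, score in leaderboard(submissions, test_answers):
    print(team, round(score, 4))
```

The point of the design is that participants never see `test_answers`: only the host calls `leaderboard`, so a model can only rank well by generalizing from the training data.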