Not long ago, Random Forest was the go-to algorithm for classification problems in most data science competitions. A correctly formulated problem, with smart feature engineering and minimal tuning of the RF parameters (ntree, mtry) using grid search, could get you past the bulk of the crowd.
Then came XGBoost, and it soon became the hot favorite. It isn't hard to say that deep learning is running the show at the moment. However, GPU-powered deep learning frameworks weren't accessible to everyone, and those who could use them were reaping the benefits.
Then H2O arrived, bringing deep learning to R with ease (darch and deepnet were already available in R, though they were not as popular).
1. MNIST digit recognition competition
Here is a demonstration of how deep learning made short work of the classic MNIST dataset, a digit recognition dataset long used in academia and research. A one-liner in R, running a deep learning algorithm with 3 hidden layers of 1024, 1024, and 2048 neurons respectively and a non-linear, differentiable activation function (rectifier with dropout), achieved an error rate of 0.83% on the test data! A world record on the dataset for this kind of model: no distortion, no convolution, no ensemble, no unsupervised learning!
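To make that architecture a bit more concrete, here is a minimal sketch in plain Python. It is not the R one-liner and not H2O's implementation. It assumes 784 input pixels (28x28 images) and 10 output classes, which are standard for MNIST but not stated above; the dropout rate of 0.5, the "inverted dropout" scaling, and the function names are illustrative assumptions too.

```python
import random

# 784 inputs and 10 outputs are assumptions (28x28 MNIST images, 10 digits);
# the three hidden layer sizes come from the text.
LAYER_SIZES = [784, 1024, 1024, 2048, 10]


def count_parameters(sizes):
    """Count weights plus biases of a fully connected network."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def rectifier_with_dropout(values, drop_rate, rng, training=True):
    """Apply the rectifier max(0, v), then zero units at random while training.

    Surviving units are scaled by 1 / (1 - drop_rate) so the expected
    activation stays the same (one common convention, assumed here).
    drop_rate must be below 1.
    """
    activated = [max(0.0, v) for v in values]
    if not training or drop_rate == 0.0:
        return activated
    keep_scale = 1.0 / (1.0 - drop_rate)
    return [a * keep_scale if rng.random() >= drop_rate else 0.0
            for a in activated]


if __name__ == "__main__":
    print("trainable parameters:", count_parameters(LAYER_SIZES))
    rng = random.Random(0)  # fixed seed so the dropped units are repeatable
    print(rectifier_with_dropout([-1.0, 0.5, 2.0, -0.2], drop_rate=0.5, rng=rng))
```

Under those assumptions the network has a little under 4 million trainable parameters, which gives a feel for why dropout is used as a regularizer here.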
2. Airbnb competition (reward: potential interview at Airbnb): ongoing, 10 days left
In this recruiting competition, Airbnb challenges you to predict in which country a new user will make his or her first booking. Kagglers who impress with their answer (and an explanation of how they got there) will be considered for an interview for the opportunity to join Airbnb's Data Science and Analytics team.
A lazily built XGBoost model gets you a score of 0.86 on the leaderboard, while the leading team is at barely 0.88.
Want a better score? If you are not feeling lazy, you've got to do some hyperparameter tuning. A grid search? I know, it is computationally expensive and too slow to wait around for. I cut mine short at 2 minutes, as there was no motivation. Lol
Set up the grid search to tune the hyperparameters.
Set the train control parameter to use k-fold cross-validation.
Train the model with the tuned hyperparameters that gave the best accuracy under cross-validation.
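The three steps above can be sketched in plain Python as follows. This is only an illustration of the idea, not the original tuning code: a toy k-nearest-neighbours classifier stands in for XGBoost, and the synthetic data, the grid values, and every function name are assumptions made up for the example.

```python
import itertools
import random


def make_toy_data(seed, per_class=40):
    """Two Gaussian blobs in 2-D (hypothetical data, for illustration only)."""
    rng = random.Random(seed)
    x = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0))
         for c in (0.0, 2.5) for _ in range(per_class)]
    y = [0] * per_class + [1] * per_class
    return x, y


def k_fold_indices(n_samples, n_folds, rng):
    """Shuffle the indices once and deal them into n_folds folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def knn_predict(train_x, train_y, point, k):
    """Toy stand-in model: majority vote among the k nearest training points."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def cv_accuracy(x, y, folds, k):
    """Mean held-out accuracy over the folds for one hyperparameter setting."""
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(len(x)) if i not in held]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(train_x, train_y, x[i], k) == y[i]
                      for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)


def grid_search(x, y, grid, n_folds=5, seed=0):
    """Score every combination in the grid on the same folds."""
    folds = k_fold_indices(len(x), n_folds, random.Random(seed))
    names = sorted(grid)
    results = {}
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        results[combo] = cv_accuracy(x, y, folds, **params)
    best = max(results, key=results.get)
    return dict(zip(names, best)), results


if __name__ == "__main__":
    x, y = make_toy_data(seed=42)
    # Step 1 and 2: the grid, scored with 5-fold cross-validation.
    best_params, results = grid_search(x, y, {"k": [1, 3, 5, 7]})
    print("cross-validated accuracy per setting:", results)
    print("best setting:", best_params)
    # Step 3: the final model uses all the data with the best setting.
    print("prediction for (2.0, 2.0):", knn_predict(x, y, (2.0, 2.0), **best_params))
```

Every setting is scored on the same folds, so the comparison between grid points is not blurred by a different random split each time.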
Wait, you could even try Optunity to optimize the hyperparameter tuning and achieve even better results with the XGBoost model.
Use grid search to find the max-depth that maximizes AUC-ROC in twice iterated 5-fold cross-validation:
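The original snippet for this step is not shown, so what follows is only a plain-Python sketch of the idea and not Optunity's API. A tiny depth-limited decision tree on a single feature stands in for XGBoost, the data is synthetic, and all function names and the depth grid are assumptions for illustration.

```python
import random


def roc_auc(labels, scores):
    """AUC-ROC: chance that a positive outscores a negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes in the evaluated fold")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gini(part):
    """Gini impurity of a list of 0/1 labels."""
    p = sum(part) / len(part)
    return 2.0 * p * (1.0 - p)


def fit_tree(xs, ys, max_depth):
    """Depth-limited tree on one feature; a leaf stores its positive rate."""
    rate = sum(ys) / len(ys)
    if max_depth == 0 or rate in (0.0, 1.0):
        return rate
    best = None
    for threshold in sorted(set(xs))[1:]:  # both children stay non-empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        cost = len(left) * gini(left) + len(right) * gini(right)
        if best is None or cost < best[0]:
            best = (cost, threshold)
    if best is None:  # all feature values identical, nothing to split on
        return rate
    threshold = best[1]
    left = [(x, y) for x, y in zip(xs, ys) if x < threshold]
    right = [(x, y) for x, y in zip(xs, ys) if x >= threshold]
    return (threshold,
            fit_tree([x for x, _ in left], [y for _, y in left], max_depth - 1),
            fit_tree([x for x, _ in right], [y for _, y in right], max_depth - 1))


def predict_tree(tree, x):
    """Walk down to a leaf and return its positive rate as the score."""
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x < threshold else right
    return tree


def repeated_cv_auc(xs, ys, max_depth, n_folds=5, n_iter=2, seed=0):
    """Mean AUC over n_iter reshuffled rounds of n_folds-fold cross-validation."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_iter):
        idx = list(range(len(xs)))
        rng.shuffle(idx)
        for f in range(n_folds):
            test_idx = idx[f::n_folds]
            held = set(test_idx)
            train_idx = [i for i in idx if i not in held]
            tree = fit_tree([xs[i] for i in train_idx],
                            [ys[i] for i in train_idx], max_depth)
            aucs.append(roc_auc([ys[i] for i in test_idx],
                                [predict_tree(tree, xs[i]) for i in test_idx]))
    return sum(aucs) / len(aucs)


def make_noisy_data(seed, n=120, flip=0.1):
    """Hypothetical data: label is x > 0.5, flipped with probability `flip`."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < flip) for x in xs]
    return xs, ys


if __name__ == "__main__":
    xs, ys = make_noisy_data(seed=7)
    scores = {d: repeated_cv_auc(xs, ys, d) for d in (0, 1, 2, 4, 6)}
    print("mean AUC per max depth:", scores)
    print("best max depth:", max(scores, key=scores.get))
```

Repeating the 5-fold split twice with a fresh shuffle averages out some of the luck of a single split, at the price of twice the training runs per grid point.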
Hopefully it will get you closer to the finish line and, who knows, maybe a call from Airbnb ;)