Dangers of Using RMSE: Netflix Case Study


RMSE, or Root Mean Squared Error, is a widely used metric for evaluating the accuracy of a model's predictions. Mike De Waard has one of the simplest and clearest explanations of how it works:


The Root Mean Squared Error (RMSE or RMSD where D stands for deviation) is the square root of the mean of the squared differences between the actual value and predicted value. As this might be hard to grasp, I'll explain it using an example. Suppose we have the following values:

The mean of these squared differences for the model is 4.33333, and the root of this is 2.081666. So, on average, the model predicts the values with an error of 2.08. The lower this RMSE value is, the better the model is at its predictions. This is why in the field, when selecting features, one computes the RMSE with and without a certain feature, in order to say something about how that feature affects the performance of the model. With this information one can then decide whether the additional computation time for this feature is worth the improvement it brings to the model.
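The with-and-without comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data is synthetic, the "model without the feature" is taken to be a constant prediction (the training mean), and the "model with the feature" is a one-variable least-squares line. The names `fit_line`, `rmse_without` and `rmse_with` are hypothetical and do not come from the quoted explanation.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def fit_line(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic data (an assumption for illustration): y depends on the feature x.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(200)]
y = [1.0 + 2.0 * xi + rng.gauss(0, 1) for xi in x]

# Hold out the last 50 points so both models are scored on unseen data.
x_train, x_test = x[:150], x[150:]
y_train, y_test = y[:150], y[150:]

# Model without the feature: always predict the training mean.
baseline = sum(y_train) / len(y_train)
rmse_without = rmse(y_test, [baseline] * len(y_test))

# Model with the feature: a fitted straight line.
intercept, slope = fit_line(x_train, y_train)
rmse_with = rmse(y_test, [intercept + slope * xi for xi in x_test])

print(f"RMSE without feature: {rmse_without:.3f}")
print(f"RMSE with feature:    {rmse_with:.3f}")
```

The gap between the two numbers is what one would weigh against the extra cost of computing the feature.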

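RMSE can be calculated with the formula below, where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value and $n$ is the number of observations:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$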


Functions and code to calculate RMSE in various programming and statistical languages are given by Fodop here.
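As a minimal sketch of the calculation, here is one way to write it in plain Python. The three actual and predicted values are hypothetical: they are not the values from the quoted example, and were chosen only so that the mean of the squared differences is 13/3 ≈ 4.33333, the same figure quoted above.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean of the squared differences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical values (assumption): the differences are -2, 3 and 0,
# so the squared differences are 4, 9 and 0, with mean 13/3.
actual = [10.0, 12.0, 15.0]
predicted = [12.0, 9.0, 15.0]

print(round(rmse(actual, predicted), 6))  # 2.081666
```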

Netflix learned an expensive lesson about the dangers of optimizing for RMSE alone. The eventual grand prize winners blended hundreds of models to minimize RMSE, producing a solution that Netflix did not put into production:

A year into the competition, the Korbell team won the first Progress Prize with an 8.43% improvement. They reported more than 2000 hours of work in order to come up with the final combination of 107 algorithms that gave them this prize. And, they gave us the source code. We looked at the two underlying algorithms with the best performance in the ensemble: Matrix Factorization (which the community generally called SVD, Singular Value Decomposition) and Restricted Boltzmann Machines (RBM). SVD by itself provided a 0.8914 RMSE, while RBM alone provided a competitive but slightly worse 0.8990 RMSE. A linear blend of these two reduced the error to 0.88. To put these algorithms to use, we had to work to overcome some limitations, for instance that they were built to handle 100 million ratings, instead of the more than 5 billion that we have, and that they were not built to adapt as members added more ratings. But once we overcame those challenges, we put the two algorithms into production, where they are still used as part of our recommendation engine.
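To illustrate why a linear blend of two models can beat either one on RMSE, here is a small Python sketch. It is not the prize-winning method or Netflix's implementation: the "ratings" and both sets of predictions are synthetic, the two predictors are assumed to have independent errors, and the blend weight `w` is fitted in closed form on the same data it is scored on (in practice it would be fitted on held-out data).

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Synthetic "true ratings" and two noisy predictors (assumptions for illustration).
rng = random.Random(42)
truth = [rng.uniform(1, 5) for _ in range(1000)]
pred_a = [t + rng.gauss(0, 0.90) for t in truth]
pred_b = [t + rng.gauss(0, 0.95) for t in truth]

# Blend = w * pred_a + (1 - w) * pred_b.
# Minimizing the squared error over w gives the closed form below.
numerator = sum((t - b) * (a - b) for t, a, b in zip(truth, pred_a, pred_b))
denominator = sum((a - b) ** 2 for a, b in zip(pred_a, pred_b))
w = numerator / denominator

blend = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

rmse_a = rmse(truth, pred_a)
rmse_b = rmse(truth, pred_b)
rmse_blend = rmse(truth, blend)

print(f"model A: {rmse_a:.4f}  model B: {rmse_b:.4f}  blend: {rmse_blend:.4f}")
```

Because the two predictors make partly different mistakes, averaging them cancels some of the error. The same effect makes it tempting to keep adding models to an ensemble, even when each one adds engineering cost.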


This is a truly impressive compilation and culmination of years of work, blending hundreds of predictive models to finally cross the finish line. We evaluated some of the new methods offline but the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.


So the winners of the Netflix Prize put substantial effort into optimizing solely for RMSE, but because the contest metric did not factor in the wider engineering and operational costs, the final winning ensemble was never put into production.



Comment by Kevin R Keane on October 17, 2015 at 3:12am

The only thing illustrated here is be careful what you wish for.  The winners met their objective: winning the contest.  Indy Cars are not "street legal", and were never intended to be street legal. 
