Pluses and Minuses with R-Squared
In general, r-squared is often the "go to" statistic because it's easy to use and understand. Practically all statistical software includes it--even basic tools like Excel's data analysis add-in. You simply check a box, et voila! The software reports the percent of variation in y that your model explains.
R-squared can help you answer the question "How does my model perform, compared to a naive model?". However, r2 is far from a perfect tool. Probably the main issue is that every data set contains a certain amount of unexplainable variation. R-squared can't tell the difference between the explainable and the unexplainable, so it will go on, and on, trying to perfect its goal. If you keep "perfecting" your r-squared by adding more predictors, you'll end up with misleading results and reduced precision. Some of the other issues include:
One of the main drawbacks of r2 is that you can keep adding terms to your model to increase it. Model not quite up to par? Add a few more terms and it will be. Add a few more, and any model--even a bad one--could hit 99%. Basically, if you don't know your data inside out (and for large data sets, you probably won't), it can be challenging to know when to stop adding terms.
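This ratchet effect is easy to demonstrate. Below is a minimal sketch (the variable and function names are mine, and it assumes NumPy is available): fit an ordinary least-squares model with one real predictor, then keep appending columns of pure noise. R-squared never goes down.

```python
import numpy as np

def r_squared(X, y):
    """Plain R^2 from an ordinary least-squares fit (intercept included)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(42)
n = 50
x = rng.normal(size=(n, 1))
y = 2 * x[:, 0] + rng.normal(size=n)   # one real predictor, plus noise

# Append ten pure-noise predictors, one at a time: R^2 can only creep up.
X = x
scores = [r_squared(X, y)]
for _ in range(10):
    X = np.column_stack([X, rng.normal(size=n)])
    scores.append(r_squared(X, y))
print(scores)   # monotonically non-decreasing
```

Every extra noise column gives least squares one more degree of freedom to soak up residual variation, which is exactly why a rising r2 by itself proves nothing.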
A perfect alternative to r-squared doesn't exist: every choice has its pluses and minuses.
Adjusted R-Squared corrects for adding too many terms to the model. It will always be lower than r-squared, and it tends to be a better choice. However, it suffers from many of the same pitfalls as plain old r2. Perhaps its most important drawback is that it isn't predictive; it simply deals with the data you feed into it.
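For reference, the usual correction is 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of predictors. A tiny sketch (the function name and example numbers are mine):

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2: penalizes plain R^2 for the p predictors in the model."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The penalty grows with each added predictor, so the adjusted value
# can fall even when plain R^2 creeps upward.
print(adjusted_r_squared(0.90, n=50, p=3))    # ~0.893
print(adjusted_r_squared(0.91, n=50, p=10))   # ~0.887
```

Note the second call: plain r2 went up, but the adjusted value went down, which is the behavior the correction is designed to produce.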
Predicted R-squared (based on PRESS, the prediction sum of squares) avoids the "it only deals with the data at hand" problem: it gauges how well a model accounts for new observations. Beyond its predictive capabilities, a key plus is that it can help prevent overfitting; a large difference between your r2 and predicted r2 values is an indication that you have too many terms in your model.
A big minus: it is not widely available. At the time of writing, Excel doesn't include it, nor does SPSS (although SPSS has published a workaround). A few options:
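One option, if your software lacks it, is to compute it yourself. The statistic falls out of the hat matrix, because each leave-one-out residual equals the ordinary residual divided by 1 − h_ii. A minimal NumPy sketch (the function name and example data are mine):

```python
import numpy as np

def predicted_r_squared(X, y):
    """Predicted R^2 = 1 - PRESS/SStot, via the leave-one-out shortcut."""
    A = np.column_stack([np.ones(len(y)), X])   # add intercept column
    H = A @ np.linalg.pinv(A.T @ A) @ A.T       # hat matrix
    resid = y - H @ y                           # ordinary residuals
    loo = resid / (1 - np.diag(H))              # leave-one-out residuals
    press = np.sum(loo ** 2)                    # PRESS statistic
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - press / ss_tot

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(size=30)
print(predicted_r_squared(X, y))   # slightly below the ordinary R^2
```

Because each leave-one-out residual is at least as large as the ordinary one, PRESS is never smaller than the residual sum of squares, so predicted r2 always sits at or below plain r2.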
There are more than a few formulas for r-squared (defining them all is beyond the scope of this article, but if you're interested, see Kvalseth et al., as cited in Alexander et al.). A simple and common formula is shown below. Alternatives to this particular formula include using the median instead of the summation (Rousseeuw), or absolute values of the residuals instead of their squares (Seber).
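The simple, common version referred to above is the familiar least-squares definition:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Here the \hat{y}_i are the fitted values and \bar{y} is the mean of the observed y; the variants above swap the summation or the squaring in the numerator for a median or for absolute values.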
More formula tweaks deal specifically with the problem of outliers. Having them in your data can pose a problem: the least-squares r-squared used for variable selection in a linear regression model is sensitive to outliers. According to Croux & Dehon, the addition of just one single outlier can have a drastic effect. One alternative, R2LTS (Saleh), uses least trimmed squares; the author reports that this method is not sensitive to outliers. Rousseeuw uses an M-estimator to achieve a similar effect.
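Croux & Dehon's point is easy to reproduce. In the sketch below (the data are mine, and it assumes NumPy), a clean simple regression has an r2 near 1; moving a single point far off the line makes it collapse. For a simple linear regression with an intercept, r2 equals the squared correlation of x and y, which keeps the demo short:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 40)
y = 3 * x + rng.normal(scale=1.0, size=40)

# For simple linear regression, R^2 is the squared correlation of x and y.
clean = np.corrcoef(x, y)[0, 1] ** 2

y_out = y.copy()
y_out[-1] = -100.0   # a single wild outlier
dirty = np.corrcoef(x, y_out)[0, 1] ** 2
print(round(clean, 3), round(dirty, 3))   # clean near 1, dirty near 0
```

One bad point out of forty is enough to wreck the statistic, which is the motivation for the trimmed and M-estimator variants mentioned above.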
Sometimes, you just can't avoid reporting r2, especially if you're publishing a paper. If you can't avoid r-squared, the alternative is to use it, but use it wisely. Alexander et al. suggest that you:
Kvalseth et al. Cautionary Note about R2. The American Statistician. 1985 (November).
Rousseeuw PJ. Least Median of Squares Regression. J. Am. Stat. Assoc. 1984;79:871–880.
Seber GAF. Linear Regression Analysis. John Wiley & Sons; NY: 1977. p. 465.