Comparing Model Evaluation Techniques Part 3: Regression Models

In my previous posts, I compared model evaluation techniques using Statistical Tools & Tests and looked at commonly used Classification and Clustering evaluation techniques.

In this post, I’ll take a look at how you can compare regression models. Comparing regression models is perhaps one of the trickiest tasks in the “comparing models” arena, because there are literally dozens of statistics you can calculate to compare them, including:

1. Error measures in the estimation period (in-sample testing) or validation period (out-of-sample testing):

2. Tests on Residuals and Goodness-of-Fit:

Plots: actual vs. predicted values; cross-correlation; residual autocorrelation; residuals vs. time or predicted values.
Changes in mean or variance.
Tests: normally distributed errors; excessive runs (e.g. of positives or negatives); outliers, extreme values, and influential observations.

This list isn’t exhaustive; there are many other tools, tests and plots at your disposal. Rather than discuss the statistics in detail, I chose to focus this post on comparing a few of the most popular regression model evaluation techniques and discussing when you might want to use them (or when you might not). The techniques listed below tend to be on the “easier to use and understand” end of the spectrum, so if you’re new to model comparison they’re a good place to start.

Where to Start

The first question you should be asking is: How well do I know my data? In order to evaluate regression models, you need to know what results would be reasonable for your particular situation. For example, if you compare changes in mean or variance, one model might give you impossible results, while another might be overly complicated for the task at hand. The ideal model isn’t just “correct”; it also needs to be relatively simple and useful for the decision-making process, something that won’t be immediately obvious unless you know your data really well.

Which technique you choose depends largely on the software you have at hand (e.g. R, SPSS, or Excel). If you’re using Excel, a word of advice: stop. It was never designed for serious statistical work and has significant statistical problems. Duke University’s Robert Nau puts it best: “It’s a toy (a clumsy one at that), not a tool for serious work.” The number of models you’re testing also comes into play. Arguably, which statistic you use (r-squared, p-values etc.) is mostly a matter of personal preference (although the test you use might force that choice upon you). Each statistic has its advantages and disadvantages. I won’t bloat this article with all the comparisons between the statistics, but where possible I’ve linked to articles that explain them in detail.

Nested Models

Nested models are models that are subsets of one another: if you can get one model by constraining the parameters of another (for example, by fixing a coefficient at zero), then the first model is nested within the second. Nested models call for different evaluation techniques than non-nested models, and there isn’t a single, agreed-upon way to test for the “best” model.

Possibly the easiest way to compare nested models (it needs only a very basic understanding of statistics) is to simply measure how well each model performs reclassification. The “better” model will have a higher rate of correct reclassification. A chi-square analysis can be used, although if you run a test for sphericity you must use a different chi-square value.

If you’re comparing nested models (perhaps you want to know if the simplest model is adequate), you can compare them with a t-statistic. You can only run this significance test against a single extra coefficient; in other words, you can’t use it if one model has more than one additional coefficient compared to the other. This article has instructions in R, as well as a fairly detailed overview on running the general regression test or the extra sum of squares test.
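As a minimal sketch of the single-coefficient case, the Python below compares an intercept-only model with a model that adds one predictor by computing the t-statistic for that one extra coefficient. The data and the function name `slope_t_statistic` are made up for illustration; in practice your statistics package will report this value for you.

```python
import math

def slope_t_statistic(x, y):
    """t-statistic for the slope b in y = a + b*x.

    The slope is the single extra coefficient relative to the
    intercept-only model y = a, so this tests whether the simpler
    model is adequate.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the larger model
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = rss / (n - 2)   # two estimated coefficients
    std_error = math.sqrt(residual_variance / sxx)
    return slope, slope / std_error, n - 2

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, t_stat, df = slope_t_statistic(x, y)
print(f"slope={slope:.3f}, t={t_stat:.2f}, df={df}")
```

You would then compare the absolute value of `t_stat` with the critical value of a t-distribution with `df` degrees of freedom; a large value suggests the simpler model is not adequate.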

According to Calvin Garbin of the University of Nebraska-Lincoln, with SPSS you can compare nested models in two different ways using r-squared:

1. Get the multiple regression results for each model, then compare the models using the FZT Computator’s R²-change F-test.
2. Change from one model to another in SPSS, calculating the R²-change F-test. Although convenient, this doesn’t always calculate the statistic correctly.

Garbin’s article has a couple of excellent examples of how to perform the above tasks as well as SPSS procedures for comparing non-nested models using correlations.
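If you want to check an R²-change F-test by hand, the standard textbook formula is easy to compute from each model’s R². The sketch below is not taken from Garbin’s article or the FZT Computator; the function name `r2_change_f` and the example numbers are assumptions for illustration, and the predictor counts exclude the intercept.

```python
def r2_change_f(r2_reduced, r2_full, k_reduced, k_full, n):
    """F statistic for the change in R-squared between two nested
    least-squares models.

    k_reduced, k_full: number of predictors (intercept not counted)
    n: number of observations
    """
    if k_full <= k_reduced:
        raise ValueError("the full model must have more predictors")
    df_num = k_full - k_reduced      # number of added predictors
    df_den = n - k_full - 1          # residual df of the full model
    f_stat = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f_stat, df_num, df_den

# Hypothetical values: R-squared rises from 0.40 to 0.50 after adding
# two predictors, with 105 observations.
f_stat, df1, df2 = r2_change_f(0.40, 0.50, k_reduced=2, k_full=4, n=105)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

The statistic is then compared with an F-distribution on `df1` and `df2` degrees of freedom to get a p-value.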

An ANOVA F-test can compare two nested models, where one is a subset of the other. It can test a single predictor variable but, unlike the t-statistic, it can also test multiple predictors at a time.
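Under the hood, the nested-model F-test compares residual sums of squares (RSS). Here is a minimal sketch, assuming you already have the RSS of each fitted model; the function name `nested_f_test` and the numbers are hypothetical, and the parameter counts include the intercept.

```python
def nested_f_test(rss_reduced, rss_full, p_reduced, p_full, n):
    """Extra-sum-of-squares F statistic for two nested least-squares models.

    rss_*: residual sum of squares of each fitted model
    p_*:   number of estimated coefficients, intercept included
    n:     number of observations
    """
    if p_full <= p_reduced:
        raise ValueError("the full model must have more coefficients")
    df_num = p_full - p_reduced      # number of predictors being tested
    df_den = n - p_full              # residual df of the full model
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den

# Hypothetical values: adding two predictors lowers the RSS from 120 to 80.
f_value, df_num, df_den = nested_f_test(120.0, 80.0, p_reduced=2, p_full=4, n=44)
print(f"F({df_num}, {df_den}) = {f_value:.2f}")
```

When only one predictor is added (`df_num` equal to 1), this F statistic equals the square of that predictor’s t-statistic, which is why the F-test covers the single-predictor case as well.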

Multiple models can be compared using forward selection, backward elimination, or stepwise selection. These are all variants of one another: forward selection adds the predictor with the largest F-value or t-value (smallest p-value) at each step, backward elimination removes the predictor with the smallest F-value or t-value (largest p-value), and stepwise selection combines the two. These techniques can only be used on nested models, they can all miss the optimal model, and if you run all three on the same data they may not agree with each other.
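To make the mechanics concrete, here is a rough sketch of forward selection in plain Python. It is an illustration under stated assumptions, not a reference implementation: the helper names, the toy data, and the F-to-enter threshold of 4.0 are all hypothetical choices, and real software typically works from p-values instead.

```python
def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(rows[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(xtx, xty)
    return sum((y[i] - sum(rows[i][j] * beta[j] for j in range(p))) ** 2
               for i in range(n))

def forward_select(predictors, y, f_to_enter=4.0):
    """Add the predictor with the largest partial F until none passes the threshold."""
    selected, current_rss, n = [], rss([], y), len(y)
    while len(selected) < len(predictors):
        best = None
        for name in predictors:
            if name in selected:
                continue
            cols = [predictors[s] for s in selected] + [predictors[name]]
            new_rss = rss(cols, y)
            df_den = n - (len(cols) + 1)
            if df_den <= 0 or new_rss <= 0:
                continue  # skip saturated or perfect fits (avoids dividing by zero)
            f_stat = (current_rss - new_rss) / (new_rss / df_den)
            if best is None or f_stat > best[1]:
                best = (name, f_stat, new_rss)
        if best is None or best[1] < f_to_enter:
            break
        selected.append(best[0])
        current_rss = best[2]
    return selected

# Toy data: y depends on x1; x2 is unrelated noise.
predictors = {"x1": [1, 2, 3, 4, 5, 6, 7, 8], "x2": [4, 4, 2, 2, 5, 5, 3, 3]}
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.2]
chosen = forward_select(predictors, y)
print(chosen)
```

Backward elimination runs the same loop in reverse: start with every predictor and repeatedly drop the one with the smallest partial F.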

Non-Nested Models

Non-nested models leave you with fewer options for comparison. Because the models aren’t nested, test statistics that rely on nesting (e.g. a chi-square statistic) don’t apply. In layman’s terms, if your models are nested then you’re comparing apples to apples, which is much easier than comparing apples to oranges.

One of the simplest comparison methods is the Bayesian information criterion (BIC). Despite the daunting math behind the calculations, most statistical software will calculate the BIC for each model. This leaves you to simply interpret the results: the model with the lowest BIC is considered the best. It’s often preferred over other Bayesian methods like Bayes Factors, because BIC doesn’t require you to have knowledge about priors.
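For least-squares models with normally distributed errors, the BIC reduces (up to a constant shared by all models fitted to the same data) to a formula you can compute from the residual sum of squares. The sketch below assumes that special case; the function name `gaussian_bic`, the model labels, and the numbers are hypothetical.

```python
import math

def gaussian_bic(rss, n, k):
    """BIC for a least-squares model with normal errors.

    Omits an additive constant that is identical for every model fitted
    to the same n observations, so only differences are meaningful.
    rss: residual sum of squares, n: observations,
    k: number of estimated parameters (count them the same way for
       every model being compared).
    """
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical fits to the same 100 observations: (RSS, parameter count)
n = 100
models = {"A": (50.0, 3), "B": (48.0, 6)}
bic = {name: gaussian_bic(rss, n, k) for name, (rss, k) in models.items()}
best = min(bic, key=bic.get)   # lowest BIC wins
print(bic, "->", best)
```

Here model B fits slightly better, but not by enough to pay for its three extra parameters, so model A ends up with the lower BIC.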

Akaike’s Information Criterion (AIC) is similar to BIC, except that BIC penalizes extra parameters more heavily and so tends to favor models with fewer parameters. AIC ranks the models from best to worst. A major downside is that it says nothing about absolute quality: it will pick a “best” model even if every model you feed it is poor.
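The ranking idea can be sketched in a few lines. As with any hand calculation of this kind, the version below assumes least-squares models with normal errors and drops a constant common to all models; the function names, model labels, and numbers are made up for illustration.

```python
import math

def gaussian_aic(rss, n, k):
    """AIC for a least-squares model with normal errors, omitting a
    constant shared by all models fitted to the same data."""
    return n * math.log(rss / n) + 2 * k

def rank_by_aic(models, n):
    """Return (name, delta_aic) pairs sorted from best to worst.

    delta_aic is the gap to the best model; it is a relative measure
    only and says nothing about whether the best model is any good.
    """
    scores = {name: gaussian_aic(rss, n, k) for name, (rss, k) in models.items()}
    lowest = min(scores.values())
    return sorted(((name, score - lowest) for name, score in scores.items()),
                  key=lambda pair: pair[1])

# Hypothetical fits to the same 100 observations: (RSS, parameter count)
models = {"small": (50.0, 3), "medium": (45.0, 4), "large": (44.8, 7)}
ranking = rank_by_aic(models, n=100)
for name, delta in ranking:
    print(f"{name}: delta AIC = {delta:.2f}")
```

Because the output is just a set of differences, the top-ranked model is only the best of the candidates you supplied.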

The benefit of the Cox test is that it’s relatively simple (in comparison to the BIC or AIC) to understand what the test is doing behind the scenes. Let’s say you were comparing models A and B. If model A contains the correct regressors, then fitting the regressors from model B to the fitted values of model A should yield no further explanatory value. If there is further explanatory value, then model A doesn’t contain the correct set of regressors. You run the test twice, the second time with the roles of A and B swapped, and compare your findings. See: Performing the Cox Test in R.