10 types of regression. Which one to use?

Should you use linear or logistic regression? In what contexts? There are hundreds of types of regression. Here is an overview for data scientists and other analytic practitioners, to help you decide which regression to use depending on your context. Many of the referenced articles are much better written (fully edited) in my data science Wiley book.

  • Linear regression: Oldest type of regression, designed 250 years ago; computations (on small data) could easily be carried out by a human being, by design. Can be used for interpolation, but is not suitable for predictive analytics; it has many drawbacks when applied to modern data, e.g. sensitivity to both outliers and cross-correlations (in both the variable and observation domains), and it is subject to over-fitting. A better solution is piecewise-linear regression, in particular for time series.
  • Logistic regression: Used extensively in clinical trials, scoring and fraud detection, when the response is binary (chance of succeeding or failing, e.g. for a newly tested drug or a credit card transaction). Suffers from the same drawbacks as linear regression (not robust, model-dependent), and computing the regression coefficients involves a complex, iterative algorithm that can be numerically unstable. Can be well approximated by linear regression after transforming the response (logit transform). Some versions (Poisson or Cox regression) have been designed for a non-binary response: for categorical data (classification), ordered integer responses (age groups), and even continuous responses (regression trees).
  • Ridge regression: A more robust version of linear regression, putting constraints on the regression coefficients to make them much more natural, less subject to over-fitting, and easier to interpret. Click here for source code.
  • Lasso regression: Similar to ridge regression, but automatically performs variable reduction (allowing regression coefficients to be zero). 
  • Ecologic regression: Consists of performing one regression per stratum, if your data is segmented into several rather large core strata, groups, or bins. Beware of the curse of big data in this context: if you perform millions of regressions, some will be totally wrong, and the best ones will be overshadowed by noisy ones with great but artificial goodness-of-fit: a big concern if you try to identify extreme events and causal relationships (global warming, rare diseases or extreme flood modeling). Here's a fix to this problem.
  • Regression in unusual spaces: click here for details. Example: to detect whether meteorite fragments come from the same celestial body, or to reverse-engineer the Coca-Cola formula.
  • Logic regression: Used when all variables are binary, typically in scoring algorithms. It is a specialized, more robust form of logistic regression (useful for fraud detection, where each variable is a 0/1 rule), in which all variables have been binned into binary variables.
  • Bayesian regression: see the entry in Wikipedia. It's a kind of penalized likelihood estimator, and thus somewhat similar to ridge regression: more flexible and stable than traditional linear regression. It assumes that you have some prior knowledge about the regression coefficients and the error term, relaxing the assumption that the error must have a normal distribution (the errors must still be independent across observations). However, in practice, the prior knowledge is translated into artificial (conjugate) priors, a weakness of this technique.
  • Quantile regression: Used in connection with extreme events; see Common Errors in Statistics, page 238, for details.
  • LAD regression: Similar to linear regression, but using absolute values (L1 space) rather than squares (L2 space). More robust, see also our L1 metric to assess goodness-of-fit (better than R^2) and our L1 variance (one version of which is scale-invariant).
  • Jackknife regression: This is a new type of regression, also used as a general clustering and data reduction technique. It addresses the main drawbacks of traditional regression. It provides an approximate, yet very accurate, robust solution to regression problems, and works well with "independent" variables that are correlated and/or non-normal (for instance, data distributed according to a mixture model with several modes). Ideal for black-box predictive algorithms. It approximates linear regression quite well, but it is much more robust, and works when the assumptions of traditional regression (uncorrelated variables, normally distributed data, homoscedasticity) are violated.

Note: Jackknife regression has nothing to do with Bradley Efron's Jackknife, bootstrap and other re-sampling techniques published in 1982; indeed it has nothing to do with re-sampling techniques.

Other Solutions

  • Data reduction can also be performed with our feature selection algorithm.
  • It's always a good idea to blend multiple techniques together to improve your regression, clustering or segmentation algorithms. An example of such blending is hidden decision trees.
  • Categorical independent variables, such as race, are sometimes coded using multiple (binary) dummy variables.
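The dummy-variable coding mentioned in the last bullet can be sketched with pandas (the three-level "group" column here is a made-up example, not data from the article):

```python
import pandas as pd

# A categorical predictor with three levels, coded as two binary dummy
# columns; the omitted level ("A") acts as the reference category.
df = pd.DataFrame({"group": ["A", "B", "C", "B", "A"]})
dummies = pd.get_dummies(df["group"], prefix="group", drop_first=True)
print(dummies)
```

A k-level categorical variable needs only k-1 dummies: rows belonging to the dropped reference level are all zeros, which avoids perfect collinearity with the intercept.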

Before working on any project, read our article on the lifecycle of a data science project.

Comments

Comment by J.T. Radman on July 31, 2014 at 7:42pm

What are folks' thoughts on MARS (Multivariate Adaptive Regression Splines) as far as regression techniques? R: earth. Python: py-earth. Salford Systems owns the MARS implementation.

http://www.slideshare.net/salfordsystems/evolution-of-regression-ol...

Comment by Iga Korneta on July 30, 2014 at 8:48am

I'd love to see a case study, to show how different methods provide different results.

Comment by Vincent Granville on July 24, 2014 at 6:37am

About R implementations, here is a comment by Alan Parker (see also Amy's comment below):

The CRAN task view "Robust statistical methods" gives a long list of regression methods, including many that Vincent mentions. Here are some that are not mentioned there: 

Regression in unusual spaces. This subject is old. It is usually addressed under the title “Compositional data” (see Wikipedia entry). The late John Aitchison founded this area of statistics. Googling his name + “compositional data” gives access to a number of his articles. The R package “compositions” deals with it comprehensively. Another package treats the problem using robust statistics: “robCompositions”. 

Bayesian regression. I find Bayesian stuff conceptually hard, so I am using John Kruschke’s friendly book: “Doing Bayesian data analysis”. Chapter 16 is on linear regression. He provides a free R package to carry out all the analyses in the book. The CRAN view “Bayesian” has many other suggestions. Package BMA does linear regression, but packages for Bayesian versions of many other types of regression are also mentioned. 

Comment by Kalyanaraman K on July 24, 2014 at 5:00am
Yes. ARIMA is one among the models I considered.
Comment by Mirko Krivanek on July 23, 2014 at 8:09am

I think what Kalyanaraman has in mind is auto-regressive models for time series, like ARIMA processes and Box & Jenkins types of tools to estimate the parameters. A simple form is x(t) = a * x(t-1) + b * x(t-2) + error, where t is the time, and a, b are the "regression" coefficients: positive numbers satisfying a + b < 1 (otherwise the time series is non-stationary and explodes).
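The AR(2) recursion above is easy to simulate; here is a minimal sketch with numpy, choosing positive coefficients with a + b < 1 so the simulated series stays stationary (the coefficient values are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 0.5, 0.3          # a + b < 1 keeps the AR(2) series stationary
n = 5000
x = np.zeros(n)
for t in range(2, n):
    x[t] = a * x[t - 1] + b * x[t - 2] + rng.normal()

# A stationary series stays bounded in probability: its sample
# variance settles around a finite value instead of growing with n.
print("sample variance:", x.var())
```

Setting a + b = 1 (e.g. a, b = 0.7, 0.3) and re-running shows the unit-root behavior: the sample variance keeps growing as n increases.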

Comment by Amy on July 23, 2014 at 7:48am

Bayesian regression was added later. Here's how to do it in SAS, courtesy of one of our readers, Ralph Winters:

For Bayesian analysis in SAS, you can use Proc MCMC, or do some post Bayesian type comparisons using proc GENMOD with the Bayes option, or even proc Logistic.

Ralph Winters
Data Architect at EmblemHealth

Comment by Amy on July 23, 2014 at 7:45am

Here's how to do it in R, courtesy of one of our readers, Blaise F Egan:

  • Linear and logistic regression are in the base stats module.
  • Ridge regression and Lasso regression are available in the 'glmnet' package.
  • Quantile regression is available in the 'quantreg' package.
  • I think LAD regression could be implemented using one of the optimisers, such as 'optim'.
  • To me, 'ecological regression' would suggest doing a linear regression with aggregated variables, not a separate technique, but I might be wrong.
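Blaise's suggestion of fitting LAD regression with a general optimiser (R's 'optim') carries over to Python; here is a minimal sketch using scipy.optimize.minimize, where the simulated data and heavy-tailed noise are illustrative only:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.standard_t(df=2, size=n)  # heavy-tailed noise

def l1_loss(beta):
    # Sum of absolute residuals (L1), instead of squared residuals (L2)
    return np.sum(np.abs(y - beta[0] - beta[1] * x))

# Nelder-Mead handles the non-smooth L1 objective without gradients
fit = minimize(l1_loss, x0=np.zeros(2), method="Nelder-Mead")
print("LAD intercept, slope:", fit.x)
```

Because the loss is a sum of absolute values, the fitted line tracks the conditional median, which is why LAD is far less sensitive to the heavy-tailed outliers than least squares.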
Comment by Kalyanaraman K on July 23, 2014 at 5:05am
Hi Vincent
I was thinking about the class of regressions where the data vary over time, say time series. You may know that econometric methods contain a lot of alternative versions of regression, depending on the type of violation of the basic assumptions of the linear model. You are right when you say jackknife and transformations may address some of these issues, but not all. Thus there are regressions with appropriate transformations to control heteroscedasticity; regressions with AR(1) disturbances; regressions with distributed lags or a geometric lag structure of explanatory variables; regressions with lagged explained variables, leading to partial adjustment and adaptive expectation models; regressions with stochastic regressors; and regressions with errors in measurement, leading to regression with instrumental variables. Above all, there is the problem of co-integrated models in regression. I was just adding to your list. 
Kalyanaraman
Comment by Vincent Granville on July 23, 2014 at 4:22am

Hi Kalyanaraman,

Can you elaborate? Jackknife regression addresses this issue. But you can also transform your data, or use PCA to decorrelate the variables (I don't like PCA because the new variables lack interpretability). But maybe you had another idea in mind.

Transforming your data is a bit risky in the context of black-box, automated data science, because each time you add enough new data, you need to transform the whole data set again, which creates a bit of instability. This is not an issue for small transformations involving one observation at a time, but it is for big transformations involving all observations simultaneously (e.g. Mahalanobis transforms).

If the error term in your model is auto-correlated, you might want to stratify your data and perform ecologic regression mentioned above.

Vincent

Comment by Kalyanaraman K on July 22, 2014 at 10:26pm
Allow me to add a wide class of alternative regressions associated with situations deviating from the usual assumptions of linear regression, like heteroscedasticity, autocorrelation and others.

© 2014   Data Science Central