10 types of regressions. Which one to use?

Should you use linear or logistic regression? In what contexts? There are hundreds of types of regressions. Here is an overview for data scientists and other analytics practitioners, to help you decide which regression to use depending on your context. Many of the referenced articles are much better written (fully edited) in my data science Wiley book.

  • Linear regression: The oldest type of regression, designed 250 years ago; computations (on small data) could easily be carried out by a human being, by design. Can be used for interpolation, but is not suitable for predictive analytics; it has many drawbacks when applied to modern data, e.g. sensitivity to both outliers and cross-correlations (in both the variable and observation domains), and it is subject to over-fitting. A better solution is piecewise-linear regression, in particular for time series.
  • Logistic regression: Used extensively in clinical trials, scoring and fraud detection, when the response is binary (chance of succeeding or failing, e.g. for a new tested drug or a credit card transaction). Suffers from the same drawbacks as linear regression (not robust, model-dependent), and computing the regression coefficients involves a complex, iterative, numerically unstable algorithm. Can be well approximated by linear regression after transforming the response (logit transform). Some versions (Poisson or Cox regression) have been designed for a non-binary response, for categorical data (classification), ordered integer response (age groups), and even continuous response (regression trees).
  • Ridge regression: A more robust version of linear regression, putting constraints on regression coefficients to make them much more natural, less subject to over-fitting, and easier to interpret; see the sketch after this list. Click here for source code.
  • Lasso regression: Similar to ridge regression, but automatically performs variable reduction (allowing regression coefficients to be zero). 
  • Ecologic regression: Consists of performing one regression per stratum, if your data is segmented into several rather large core strata, groups, or bins. Beware of the curse of big data in this context: if you perform millions of regressions, some will be totally wrong, and the best ones will be overshadowed by noisy ones with great but artificial goodness-of-fit: a big concern if you try to identify extreme events and causal relationships (global warming, rare diseases or extreme flood modeling). Here's a fix to this problem.
  • Regression in unusual spaces: click here for details. Example: to detect whether meteorite fragments come from the same celestial body, or to reverse-engineer the Coca-Cola formula.
  • Logic regression: Used when all variables are binary, typically in scoring algorithms. It is a specialized, more robust form of logistic regression (useful for fraud detection where each variable is a 0/1 rule), where all variables have been binned into binary variables.
  • Bayesian regression: see the entry in Wikipedia. It is a kind of penalized likelihood estimator, and thus somewhat similar to ridge regression: more flexible and stable than traditional linear regression. It assumes that you have some prior knowledge about the regression coefficients and the error term, relaxing the assumption that the error must have a normal distribution (the error must still be independent across observations). However, in practice, the prior knowledge is translated into artificial (conjugate) priors - a weakness of this technique.
  • Quantile regression: Used in connection with extreme events; read Common Errors in Statistics, page 238, for details.
  • LAD regression: Similar to linear regression, but using absolute values (L1 space) rather than squares (L2 space). More robust, see also our L1 metric to assess goodness-of-fit (better than R^2) and our L1 variance (one version of which is scale-invariant).
  • Jackknife regression: This is a new type of regression, also used as a general clustering and data reduction technique. It solves all the drawbacks of traditional regression. It provides an approximate, yet very accurate and robust, solution to regression problems, and works well with "independent" variables that are correlated and/or non-normal (for instance, data distributed according to a mixture model with several modes). Ideal for black-box predictive algorithms. It approximates linear regression quite well, but it is much more robust, and works when the assumptions of traditional regression (non-correlated variables, normal data, homoscedasticity) are violated.
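
To make the contrast between plain linear regression and its more robust relatives concrete, here is a minimal Python sketch on synthetic data (assuming scikit-learn and statsmodels are available; the data, penalty values and variable names are made up for illustration):

```python
# Minimal sketch: OLS vs. ridge, lasso and LAD (L1) regression on synthetic data
# with correlated predictors and a few injected outliers.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)        # x2 strongly correlated with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)
y[:5] += 20                                     # inject a few outliers

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)              # shrinks coefficients, tames correlation
lasso = Lasso(alpha=0.1).fit(X, y)              # can drive redundant coefficients to zero
lad = sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)   # LAD = median (L1) regression

print("OLS  :", ols.coef_)
print("Ridge:", ridge.coef_)
print("Lasso:", lasso.coef_)
print("LAD  :", lad.params[1:])                 # skip the intercept
```

In this sketch, LAD (median) regression is the one that resists the injected outliers, while ridge and lasso mainly stabilize the coefficients of the two correlated predictors.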

Note: Jackknife regression has nothing to do with Bradley Efron's jackknife, bootstrap and other re-sampling techniques published in 1982; indeed, it does not involve re-sampling at all.

Other Solutions

  • Data reduction can also be performed with our feature selection algorithm.
  • It's always a good idea to blend multiple techniques together to improve your regression, clustering or segmentation algorithms. An example of such blending is hidden decision trees.
  • Categorical independent variables, such as race, are sometimes coded using multiple (binary) dummy variables; see the sketch below.
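
As an illustration of the dummy-variable coding mentioned in the last bullet, here is a minimal pandas sketch (the column names are hypothetical):

```python
# Minimal sketch: coding a categorical independent variable as binary dummy variables.
import pandas as pd

df = pd.DataFrame({"race": ["A", "B", "C", "A", "B"],
                   "income": [40, 52, 47, 61, 38]})

# drop_first=True keeps k-1 dummies for k categories, avoiding perfect collinearity
dummies = pd.get_dummies(df["race"], prefix="race", drop_first=True)
X = pd.concat([df[["income"]], dummies], axis=1)
print(X)
```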

Before working on any project, read our article on the lifecycle of a data science project.

Comment by Vincent Granville on July 23, 2014 at 4:22am

Hi Kalyanaraman,

Can you elaborate? Jackknife regression addresses this issue. You can also transform your data, or use PCA to decorrelate the variables (I don't like it because the new variables lack interpretation). But maybe you had another idea in mind.

Transforming your data is a bit risky in the context of black-box, automated data science, because each time you add enough new data, you need to transform the whole data set again, which creates a bit of instability. This is not an issue for small transformations involving one observation at a time, but it is for big transformations involving all observations simultaneously (e.g. Mahalanobis transforms).

If the error term in your model is auto-correlated, you might want to stratify your data and perform ecologic regression mentioned above.

Vincent
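
A minimal Python sketch of the stratified (ecologic) regression suggested in the comment above, i.e. one regression per stratum; it assumes a pandas DataFrame with hypothetical columns stratum, x and y:

```python
# Minimal sketch: ecologic regression as one OLS fit per stratum.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "stratum": np.repeat(["north", "south", "west"], 100),
    "x": rng.normal(size=300),
})
df["y"] = 1.5 * df["x"] + rng.normal(size=300)

# One regression per stratum
for name, group in df.groupby("stratum"):
    fit = sm.OLS(group["y"], sm.add_constant(group["x"])).fit()
    print(name, fit.params.to_dict())
```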

Comment by Kalyanaraman K on July 22, 2014 at 10:26pm
Allow me to add a wide class of alternative regressions associated with situations deviating from the usual assumptions of linear regression, like heteroscedasticity, autocorrelation and others.
Comment by Vincent Granville on July 22, 2014 at 3:44pm

Here is how to do these regressions using SAS:

  • Linear regression - PROC REG, GLM, GLIMMIX, MIXED and GLMSELECT can all be useful
  • Logistic regression - PROC LOGISTIC, GLIMMIX
  • Ridge regression - PROC REG can do it
  • Lasso - PROC GLMSELECT
  • Ecologic - first sort on the variable you want to stratify by, then add a BY statement to the appropriate procedure
  • Regression in unusual spaces - will depend on the space
  • Quantile regression - PROC QUANTREG
  • LAD regression - PROC GLMSELECT

Thanks Peter Flom for providing this answer. 

Anyone interested in providing solutions (function calls) in R, Python or using other packages?
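
As a rough Python counterpart to the SAS list above (an approximate mapping using scikit-learn and statsmodels, not an exact equivalent of each procedure; data is synthetic):

```python
# Approximate Python analogues of the SAS procedures listed above.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=100)
y01 = (y > 0).astype(int)                        # a binary response for logistic regression

LinearRegression().fit(X, y)                     # linear regression (PROC REG / GLM)
LogisticRegression().fit(X, y01)                 # logistic regression (PROC LOGISTIC)
Ridge(alpha=1.0).fit(X, y)                       # ridge regression
Lasso(alpha=0.1).fit(X, y)                       # lasso (PROC GLMSELECT)
sm.QuantReg(y, sm.add_constant(X)).fit(q=0.9)    # quantile regression (PROC QUANTREG)
sm.QuantReg(y, sm.add_constant(X)).fit(q=0.5)    # LAD regression = median regression
# Ecologic: group the data (e.g. pandas groupby) and fit one of the models per stratum
```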
