Should you use linear or logistic regression? In what contexts? There are hundreds of types of regressions. Here is an overview for data scientists and other analytics practitioners, to help you decide which regression to use depending on your context. Many of the referenced articles are much better written (fully edited) in my data science Wiley book.
Note: Jackknife regression has nothing to do with Bradley Efron's Jackknife, bootstrap and other re-sampling techniques published in 1982; indeed it has nothing to do with re-sampling techniques.
Other Solutions
Before working on any project, read our article on the lifecycle of a data science project.
Comment
Nice thumbnail outline. FYI, the term 'jackknife' was also used by Bottenberg and Ward, Applied Multiple Linear Regression, in the '60s and '70s, but in the context of segmenting. As mentioned by Kalyanaraman in this thread, econometrics offers other approaches to addressing multicollinearity, autocorrelation in time series data, solving simultaneous equation systems, heteroskedasticity, and over- and under-identification.
I'm puzzled why there isn't more attention here to the underlying model. If you have strong reason to believe that the underlying model is linear, then linear regression is fine. If you have strong reason to believe it's sigmoidal, then linear regression is an unlikely candidate. What it usually boils down to, in my experience, is defining the model, and defining the norm. Answers to those two questions pretty much define the problem that you are solving, and given that, there is a (usually) unique solution. It is frustrating to me when I see people typing stuff in at the keyboard but they don't have a solid description of the problem they are solving. Once you have that problem definition, the specific method of solution is often pretty clear.
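To make the point about "defining the model, and defining the norm" concrete, here is a minimal sketch. It assumes the simplest possible model (a single constant c, so y ≈ c) and made-up data containing one outlier; the data values and the grid of candidates are illustrative assumptions, not taken from the article. The same model fitted under two different norms gives two different answers.

```python
import numpy as np

# Hypothetical data: five observations, one of them an outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Candidate values for the single parameter c of the model y ≈ c.
candidates = np.linspace(0.0, 100.0, 10001)

# Residuals of every observation against every candidate value.
residuals = y[None, :] - candidates[:, None]

# Same model, two different norms for measuring the misfit.
l2_cost = (residuals ** 2).sum(axis=1)   # sum of squared errors
l1_cost = np.abs(residuals).sum(axis=1)  # sum of absolute errors

best_l2 = candidates[np.argmin(l2_cost)]
best_l1 = candidates[np.argmin(l1_cost)]

print("L2 solution:", round(best_l2, 2), "| mean of y:", y.mean())
print("L1 solution:", round(best_l1, 2), "| median of y:", np.median(y))
```

Under the squared-error norm the best constant is the mean, which the outlier drags upward; under the absolute-error norm it is the median, which barely notices the outlier. Once the model and the norm are both fixed, the solution follows.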
Another type of regression that I find very useful is Support Vector Regression, proposed by Vapnik, coming in two flavors:
SVR - (python - sklearn.svm.SVR) - regression depends only on support vectors from the training data. The cost function for building the model ignores any training data epsilon-close to the model prediction.
NuSVR - (python - sklearn.svm.NuSVR) - uses a parameter nu to control the number of support vectors used by the SVR.
As in support vector classification, in SVR different kernels can be used in order to build more complex models using the kernel trick.
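As a rough illustration of the two flavors and of the effect of the kernel, here is a minimal sketch using scikit-learn. The synthetic sine data and the values chosen for C, epsilon and nu are assumptions made for the example, not recommendations.

```python
import numpy as np
from sklearn.svm import SVR, NuSVR

# Synthetic data (an assumption for this sketch): a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# epsilon-SVR: points within epsilon of the prediction incur no loss.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# nu-SVR: nu controls the number of support vectors instead of epsilon.
nu_svr = NuSVR(kernel="rbf", C=10.0, nu=0.3).fit(X, y)

# Same epsilon-SVR with a linear kernel, for comparison.
linear_svr = SVR(kernel="linear", C=10.0, epsilon=0.1).fit(X, y)

for name, model in [("SVR (rbf)", svr),
                    ("NuSVR (rbf)", nu_svr),
                    ("SVR (linear)", linear_svr)]:
    n_support = len(model.support_)   # indices of the support vectors
    r2 = model.score(X, y)            # R^2 on the training data
    print(f"{name}: {n_support} support vectors, R^2 = {r2:.3f}")
```

The linear kernel can only fit a straight line through the sine curve, while the RBF kernel follows its shape; that gap is the kernel trick at work in a regression setting.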
What are folks' thoughts on MARS (Multivariate Adaptive Regression Splines) as far as regression techniques go? R: earth. Python: py-earth. Salford Systems owns the MARS implementation.
http://www.slideshare.net/salfordsystems/evolution-of-regression-ol...
I'd love to see a case study, to show how different methods provide different results.
About R implementations, here is a comment by Alan Parker (see also Amy's comment below):
The CRAN task view: “Robust statistical methods” gives a long list of regression methods, including many that Vincent mentions. Here are some that are not mentioned there:
Regression in unusual spaces. This subject is old. It is usually addressed under the title “Compositional data” (see Wikipedia entry). The late John Aitchison founded this area of statistics. Googling his name + “compositional data” gives access to a number of his articles. The R package “compositions” deals with it comprehensively. Another package treats the problem using robust statistics: “robCompositions”.
Bayesian regression. I find Bayesian stuff conceptually hard, so I am using John Kruschke’s friendly book: “Doing Bayesian data analysis”. Chapter 16 is on linear regression. He provides a free R package to carry out all the analyses in the book. The CRAN view “Bayesian” has many other suggestions. Package BMA does linear regression, but packages for Bayesian versions of many other types of regression are also mentioned.
I think what Kalyanaraman has in mind is auto-regressive models for time series, like ARIMA processes and Box & Jenkins types of tools to estimate the parameters. A simple form is x(t) = a * x(t-1) + b * x(t-2) + error, where t is the time and a, b are the "regression" coefficients. For the series to be stationary, the coefficients must satisfy a + b < 1, b - a < 1 and |b| < 1; if a + b > 1 the time series explodes, and a + b = 1 is the borderline (unit-root) case.
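The recursion x(t) = a * x(t-1) + b * x(t-2) + error can be sketched in a few lines. This is a minimal illustration, not a Box & Jenkins workflow: the coefficient values 0.5 and 0.3, the Gaussian noise, and the helper names `simulate_ar2`, `fit_ar2` and `is_stationary` are all assumptions made for the example. The coefficients are recovered by ordinary least squares on lagged values, which is why they can be read as "regression" coefficients.

```python
import numpy as np

def simulate_ar2(a, b, n, rng, noise_scale=1.0):
    """Simulate x(t) = a*x(t-1) + b*x(t-2) + Gaussian error."""
    x = np.zeros(n)
    for t in range(2, n):
        x[t] = a * x[t - 1] + b * x[t - 2] + rng.normal(scale=noise_scale)
    return x

def fit_ar2(x):
    """Estimate (a, b) by regressing x(t) on x(t-1) and x(t-2)."""
    design = np.column_stack([x[1:-1], x[:-2]])  # columns: x(t-1), x(t-2)
    target = x[2:]                               # x(t) for t >= 2
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

def is_stationary(a, b):
    """Standard check: roots of z^2 - a*z - b must lie inside the unit circle."""
    return bool(np.all(np.abs(np.roots([1.0, -a, -b])) < 1.0))

rng = np.random.default_rng(42)
series = simulate_ar2(a=0.5, b=0.3, n=5000, rng=rng)
a_hat, b_hat = fit_ar2(series)

print("estimated a, b:", round(a_hat, 3), round(b_hat, 3))
print("stationary for (0.5, 0.3):", is_stationary(0.5, 0.3))
print("stationary for (0.8, 0.5):", is_stationary(0.8, 0.5))
```

With a few thousand observations the least-squares estimates land close to the coefficients used in the simulation. In practice one would reach for a dedicated time series library rather than hand-rolled least squares.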
© 2016 Data Science Central