Why you should NEVER run a Logistic Regression (unless you have to)

Hello fellow Data Science-Centralists!

I wrote a post on LinkedIn about why you should NEVER run a logistic regression (unless you really have to).

The main thrust is:

  • There is no theoretical reason why a least squares estimator can't work on a 0/1 outcome.
  • There are only very narrow theoretical reasons to run a logistic, and unless you fall into those categories it's not worth the time.
  • The run time of a logistic can be up to 100x longer than that of an OLS model, since the logistic is fit iteratively while OLS has a closed-form solution. If you are doing v-fold cross-validation, save yourself some time.
  • The XB term (the linear predictor) is built exactly the same way whether you use a logistic or a linear regression. The model specification (features, feature engineering, feature selection, interaction terms) is identical, and this is what you should be focused on anyway.
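  • Myth: Linear regression can only fit linear relationships. It is linear in the coefficients, not in the features, so transformed features and interaction terms are fair game.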
  • There is *one* practical reason to run a logistic: if the predictions are all very close to 0 or to 1, and you can't hard-code a prediction to 0 or 1 when the linear model falls outside the valid probability range, then use the logistic. For example, if you are pricing an insurance policy based on risk, you can't have a hard-coded 0.000% prediction because you can't price that correctly.
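
To make the bullets above concrete, here is a minimal sketch that fits both models to the same design matrix and compares the predictions. Everything in it is an assumption for illustration: the data is synthetic, the coefficients are made up, and the logistic is fit with a small hand-written Newton-Raphson loop (the helper name fit_logistic is hypothetical) rather than any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 0/1 data (illustrative assumption, not from the post).
n = 5000
x = rng.normal(size=n)
# Same XB specification for both models: intercept, x, and a squared term.
X = np.column_stack([np.ones(n), x, x ** 2])
true_beta = np.array([-0.5, 1.0, 0.3])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
y = (rng.random(n) < p_true).astype(float)

# Linear model: ordinary least squares directly on the 0/1 target.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
p_ols_raw = X @ beta_ols
# Hard-code out-of-range predictions to the nearest valid probability.
p_ols = np.clip(p_ols_raw, 0.0, 1.0)
share_outside = np.mean((p_ols_raw < 0.0) | (p_ols_raw > 1.0))


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson (hypothetical helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)            # gradient of the log-likelihood
        hess = X.T @ (X * w[:, None])   # negative Hessian of the log-likelihood
        beta = beta + np.linalg.solve(hess, grad)
    return beta


beta_logit = fit_logistic(X, y)
p_logit = 1.0 / (1.0 + np.exp(-X @ beta_logit))

# How closely do the two sets of predicted probabilities agree?
corr = np.corrcoef(p_ols, p_logit)[0, 1]
mean_abs_diff = np.mean(np.abs(p_ols - p_logit))

print(f"correlation of predictions: {corr:.3f}")
print(f"mean absolute difference:   {mean_abs_diff:.3f}")
print(f"share of OLS predictions outside [0, 1]: {share_outside:.3%}")
```

Note that the coefficients themselves are on different scales (probability versus log-odds), so compare the predictions, not the betas. How close the two come will depend on your data; the gap tends to matter most when the probabilities sit near 0 or 1, which is the case the last bullet warns about.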

See video here and slides here.

I think it'd be nice to start a debate on this topic!




Comment by JON DICKENS on October 15, 2020 at 6:56pm

I think that you should use the most appropriate method for the given problem and data quality.

When the target variable of your classification / predictive model is binary, I would use Logistic Regression.

When the target variable of your classification / predictive model is ordinal with more than two values, I would use Ordinal Logistic Regression.

For me the quality of the analytical solution is more important than raw speed, as there are many other factors in the model development / model validation processes which take much more time than running an analytical program.

Comment by Vincent Granville on October 15, 2020 at 7:48am

I agree. When faced with a logistic regression, I just apply the logit transform to the response (the dependent variable) and run a standard linear regression on the transformed response. The results are essentially the same, and it is much faster.
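
A minimal sketch of the transform-then-regress approach described in this comment follows. It assumes the response is a proportion strictly between 0 and 1 (for example, grouped success rates), because the logit of an exact 0 or 1 is infinite. The grouped data, the coefficients, and the 0.5 continuity adjustment are all illustrative assumptions, not details taken from the comment.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical grouped data: 200 groups with 500 binary trials each.
n_groups, n_trials = 200, 500
x = rng.uniform(-2.0, 2.0, size=n_groups)
X = np.column_stack([np.ones(n_groups), x])
true_beta = np.array([0.2, 0.8])
p_true = 1.0 / (1.0 + np.exp(-X @ true_beta))
successes = rng.binomial(n_trials, p_true)

# Observed proportions, nudged away from 0 and 1 so the logit stays finite.
prop = (successes + 0.5) / (n_trials + 1.0)

# Logit-transform the response, then run a standard linear regression.
z = np.log(prop / (1.0 - prop))
beta_hat, *_ = np.linalg.lstsq(X, z, rcond=None)

# Map fitted values back to the probability scale.
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))

print("estimated coefficients (log-odds scale):", beta_hat)
```

With a raw 0/1 response per row this transform is not defined, so some form of grouping or smoothing is needed before it can be applied.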
