
Why you should NEVER run a Logistic Regression (unless you have to)

Hello fellow Data Science-Centralists!

I wrote a post on LinkedIn about why you should NEVER run a Logistic Regression (unless you really have to).

The main thrust is:

  • There is no theoretical reason why a least squares estimator can’t work on a 0/1 outcome (this is just the classic linear probability model).
  • The theoretical reasons for running a logistic are very narrow, and unless you fall into one of those cases it’s not worth the time.
  • The run time of a logistic can be up to 100x longer than that of an OLS model; if you are doing v-fold cross-validation, save yourself some time.
  • The Xβ’s are exactly the same whether you use a logistic or a linear regression. The model specification (features, feature engineering, feature selection, interaction terms) is identical, and that is what you should be focused on anyway.
  • Myth: linear regression can only fit linear models. “Linear” refers to the coefficients, not the features, so polynomial terms, splines, and interactions are all fair game (see the second sketch below).
  • There is *one* practical reason to run a logistic: when the predictions are all very close to 0 or to 1 and you can’t simply hard-code a 0 or 1 whenever the linear model’s prediction falls outside the valid probability range. For example, if you are pricing an insurance policy based on risk, a hard-coded 0.000% prediction is useless because you can’t price it correctly. (The first sketch after this list shows the comparison and the clipping step.)
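
To make the comparison concrete, here is a minimal sketch of the kind of experiment the post has in mind. It is my own illustration, not code from the post or the slides: the simulated data, the sample size, and the scikit-learn estimators are assumptions, and the actual speed gap will depend on your data, solver, and number of cross-validation folds. It fits both models on a 0/1 target, times them, compares the predictions, and clips the OLS predictions into [0, 1].

```python
# Illustrative sketch (not from the original post): OLS "linear probability
# model" vs. logistic regression on a simulated 0/1 outcome.
import time

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Simulate a binary outcome driven by a handful of features.
n, p = 200_000, 20
X = rng.normal(size=(n, p))
true_beta = rng.normal(size=p)
prob = 1 / (1 + np.exp(-(X @ true_beta) / 4))
y = rng.binomial(1, prob)

# OLS on the 0/1 target: a single closed-form solve, very fast.
t0 = time.perf_counter()
ols = LinearRegression().fit(X, y)
ols_time = time.perf_counter() - t0

# Logistic regression: iterative optimisation, noticeably slower,
# and the gap compounds under v-fold cross-validation.
t0 = time.perf_counter()
logit = LogisticRegression(max_iter=1000).fit(X, y)
logit_time = time.perf_counter() - t0

ols_pred = ols.predict(X)
logit_pred = logit.predict_proba(X)[:, 1]

# If OLS wanders outside [0, 1] you can simply clip it -- unless, as the
# post notes, you genuinely need calibrated tail probabilities (e.g. when
# pricing insurance risk), in which case use the logistic.
ols_pred_clipped = np.clip(ols_pred, 0.0, 1.0)

print(f"OLS fit time:      {ols_time:.3f}s")
print(f"Logistic fit time: {logit_time:.3f}s")
print(f"Correlation of the two prediction vectors: "
      f"{np.corrcoef(ols_pred, logit_pred)[0, 1]:.3f}")
print(f"Share of OLS predictions outside [0, 1]: "
      f"{np.mean((ols_pred < 0) | (ols_pred > 1)):.3%}")
```

In a setup like this you would typically see the two prediction vectors very highly correlated, which is the point of the bullet about Xβ: the work that actually matters, specifying the features, is shared by both models.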

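And on the “only linear models” myth, here is a second small sketch (again my own illustration with made-up data, not from the post) of ordinary least squares fitting a clearly non-linear relationship once the features are engineered; “linear” constrains the coefficients, not the shape of the fit.

```python
# Illustrative sketch: OLS fitting a curve via engineered polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(1000, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.1, size=1000)  # non-linear in x

# Plain least squares on degree-5 polynomial features captures the curve.
curved_ols = make_pipeline(PolynomialFeatures(degree=5), LinearRegression()).fit(x, y)
print(f"R^2 with degree-5 polynomial features: {curved_ols.score(x, y):.3f}")
```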
See video here and slides here.

I think it’d be nice to start a debate on this topic!