Why Logistic Regression should be the last thing you learn when becoming a Data Scientist

I recently read a very popular article entitled 5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data Scientist. Here I explain why this should not be the case.

It is nice to have logistic regression on your resume, as many jobs request it, especially in fields such as biostatistics. And if you learned the details during your college classes, good for you. However, it is not the first thing a beginner should learn. Throughout my career, working as an isolated statistician (alongside marketing people, sales people, or engineers) in many of my roles, I had the flexibility to choose which tools and methodology to use. Many practitioners today are in a similar environment. If you are a beginner, chances are that you would use logistic regression as a black-box tool with little understanding of how it works: a recipe for disaster. 

Here are 5 reasons it should be the last thing to learn:

  • There are hundreds of types of logistic regression: some for categorical variables, some with curious names such as Poisson regression. It is confusing for the expert, and even more so for the beginner and for your boss.
  • If you transform your response (often a proportion, or a binary outcome such as fraud / no fraud in this context), you can use linear regression instead. While purists claim that an actual logistic regression is more precise from a theoretical perspective, model precision is irrelevant: it is the quality of your data that matters. A model with 1% extra accuracy does not help if your data contains 20% noise, or if your theoretical model is a rough approximation of reality.
  • Unless properly addressed (e.g. using ridge or Lasso regression), it leads to over-fitting and a lack of robustness against noise and missing or dirty data. The algorithm used to compute the coefficients is unnecessarily complicated (an iterative algorithm relying on techniques such as gradient-based optimization). 
  • The coefficients in a logistic regression are not easy to interpret. When you explain your model to decision makers or non-experts, few will understand.
  • The best models usually blend multiple methods to capture / explain as much variance as possible. I never used pure logistic regression in my 30-year career as a data scientist, yet I have developed a hybrid technique that is more robust, easier to use and to code, and leads to simpler interpretation. It blends "impure" logistic regression with "impure" decision trees in a way that works well, especially for scoring problems in which your data is "impure." You can find the details here.
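The transformed-response and regularization points above can be sketched in a few lines. The snippet below is an illustration on synthetic data, not the author's method: it fits a ridge-penalized (L2) logistic regression, then shows the "transform your response" alternative by binning the data, logit-transforming the empirical proportions, and fitting plain linear regression. All variable names and the binning scheme are my own assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: one feature, binary outcome drawn from a logistic model.
x = rng.normal(size=(500, 1))
p_true = 1 / (1 + np.exp(-(0.8 * x[:, 0] - 0.2)))
y = rng.binomial(1, p_true)

# Option 1: ridge-penalized logistic regression (L2 is sklearn's default),
# which addresses the over-fitting concern raised above.
logit_model = LogisticRegression(penalty="l2", C=1.0).fit(x, y)

# Option 2: logit-transform an observed proportion, then use plain OLS.
# Bin x into deciles and compute the empirical proportion of 1s per bin.
bins = np.quantile(x[:, 0], np.linspace(0, 1, 11))
idx = np.clip(np.digitize(x[:, 0], bins) - 1, 0, 9)
prop = np.array([y[idx == k].mean() for k in range(10)])
centers = np.array([x[idx == k, 0].mean() for k in range(10)])
eps = 1e-3                                   # keep the logit finite
z = np.log((prop + eps) / (1 - prop + eps))  # logit transform
ols = LinearRegression().fit(centers.reshape(-1, 1), z)

# Both approaches recover a positive slope close to the true 0.8.
print(logit_model.coef_[0][0], ols.coef_[0])
```

Both routes estimate the same underlying slope; the transformed-response version is closer in spirit to what the article advocates, at the cost of an arbitrary binning choice and an epsilon fudge near proportions of 0 or 1.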

For related articles from the same author, click here or visit www.VincentGranville.com. Follow me on LinkedIn.


Comment by Dean Abbott on January 21, 2019 at 5:57pm

I've often used linear regression for binary classification. Yes, it's true that it maps from -inf to +inf instead of [0,1] (it lacks the logistic curve transformation), but I've had linear outperform logistic for classification in the past. And I've confirmed this finding with several Professors of Statistics across the country (because I started asking!). "How is that possible?" you might ask. 

One answer is that it depends on the metric you use for goodness of fit. Sometimes for classification we are not interested in classification error per se, or Type I vs. Type II errors (or Precision / Recall or Sensitivity / Specificity or any "Confusion Matrix" metric). These score the entire population, and I presume that logistic will usually win there. Sometimes, however, I am looking for "lift in the top decile" or even "false alarm rate in the top 100 scores" (both of those are actual metrics I have used in models delivered to clients that were put into production). In this case, linear sometimes has won. One reason I think this occurs is that sensitivity to the tails of distributions is more acute in linear than in logistic (because logistic squashes the influence of errors as the probability approaches 0 or 1). But I've never proved it (maybe I should!). I just know that empirically, I keep linear regression in the mix for some problems (based on the error metric I am using). 
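The "lift in the top decile" metric the comment describes can be sketched as follows. This is my own illustrative implementation on synthetic data (the helper name `top_decile_lift` is hypothetical, not from any library): score the population with each model, take the top-scoring 10%, and compare the positive rate there to the base rate.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def top_decile_lift(y_true, scores):
    """Positive rate among the top-scoring 10%, divided by the base rate."""
    n_top = max(1, len(scores) // 10)
    top = np.argsort(scores)[::-1][:n_top]
    return y_true[top].mean() / y_true.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 3))
p = 1 / (1 + np.exp(-(X @ np.array([1.0, -0.5, 0.3]))))
y = rng.binomial(1, p)

# Linear regression scores are unbounded; logistic gives probabilities.
lin_scores = LinearRegression().fit(X, y).predict(X)
log_scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

print(top_decile_lift(y, lin_scores), top_decile_lift(y, log_scores))
```

Because lift only depends on the ranking of the scores, not their scale, linear and logistic models can tie or trade places on this metric even when logistic is better calibrated, which is consistent with the commenter's experience.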

Comment by Vincent Granville on May 24, 2018 at 1:16pm

Dear R Bohn,

We do not delete comments and we like to have contrasting opinions. My main argument is that logistic regression should not be taught first. If you are working with a team of statisticians, the analysis is performed correctly, and the data is sound, I have no doubt that you will understand each other. But when explaining your results to your CEO or engineers, you may face incomprehension. This is similar to my argument that you don't need statistical tests of hypotheses or p-values. Engineers (including myself, though technically I am a statistician) have been using alternative models that produce the same results if not better (being data-driven rather than model-driven), that are easier to understand, and for which you don't even need to know what a random variable is to understand how they work.


Comment by R Bohn on May 24, 2018 at 12:21pm

I am sorry to report that this article is nonsense. The problem is not the conclusion - use it or don't use it, there are now many alternatives to logistic regression (which in the machine learning world is a "linear classifier"). 

The difficulty is that most of the discussion is Just Wrong. Analytically incorrect. No correspondence to the usual definitions, use, and interpretation of logistic regression.

  • The diagram is incomprehensible. If it is intended to be the standard representation of logistic regression, it has multiple errors.
    • LR maps from -infinity to +infinity (on the X scale), not from 0 to 1. 
    • The y axis is correct.  
    • The colors and the points show the curve (called the logistic curve or similar) as the boundary between positive and negative outcomes, for points defined by two independent variables (here x and y). That is not at all what the curve means. See e.g. https://en.wikipedia.org/wiki/File:Logistic-curve.svg
  • "There are hundreds of types of logistic regression." Maybe in a world with a different definition, but the standard definition does not include Poisson models. Of course as always there are a variety of possible algorithms that can be used to solve a logistic model. 
    • From https://www.medcalc.org/manual/logistic_regression.php  "Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).
      In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.)."
  • "If you transform your variable you can instead use linear regression." Yes, and that is how logistic regressions are usually solved! That is, LRs are solved by transforming the variables using the logit transform, logit(p) = ln(p / (1 - p)), and solving the resulting equation, which is linear in the variables. In practice, many other transformation equations can be used instead, but the logit transform has a nice interpretation.
  • "Coefficients are not easy to interpret." I suppose that easy is in the eye of the beholder, but there is a standard and straightforward interpretation. 
    • "The logistic regression coefficients show the change in the predicted logged odds of having the characteristic of interest for a one-unit change in the independent variables." It does take a few examples to figure out what "log odds" means, unless you do a lot of horse racing. But after that, it is a clever and powerful way to think about changes in the probability of an outcome. 
    • The (corrected) version of the logistic curve corresponds to an equivalent way to interpret the coefficient values. 
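The "log odds" interpretation described above can be made concrete with a small example. This is an illustrative sketch on synthetic data (the variable names and the near-zero regularization via a large `C` are my own choices): the fitted coefficient is the change in log odds per unit of x, so exponentiating it gives the odds ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 1))
p = 1 / (1 + np.exp(-(1.2 * X[:, 0])))  # true log-odds slope is 1.2
y = rng.binomial(1, p)

# Large C means almost no regularization, so the fitted coefficient
# approximates the true log-odds slope.
model = LogisticRegression(C=1e6).fit(X, y)
beta = model.coef_[0][0]

# A one-unit increase in x multiplies the odds of y = 1 by exp(beta).
odds_ratio = np.exp(beta)
print(beta, odds_ratio)
```

So a coefficient of about 1.2 means each one-unit increase in x multiplies the odds of the outcome by roughly e^1.2 ≈ 3.3, which is the "clever and powerful" reading the comment refers to.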

There certainly are some mild criticisms of logistic regression, but in situations where a linear model is reasonably accurate, it is a good quick model to try. Of course, if the situation is highly nonlinear, a tree model is going to be better. Furthermore, the particular logistic equation generally used should not be considered sacred. 

My interpretation is that this article is an attack on a straw man, an undefined and radically unconventional model that is here being called "logistic regression." It would be a shame if anyone took it seriously. We will see if the author/site manager leaves this comment up. If he does, I invite him to respond and explain the meaning of the diagram. 

By the way, I agree with much of the discussion on the medcalc website I'm quoting, but not all of it. 

Comment by Rodolfo Antonio Muriel Rodríguez on May 22, 2018 at 8:10am

Vincent Granville, thanks for the answer. Do you recommend any book?

Comment by Vincent Granville on May 22, 2018 at 7:03am

First things to learn should probably be:

  • An overview of how algorithms work
  • Different types of data and data issues (missing data, duplicated data, errors in data)
  • How to identify useful metrics
  • Lifecycle of data science projects
  • Introduction to programming languages
  • Communicating results to non-experts and understanding requests from decision makers (translating requests into action items for the data scientist)
  • Overview of popular techniques with pluses and minuses, and when to use them
  • Case studies

Comment by Rodolfo Antonio Muriel Rodríguez on May 22, 2018 at 6:06am

Thanks for the article...but what should be the first thing to learn?

© 2021 TechTarget, Inc.