I recently read a very popular article entitled 5 Reasons “Logistic Regression” should be the first thing you learn when becoming a Data Scientist. Here I provide my opinion on why this should not be the case.
It is nice to have logistic regression on your resume, as many jobs request it, especially in fields such as biostatistics. And if you learned the details in your college classes, good for you. However, it is not the first thing a beginner should learn. In my career, I have often been an isolated statistician (working with marketing people, salespeople, or engineers), with the flexibility to choose my own tools and methodology. Many practitioners today are in a similar environment. If you are a beginner, chances are you would use logistic regression as a black-box tool with little understanding of how it works: a recipe for disaster.
Here are 5 reasons it should be the last thing to learn:
For related articles from the same author, visit www.VincentGranville.com. Follow me on LinkedIn.
I've often used linear regression for binary classification. Yes, it's true that linear regression maps to (-inf, +inf) instead of [0, 1] (the range logistic regression achieves via the logistic curve transformation), but I've had linear outperform logistic for classification in the past. And I've confirmed this finding with several professors of statistics across the country (because I started asking!). "How is that possible?" you might ask.
One answer is that it depends on the metric you use for goodness of fit. Sometimes for classification we are not interested in classification error per se, or in Type I vs. Type II errors (or precision/recall, sensitivity/specificity, or any other confusion-matrix metric). Those metrics score the entire population, and I presume that logistic will usually win there. Sometimes, however, I am looking for "lift in the top decile" or even "false alarm rate in the top 100 scores" (both are actual metrics I have used in models delivered to clients and put into production). In those cases, linear has sometimes won. One reason, I think, is that linear regression is more sensitive to the tails of the distributions than logistic regression (because the logistic function squashes the influence of errors as the probability approaches 0 or 1). But I've never proved it (maybe I should!). I just know that, empirically, I keep linear regression in the mix for some problems, depending on the error metric I am using.
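The distinction between whole-population metrics and top-of-the-list metrics can be made concrete. The sketch below is my own illustration, not from the original comment: it fits a linear model (ordinary least squares on the 0/1 labels) and a logistic model (plain gradient ascent) to synthetic data in NumPy, then compares them on "lift in the top decile". The data-generating process, the learning rate, and the `top_decile_lift` helper are all assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one informative feature with heavy-tailed noise
n = 5000
x = rng.standard_t(df=3, size=n)       # heavy tails
p = 1 / (1 + np.exp(-1.5 * x))         # true class probabilities
y = (rng.random(n) < p).astype(float)
X = np.column_stack([np.ones(n), x])   # add intercept column

# Linear regression fit directly to the 0/1 labels (ordinary least squares)
beta_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
score_lin = X @ beta_lin

# Logistic regression fit by plain gradient ascent on the log-likelihood
beta_log = np.zeros(2)
for _ in range(2000):
    p_hat = 1 / (1 + np.exp(-(X @ beta_log)))
    beta_log += 0.1 * X.T @ (y - p_hat) / n
score_log = X @ beta_log

def top_decile_lift(score, y):
    """Positive rate among the top 10% of scores, divided by the base rate."""
    k = len(y) // 10
    top = np.argsort(score)[-k:]
    return y[top].mean() / y.mean()

print(f"lift (linear):   {top_decile_lift(score_lin, y):.2f}")
print(f"lift (logistic): {top_decile_lift(score_log, y):.2f}")
```

On a one-feature toy problem like this the two rankings are nearly identical, so the lifts come out close; the point is only to show how a "top decile" metric scores a ranked list rather than the whole population, which is the setting where the commenter reports linear sometimes winning.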
Dear R Bohn,
We do not delete comments, and we welcome contrasting opinions. My main argument is that logistic regression should not be taught first. If you are working with a team of statisticians, the analysis is performed correctly, and the data is sound, I have no doubt that you will understand each other. But when explaining your results to your CEO or to engineers, you may face incomprehension. This is similar to my argument that you don't need statistical tests of hypotheses or p-values. Engineers (including myself, despite technically being a statistician) have been using alternate models that produce the same results if not better ones (being data-driven rather than model-driven), are easier to understand, and don't even require knowing what a random variable is.
Vincent
I am sorry to report that this article is nonsense. The problem is not the conclusion: use it or don't use it; there are now many alternatives to logistic regression (which, in the machine learning world, is a "linear classifier").
The difficulty is that most of the discussion is Just Wrong. Analytically incorrect. No correspondence to the usual definitions, use, and interpretation of logistic regression.
There certainly are some mild criticisms of logistic regression, but in situations where a linear model is reasonably accurate, it is a good quick model to try. Of course, if the situation is highly nonlinear, a tree model is going to be better. Furthermore, the particular logistic equation generally used should not be considered sacred.
My interpretation is that this article is an attack on a straw man, an undefined and radically unconventional model that is here being called "logistic regression." It would be a shame if anyone took it seriously. We will see if the author/site manager leaves this comment up. If he does, I invite him to respond and explain the meaning of the diagram.
By the way, I agree with much of the discussion on the medcalc website I'm quoting, but not all of it.
Vincent Granville: Thanks for the answer. Do you recommend any book?
First things to learn should probably be:
Thanks for the article...but what should be the first thing to learn?
© 2020 Data Science Central