After a few biostatistics classes, I began fitting my first logistic regression model using my physician friend’s data on tumors excised from skin cancer patients. Although we were very clear about the dependent variable we were trying to predict (a certain feature of the tumor), I realized I did not know how to pick the independent variables that belonged in the model. We only had a few to choose from in our dataset, so I put them all into the model, but I wasn’t sure what to do next. Remove the ones whose slope had a p > 0.05? Add interactions?
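The starting point described above, a logistic regression containing every candidate variable, might look roughly like the sketch below. Everything in it is an assumption made for illustration: the data are simulated, the variable names `thickness` and `noise` are hypothetical, and the fit is a bare-bones Newton-Raphson routine written with only the Python standard library. In practice a statistics package would do this for you; the sketch only shows where the slopes and their p-values come from.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression with an intercept by Newton-Raphson.

    Returns (coefficients, standard errors), intercept first.
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for r, yi in zip(rows, y):
            eta = sum(b * x for b, x in zip(beta, r))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (yi - p) * r[i]
                for j in range(k):
                    info[i][j] += w * r[i] * r[j]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
    # Standard errors from the information matrix of the final Newton step,
    # which is effectively evaluated at the converged estimate.
    se = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        se.append(math.sqrt(solve(info, unit)[i]))
    return beta, se


def wald_p_values(coefs, ses):
    """Two-sided Wald p-values: p = erfc(|z| / sqrt(2)), z = coef / se."""
    return [math.erfc(abs(b / s) / math.sqrt(2)) for b, s in zip(coefs, ses)]


# Simulated data (an assumption, not the tumor dataset from the post):
# 'thickness' truly affects the outcome, 'noise' does not.
random.seed(0)
X, y = [], []
for _ in range(300):
    thickness = random.gauss(0, 1)
    noise = random.gauss(0, 1)
    prob = 1.0 / (1.0 + math.exp(-(-0.5 + 1.0 * thickness)))
    X.append([thickness, noise])
    y.append(1 if random.random() < prob else 0)

names = ["intercept", "thickness", "noise"]
coefs, ses = fit_logistic(X, y)
pvals = wald_p_values(coefs, ses)

# The "full model": every candidate variable, with its slope and p-value.
for name, b, p in zip(names, coefs, pvals):
    print(f"{name:>10}  coef={b: .3f}  p={p:.4f}")
```

The printed table is the thing the question above is about: once every variable has a slope and a p-value, you still need a rule for deciding which variables stay.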
I asked one of my professors what to do, and in her own idiosyncratic way, she described what I will call “stepwise selection”. Hosmer is among the co-authors of this landmark article, which compares different methods of deciding which independent variables to put in a regression model, and the authors use the term stepwise selection to mean what my professor described. I will therefore use that term in this post, but I have observed a lot of confusion about what these different terms mean, so I will be careful to define exactly what I mean as I continue.
The landmark article covers a few different approaches, three of which I will summarize in this table:
My opinions of these approaches are as follows:
Stepwise selection has these characteristics:
At the end of the second round, you have your final model, and you have arrived at it without inflating Type I error or inviting criticism for lacking a replicable and transparent modeling strategy. You also come away with an understanding and interpretation of what you found. What’s great about this process is that you really get your hands dirty with your data and learn how the variables behave when they are put in the model together. I often think of it as putting young children in a sandbox together and seeing which ones get along and which ones fight. The information you glean from stepwise selection can really help you later, when you are writing an interpretation of your final model.
Comment
If you use a robust regression model (like the one I designed, see here), then you can use all variables, even duplicate copies of the same variable: they won't have any impact on the final results, and multicollinearity won't be an issue.