How do you know what independent variables to include in a regression model?

After a few biostatistics classes, I began fitting my first logistic regression model using my physician friend’s data on tumors excised from skin cancer patients. I realized that although we were very clear about the dependent variable we were trying to predict – a certain feature of the tumor – I really did not know how to pick the independent variables that belonged in the model. We only had a few to choose from in our dataset, and I put them all into the model, but I wasn’t really sure what to do next. Remove the ones that had a slope with a p ≥ 0.05? Add interactions?

I asked one of my professors what to do, and in her own idiosyncratic way, she seemed to describe what I will call “stepwise selection”. Hosmer is among the co-authors of this landmark article, which compares different methods of deciding which independent variables to put in a regression model, and the authors use the term stepwise selection to mean what my professor described. I will therefore use that term in this post, but I have observed that there is a lot of confusion about what these different terms mean. Consequently, I will be very careful to define exactly what I mean as I continue.

The landmark article covers a few different approaches, three of which I will summarize in this table:

My opinions of these approaches are as follows:

  • Backward elimination: This is the most popular answer I get when I ask people their modeling approach, and often, they say they do it because they were educated to use this approach. But when I ask them how they overcome the problem of the potential overload of variables causing small cells and collinearity in the initial model, I often am met with a sheepish grin. Analysts have various “secret” ways of handling this (e.g., looking at the variables causing trouble and removing them from the candidate pool). I do not like this approach because I’d prefer to have a modeling strategy that is pre-specified, replicable, and that I can report transparently.
  • Forward selection: I have seen this taught as a simple way to make models, but I am against this approach. The authors of the landmark article have the same opinion, and demonstrate using data how this strategy produces models that are far inferior to those produced through either backward elimination or stepwise selection.
  • Stepwise selection: As the landmark article demonstrates, through this process, a model that fits as well as a model developed through backward elimination can be created. However, I strongly prefer stepwise selection to backward elimination because of certain characteristics of stepwise selection.

Stepwise selection has these characteristics:

  • A list of candidate independent variables is created before modeling commences. In other words, we don’t want to “fish” and just try every variable – just the ones we think should qualify to be in the model.
  • Also prior to modeling, a set of criteria is decided upon for choosing which independent variables to retain in the model after each iteration. In epidemiology, I tend to say that I will retain variables after each iteration that either have a slope with a p < 0.05, or that I feel need to be in there for some empirical reason (e.g., I am controlling for certain components of the underlying study design).
  • Model iterations are done manually in two rounds. The first round determines the working model. In the first round, one candidate variable is introduced into the model in each iteration. If it qualifies to be retained, it is, and if not, it is removed. In fact, all of the variables in the model are evaluated in each iteration, and any that do not meet the criteria are removed, even if they met them in a previous iteration. In other words, this process can remove variables that were retained in earlier iterations. At the end of this first round, the analyst has what I call a “working model”.
  • The second round’s goal is to try to “break” the working model. This is done by re-introducing variables that were not retained in the working model into the working model one at a time. After each re-introduction, the retention criteria are applied to all variables in the model, and those that no longer meet the criteria are removed. Often, you will find some tussling between variables that are collinear with one another in this step. This is where the dicey modeling decisions are made. The result of this round is the final model.
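
As a rough illustration, the two rounds above can be sketched in Python. This is a minimal sketch under several assumptions, not a definitive implementation. The names `prune`, `stepwise_select`, `pvalues_for`, and `forced` are hypothetical. The function `pvalues_for` stands in for fitting the model and reading off one p-value per slope. The post does not say whether several failing variables are dropped at once or one at a time, so the sketch assumes the one with the largest p-value is dropped first and the model is then re-checked. The `toy_pvalues` numbers are invented purely to exercise the logic and do not come from any dataset.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical callback: given the variables currently in the model,
# return a p-value for each variable's slope.
PValueFn = Callable[[List[str]], Dict[str, float]]


def prune(model: List[str], pvalues_for: PValueFn,
          alpha: float, forced: set) -> List[str]:
    """Apply the retention criteria to every variable in the model.

    Assumption: the failing variable with the largest p-value is dropped,
    then the model is re-evaluated, until all remaining variables pass.
    Variables in `forced` are always retained.
    """
    model = list(model)
    while True:
        pvals = pvalues_for(model)
        failing = [v for v in model if v not in forced and pvals[v] >= alpha]
        if not failing:
            return model
        model.remove(max(failing, key=lambda v: pvals[v]))


def stepwise_select(candidates: List[str], pvalues_for: PValueFn,
                    alpha: float = 0.05,
                    forced: Iterable[str] = ()) -> Tuple[List[str], List[str]]:
    """Return (working_model, final_model) from the two-round procedure."""
    forced = set(forced)
    # Variables kept for design reasons start in the model.
    model = [v for v in candidates if v in forced]

    # Round 1: introduce one candidate per iteration, re-checking all variables.
    for var in candidates:
        if var not in model:
            model = prune(model + [var], pvalues_for, alpha, forced)
    working = list(model)

    # Round 2: try to "break" the working model by re-introducing, one at a
    # time, each candidate that was not retained in round 1.
    rejected = [v for v in candidates if v not in working]
    for var in rejected:
        if var not in model:
            model = prune(model + [var], pvalues_for, alpha, forced)
    return working, model


def toy_pvalues(model: List[str]) -> Dict[str, float]:
    """Invented p-values for illustration only (not real data)."""
    base = {"age": 0.01, "sex": 0.40, "site": 0.03, "depth": 0.02}
    p = {v: base[v] for v in model}
    # Mimic collinearity: "site" loses significance once "depth" is present.
    if "site" in model and "depth" in model:
        p["site"] = 0.30
    return p


working_model, final_model = stepwise_select(
    ["age", "sex", "site", "depth"], toy_pvalues)
```

In practice, `pvalues_for` would wrap an actual model fit, such as a logistic regression, and return the p-value of each slope. Passing variables through `forced` corresponds to retaining them for study-design reasons regardless of their p-values.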

At the end of the second round, you have your final model, all the while avoiding the inflation of Type I error, or criticism for a lack of a replicable and transparent modeling strategy. You also come away with an understanding and interpretation of what you found. What’s great about this process is that you are really getting your hands dirty with your data, and you are understanding how the variables are behaving when they are put in the model together. I often think of it as putting young children in a sandbox together, and seeing which ones get along and which ones fight. The information you glean from the process of stepwise selection can really help you later, when you are writing an interpretation of your final model.


Comment by Vincent Granville on November 15, 2020 at 8:27am

If you use a robust regression model (like the one I designed, see here), then you can use all variables, even duplicate copies of the same variable (they won't have any impact on the final results, and multicollinearity won't be an issue).
