For statistical models, selecting those predictors is what tests the steel of data scientists. It is really challenging to lay out the steps, as for every step, they should evaluate the situation and make decisions for the next or upcoming steps. It is a completely different story when running predictive models, and if relationship among the variables is not the main focus, situations get easier. Data analysts can go ahead to run step-wise regression models, empowering the data to give best predictions. However; if the main focus is on answering research questions that describe relationships, it can give analysts a really tough time.

If they willingly ignore practical issues, it’s easy for them to use theory or test research questions. To exemplify, say there are 10 different variables, and all of them measure the same theoretical construct, without the clarity as to which one to use. Or that decision analysts could, theoretically, make the case for all 40 demographic control variables. But when you put them all in together, all of their coefficients become non-significant.

This is one way of looking at things. Cook and Weisberg in Residuals and Influence, Draper and Smith in Applied Regression Analysis, 3rd Edition; to name a few, are of the opinion that step-wise process usually is not capable of giving out the best predictions. They say that users seeking the best of prediction models should go ahead and use branch-and-bound algorithm, though it’s gone the way of the Dodo. Outside of that, the step-wise algorithms are only of historical interest. Branch-and-bound methods are an imperfect solution, though they are much faster than all possible regressions; they remain exponential time algorithms.

**So the question is how to do it? There are some guidelines, which if considered religiously; can assure of if not the “best”, “near to best predictions”.**

1. Regression coefficients are marginal results

It means that coefficient for each predictor has its unique effect on the response variable, though these effects are not at their fullest, unless all of them are independent. It’s the effect that emerges after controlling other variables in the model. What else is there in the model?

Coefficients can change quite a bit, depending on what else is in the model. It’s really interesting that if two or more predictors overlap in the way they explain the outcomes, the overlap will not get highlighted in wither of the regression coefficient. It’s in the overall model F statistic and the R-squared, but not the coefficients.

2. Instead of bell curves, look out for interesting breaks in the middle of the distribution

It will not be an exaggeration to say” It is mandatory to always start with descriptive statistics”. It certainly will help in finding errors, which one might have missed during data cleansing. But more importantly, it will highlight what you are working with. Univariate descriptive graphs to start with. Remember, the motive here is to not look out for bell curves, but for those interesting breaks in the middle of the distribution.

But more importantly, you have to know what you’re working with; as they are values with huge number of points. They are the values with more than usual or less variation than expected. Upon plugging these variables in the model, they might act funny. So you should be aware of how they would look, to have a better understanding why.

3. Running bivariate descriptive including graphs, should be the next step

Understanding of how each potential predictor relates to the outcome and to other predictors is of great help. As mentioned earlier, regression coefficients are marginal results, and knowing the bivariate relationships amongst these variables helps analysts attain insights into why certain variable lose significance in bigger models.

In addition to correlations or crosstabs, scatterplots of the relationship are extremely informative. The reason to feel so is that, this is where you can see if linear relationships are plausible, or in times when you are required to deal with nonlinearity in some or other way.

4. Think of predictors as datasets

Predictors in theoretically distinct sets, empower you to see how related variables can work together and then what will happen once they all are put together. If we take the case of binge drinking in youngsters; potential sets of variables may include:

- Demographics (age, year in school, socio-economic status)
- History of Mental Health (diagnoses of mental illness, family history of alcoholism)
- Current psychological health (stress, depression)
- Social issues (feelings of isolation, connection to family, number of friends)

Usually, variables within a set are correlated, but not across sets. So by putting everything in at once, it becomes challenging to assess or find relationships; & it will turn out to be a huge mess. Hence, building each set separately, building a theoretically meaningful model with solid understanding of how the pieces fit together – makes much more sense.

5. Build model while you interpret and vice versa..!!

Data scientists should stop and listen to the story the model, they run has to tell.

Did the R-squared change? Take time out to look at the coefficients. How much have the coefficients changed from the model with control variables to one without?

Pause to do all this, as it helps in making better decisions on the model to run next.

6. Include every single variable used in an interaction, in the model

Getting rid of unwanted and things that are not significant becomes easier, upon deciding what to leave in and what to boot from the model.

Eliminating non-significant interactions first is always a good idea. The only exception to this is when the interaction is central to the research question, and is important to portray that it was not significant.

In case the interaction is significant, you can go ahead and take out the terms from the component variables. These should be the ones which make up interactions. The interpretation of the interaction is only possible if the component term is in the model.

7. Stay focused on the research question

Increased data influx assures you of large data sets to work on. And while doing so, it is very easy to step off the brick road – straight into the poppies. You may discover so many interesting relationships, which of course will be shiny. Ultimately a few months down the line, you would be testing every possible predictor, categorized in every possible manner; but all this without any considerable progress.

Stay focused on your destination – the research question. Writing it and taping it on to the wall should certainly help.

**Appropriate interpretations and not “correctness”, should be the focus**

Aforementioned steps are for any type of model-linear regression, ANOVA, logistic regression, mixed models. Considering them or at least giving them a first preference while doing statistical analysis – is what is suggested.

The point here is that they are more useful for prediction models as compared to theoretical ones. Branch and bound methods though seem easily to adapt, may not prove that helpful in such situations. Also allowing interactions without main effects is not always an error. However, would stand by the fact that meaning to several model parameters change as and when these parameters are removed or added, for that matter. Researchers or data scientists, trying to assume a certain meaning of their interaction coefficient, should realize that it likely would have changed. The point of importance here is appropriate interpretations than “correctness”.

© 2019 Data Science Central ® Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

**Technical**

- Free Books and Resources for DSC Members
- Learn Machine Learning Coding Basics in a weekend
- New Machine Learning Cheat Sheet | Old one
- Advanced Machine Learning with Basic Excel
- 12 Algorithms Every Data Scientist Should Know
- Hitchhiker's Guide to Data Science, Machine Learning, R, Python
- Visualizations: Comparing Tableau, SPSS, R, Excel, Matlab, JS, Pyth...
- How to Automatically Determine the Number of Clusters in your Data
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- Fast Combinatorial Feature Selection with New Definition of Predict...
- 10 types of regressions. Which one to use?
- 40 Techniques Used by Data Scientists
- 15 Deep Learning Tutorials
- R: a survival guide to data science with R

**Non Technical**

- Advanced Analytic Platforms - Incumbents Fall - Challengers Rise
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- How to Become a Data Scientist - On your own
- 16 analytic disciplines compared to data science
- Six categories of Data Scientists
- 21 data science systems used by Amazon to operate its business
- 24 Uses of Statistical Modeling
- 33 unusual problems that can be solved with data science
- 22 Differences Between Junior and Senior Data Scientists
- Why You Should be a Data Science Generalist - and How to Become One
- Becoming a Billionaire Data Scientist vs Struggling to Get a $100k Job
- Why do people with no experience want to become data scientists?

**Articles from top bloggers**

- Kirk Borne | Stephanie Glen | Vincent Granville
- Ajit Jaokar | Ronald van Loon | Bernard Marr
- Steve Miller | Bill Schmarzo | Bill Vorhies

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives**: 2008-2014 | 2015-2016 | 2017-2019 | Book 1 | Book 2 | More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of Data Science Central to add comments!

Join Data Science Central