As I mentioned in our post on Data Preparation Mistakes, we've built many predictive models in the Rapid Insight office. During the predictive modeling process, there are many places where it's easy to make mistakes. Luckily, we've compiled a few here so you can learn from our mistakes and avoid them in your own analyses:
Failing to consider enough variables
When deciding which variables to audition for a model, you want to include anything you have on hand that you think could possibly be predictive. Weeding out the extra variables is something that your modeling program will do, so don’t be afraid to throw the kitchen sink at it for your first pass.
Not hand-crafting some additional variables
Any guide-list of variables should be used as just that – a guide – enriched by other variables that may be unique to your institution. If there are few unique variables to be had, consider creating some to augment your dataset. Try adding new fields like “distance from institution” or creating riffs and derivations of variables you already have.
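As one illustration of a hand-crafted variable, the sketch below derives a distance-from-institution field from latitude and longitude using the haversine great-circle formula. The field names (`lat`, `lon`, `distance_miles`), the sample applicants, and the campus coordinates are all hypothetical; substitute whatever your own data uses.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # approximate mean radius of the Earth


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = math.radians(lat2 - lat1)
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def add_distance(records, campus_lat, campus_lon):
    """Return copies of the records with a new 'distance_miles' field."""
    return [
        {**r, "distance_miles": haversine_miles(r["lat"], r["lon"],
                                                campus_lat, campus_lon)}
        for r in records
    ]


# Hypothetical example data: two applicants and a made-up campus location.
applicants = [
    {"id": 1, "lat": 43.0, "lon": -71.0},
    {"id": 2, "lat": 44.0, "lon": -71.0},
]
enriched = add_distance(applicants, campus_lat=43.0, campus_lon=-71.0)
```

Straight-line distance is only a rough stand-in for travel time, but it is cheap to compute and gives the modeling program one more candidate to audition.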
Selecting the wrong Y-variable
When building your dataset for a logistic regression model, you’ll want to select the less frequent of the two responses as your Y-variable. A great example from the higher ed world comes from building a retention model. In most cases, you’ll actually want to model attrition, identifying those students who are likely to leave (hopefully the smaller group!) rather than those who are likely to stay.
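A minimal sketch of that choice, assuming a binary outcome stored as text labels: count each response, treat the rarer one as the event, and code it as 1. The labels `"retained"` and `"left"` and the counts are made up for illustration.

```python
from collections import Counter


def rarer_outcome(outcomes):
    """Return the label that occurs least often in the outcome column."""
    counts = Counter(outcomes)
    return min(counts, key=counts.get)


def encode_target(outcomes):
    """Code the rarer outcome as 1 and everything else as 0."""
    event = rarer_outcome(outcomes)
    return event, [1 if o == event else 0 for o in outcomes]


# Hypothetical retention data: 850 students stayed, 150 left.
statuses = ["retained"] * 850 + ["left"] * 150
event, y = encode_target(statuses)
```

Here `event` comes out as `"left"`, so the model is an attrition model rather than a retention model.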
Not enough Y-variable responses
Along with making sure that your model population is large enough (1,000 records minimum) and spans enough time (3 years is good), you’ll want to make sure that there are enough Y-variable responses to model. Generally, you’ll want to shoot for at least 100 instances of the response you’d like to model.
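These rules of thumb are easy to check before any modeling starts. The sketch below assumes each record is a dictionary with a hypothetical `cohort_year` field and a 0/1 `attrited` field, and it uses the number of distinct years as a rough proxy for how much time the data spans.

```python
def check_modeling_data(records, y_key="attrited", year_key="cohort_year",
                        min_records=1000, min_years=3, min_events=100):
    """Compare a dataset against the rule-of-thumb minimums."""
    n_records = len(records)
    n_years = len({r[year_key] for r in records})        # distinct years present
    n_events = sum(1 for r in records if r[y_key] == 1)  # Y-variable responses
    return {
        "n_records": n_records,
        "n_years": n_years,
        "n_events": n_events,
        "enough_records": n_records >= min_records,
        "enough_years": n_years >= min_years,
        "enough_events": n_events >= min_events,
    }


# Hypothetical dataset: 1,200 students over three cohorts, 120 of whom left.
students = [
    {"cohort_year": 2020 + i % 3, "attrited": 1 if i < 120 else 0}
    for i in range(1200)
]
report = check_modeling_data(students)
```

The thresholds are passed as parameters so they can be adjusted; the defaults simply mirror the minimums suggested above.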
Building a model on the wrong population
To borrow an example from the world of fundraising, a model built to predict future giving will look a lot different for someone with a giving history than someone who has never given before. Consider which population you’d eventually like to use the model to score and build the model tailored to that population, or consider building two models, one for each sub-group.
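If you go the two-model route, the first step is simply partitioning the data before fitting anything. A minimal sketch, assuming a hypothetical `lifetime_gift_count` field on each constituent record:

```python
def split_by_giving_history(constituents, gift_key="lifetime_gift_count"):
    """Partition constituents into prior donors and those who have never given."""
    donors, never_given = [], []
    for person in constituents:
        # Treat a missing gift count as zero gifts (an assumption, not a rule).
        if person.get(gift_key, 0) > 0:
            donors.append(person)
        else:
            never_given.append(person)
    return donors, never_given


# Hypothetical constituent records.
constituents = [
    {"id": 1, "lifetime_gift_count": 4},
    {"id": 2, "lifetime_gift_count": 0},
    {"id": 3},
]
donors, never_given = split_by_giving_history(constituents)
```

Each sub-group would then get its own model, built and scored only on records of the same kind.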
Judging the quality of a model using one measure
It’s difficult to capture the quality of a model in a single number, which is why modeling outputs provide so many model fit measures. Beyond the numbers, graphic outputs like decile analysis and lift analysis can provide visual insight into how well the model is fitting your data and what the gains from using a model are likely to be.
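For readers who haven’t seen one, a decile analysis can be sketched in a few lines: sort records by model score, cut them into ten equal-sized groups, and compare each group’s response rate to the overall rate. The ratio is the lift. The function below is an illustrative sketch rather than the output of any particular modeling program, and the scores and outcomes are synthetic.

```python
def decile_lift(scores, outcomes, n_bins=10):
    """Response rate and lift for each score-ranked bin (bin 1 = highest scores)."""
    if len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must be the same length")
    if not scores:
        raise ValueError("no records to analyze")
    # Rank records from highest to lowest predicted score.
    ranked = sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    overall_rate = sum(o for _, o in ranked) / n
    rows = []
    for i in range(n_bins):
        chunk = ranked[i * n // n_bins:(i + 1) * n // n_bins]
        if not chunk:
            continue  # fewer records than bins
        rate = sum(o for _, o in chunk) / len(chunk)
        lift = rate / overall_rate if overall_rate else float("nan")
        rows.append({"decile": i + 1, "n": len(chunk), "rate": rate, "lift": lift})
    return rows


# Synthetic example: 100 records where only the ten highest scores respond.
scores = [i / 100 for i in range(100)]
outcomes = [1 if i >= 90 else 0 for i in range(100)]
table = decile_lift(scores, outcomes)
```

In this deliberately perfect example the top decile captures every response. With a real model, look for lift that is well above 1 in the top deciles and falls off steadily toward the bottom.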
If you’re not sure which model measures to focus on, ask around. If you know someone building models similar to yours, see which ones they rely on and what ranges they shoot for. The take-home point is that with all of the information available on a model output, you’ll want to consider multiple gauges before deciding whether your model is worth moving forward with.