Six Predictive Modeling Mistakes

As I mentioned in our post on Data Preparation Mistakes, we've built many predictive models in the Rapid Insight office. During the predictive modeling process, there are many places where it's easy to make mistakes. Luckily, we've compiled a few here so you can learn from our mistakes and avoid them in your own analyses:

Failing to consider enough variables
When deciding which variables to audition for a model, you want to include anything you have on hand that you think could possibly be predictive. Weeding out the extra variables is something that your modeling program will do, so don’t be afraid to throw the kitchen sink at it for your first pass.
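As a minimal sketch of that "kitchen sink first, weed out later" idea, here is a simple univariate screen: correlate each candidate variable with the outcome and keep anything with even modest signal, leaving finer pruning to the modeling step. The variable names and values are hypothetical, not from any real dataset:

```python
# Sketch: screen a wide candidate set, letting a simple filter weed out
# variables with no apparent signal. All names and values are made up.
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) * sx * sy)

# Kitchen-sink candidate set: some plausibly informative, some pure noise.
data = {
    "hs_gpa":    [3.9, 3.1, 2.5, 3.7, 2.2, 3.8],
    "shoe_size": [9, 10, 9, 11, 10, 9],  # almost certainly noise
}
retained = [1, 0, 0, 1, 0, 1]  # 1 = student retained

# Keep anything with even modest correlation; the modeling step prunes further.
kept = [name for name, vals in data.items()
        if abs(correlation(vals, retained)) > 0.1]
print(kept)
```

A real modeling tool applies far more sophisticated selection, but the principle is the same: let the process, not your intuition, decide which variables stay.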
Not hand-crafting some additional variables
Any guide-list of variables should be used as just that – a guide – enriched by other variables that may be unique to your institution. If there are few unique variables to be had, consider creating some to augment your dataset. Try adding new fields like “distance from institution” or creating riffs and derivations of variables you already have.
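The two kinds of derived fields mentioned above can be sketched in a few lines: a brand-new variable (distance from the institution, via the haversine formula) and a riff on fields you already have (a credits-earned ratio). The field names and coordinates here are hypothetical:

```python
# Sketch: deriving new variables from data already on hand.
# Field names and coordinates are hypothetical examples.
from math import radians, sin, cos, asin, sqrt

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

CAMPUS = (43.70, -72.29)  # hypothetical institution location

student = {"home_lat": 42.36, "home_lon": -71.06,
           "credits_attempted": 16, "credits_earned": 12}

# One brand-new field, one derivation of fields you already have.
student["distance_from_institution"] = round(
    miles_between(student["home_lat"], student["home_lon"], *CAMPUS), 1)
student["completion_ratio"] = student["credits_earned"] / student["credits_attempted"]
```

Either derived field may turn out to be more predictive than the raw fields it came from; that is the point of auditioning them.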
Selecting the wrong Y-variable
When building your dataset for a logistic regression model, you’ll want to select the response with the smaller number of data points as your y-variable. A great example of this from the higher ed world would come from building a retention model. In most cases, you’ll actually want to model attrition, identifying those students who are likely to leave (hopefully the smaller group!) rather than those who are likely to stay.
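In code, choosing attrition over retention is often just a matter of flipping the 0/1 flag so the rarer outcome becomes the modeled response. A tiny sketch with made-up retention flags:

```python
# Sketch: flipping the target so the smaller group is the modeled response.
# The `retained` flags are hypothetical.
retained = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]  # 1 = student stayed

# Model attrition (the rarer outcome), not retention.
attrition = [1 - r for r in retained]

print(f"modeling {sum(attrition)} leavers out of {len(attrition)} students")
```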
Not enough Y-variable responses
Along with making sure that your model population is large enough (1,000 records minimum) and spans enough time (3 years is good), you’ll want to make sure that there are enough Y-variable responses to model. Generally, you’ll want to shoot for at least 100 instances of the response you’d like to model.
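Those two thresholds are easy to turn into a pre-modeling sanity check. The sketch below encodes the rules of thumb from this post (1,000 records, 100 responses) as defaults; your own minimums may differ:

```python
# Sketch of a pre-modeling sanity check using the rules of thumb above:
# at least 1,000 records and at least 100 instances of the modeled response.
def enough_to_model(y, min_records=1000, min_responses=100):
    """Return (ok, message) for a 0/1 response list `y`."""
    n, positives = len(y), sum(y)
    if n < min_records:
        return False, f"only {n} records; need {min_records}"
    if positives < min_responses:
        return False, f"only {positives} responses; need {min_responses}"
    return True, "population and response counts look sufficient"

# 1,200 records but only 60 responses: big enough file, too few responses.
ok, msg = enough_to_model([1] * 60 + [0] * 1140)
print(ok, msg)
```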
Building a model on the wrong population
To borrow an example from the world of fundraising, a model built to predict future giving will look a lot different for someone with a giving history than someone who has never given before. Consider which population you’d eventually like to use the model to score and build the model tailored to that population, or consider building two models, one for each sub-group.
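Splitting the file into those two sub-populations is the easy part; the sketch below partitions a hypothetical fundraising file on giving history so that each group can get its own model (record fields are made up for illustration):

```python
# Sketch: splitting a fundraising file into the two sub-populations
# described above, so each can get its own model. Fields are hypothetical.
constituents = [
    {"id": 1, "lifetime_giving": 250.0},
    {"id": 2, "lifetime_giving": 0.0},
    {"id": 3, "lifetime_giving": 25.0},
    {"id": 4, "lifetime_giving": 0.0},
]

prior_donors = [c for c in constituents if c["lifetime_giving"] > 0]
never_givers = [c for c in constituents if c["lifetime_giving"] == 0]

# Build (and later score with) a separate model for each group.
print(len(prior_donors), "prior donors,", len(never_givers), "never-givers")
```

The key discipline is scoring each record with the model built on its own sub-population, never mixing the two.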
Judging the quality of a model using one measure
It’s difficult to capture the quality of a model in a single number, which is why modeling outputs provide so many model fit measures. Beyond the numbers, graphic outputs like decile analysis and lift analysis can provide visual insight into how well the model is fitting your data and what the gains from using a model are likely to be.
If you’re not sure which model measures to focus on, ask around. If you know someone building models similar to yours, see which ones they rely on and what ranges they shoot for. The take-home point is that with all of the information available on a model output, you’ll want to consider multiple gauges before deciding whether your model is worth moving forward with.  
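To make the decile/lift idea concrete, here is a minimal illustration with toy data (not the output of any particular modeling tool): sort records by model score, cut them into ten equal groups, and compare each group's response rate to the overall base rate.

```python
# Sketch of a simple decile/lift calculation: sort by model score, cut into
# equal groups, and compare each group's response rate to the overall rate.
def decile_lift(scores, outcomes, n_bins=10):
    """Return lift (group response rate / overall rate) per score decile."""
    overall = sum(outcomes) / len(outcomes)
    ranked = sorted(zip(scores, outcomes), key=lambda p: -p[0])
    size = len(ranked) // n_bins
    lifts = []
    for i in range(n_bins):
        group = ranked[i * size:(i + 1) * size]
        rate = sum(y for _, y in group) / len(group)
        lifts.append(round(rate / overall, 2))
    return lifts

# Toy scores where high scores track the response: top deciles should
# show lift well above 1.0, bottom deciles well below.
scores = [i / 100 for i in range(100, 0, -1)]
outcomes = [1] * 20 + [0] * 80
print(decile_lift(scores, outcomes))
```

A model with no predictive value would show lift near 1.0 in every decile; strong concentration of responses in the top deciles is the visual gain a lift chart captures.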
  • Matt Thompson

    Caitlin: thank you very much for your post!


    By stating in your introduction, "...we've compiled a few," you informed readers this isn't the superset. In fact, this encourages us all to discover our way through learning more of the "art" of this science. Point 6 inspired me to code up all my pet model performance measures (and now, more) and print them out together - both efficient and objective. Another key point is how it highlights that folks building predictive models come from diverse backgrounds. We frequently call the same thing by a different name or use various go-to examples to frame things up, but at the end of the day we can all learn something from one another.

    Bottom line: the point of this post was not about the 6 as much as it highlighted inquiry, which is the nature of why we're here and why others listen to us.

    Thanks again!

  • Phillip Middleton

    I like this discussion that Caitlin has brought out. The advice is salient. But aside from the general methodological rigor she displays, the ensuing discussion has also brought out some of the questions surrounding the nature of what the world of 'Data Science' is, how it is perceived, how many people agree on what it is (or don't), and how it differs, say, from the realm of Computational and Graphical Statistics or other 'computational' sciences. And it appears that haggling over 'who does what and who has what title' still exists between those who come from the comp sci background vs. math/stats vs. multidisciplinarians (I like to use the term 'true synthesizers') with high proficiency in each area that makes up their skillset. Just for the sake of transparency, I look at DS not as a specific position but as an evolution of the convergence of multiple disciplines (math/stats/comp sci/phys/econ/etc.) to provide evidence-based 'things' (and I'm not quite sure the term requires a BLS review as previously mentioned - but that's outside the scope of this discussion).

    At the heart of this discussion are the scientific requirements, constraints, assumptions, hypotheses, and uncertainty that work into any endeavor. When I look at the above I see, in slightly different terms:

    • Review sample size requirements (along with which I would make sure I understood something about the power of a particular test one is applying to a problem). Of course, this can be a mixed bag as well, with many trade-offs between using Bayesian methods and frequentist methods to estimate adequate sample sizes from which you can make a defensible conclusion. And as one poster mentioned, yep - you can't assume automatically that random assignment to training and test samples will optimize 'similarity' between the groups you are testing a model on. This is where distributional analysis and clustering can help check those pitfalls, and IMO, are a must. The higher the variance in the data given by smaller population sizes, the less likely samples will be similar when considering all variables you are testing in the model you trained.

    • Make sure you are not committing a type III error. This applies both to the response variable (or multiple response variables) and to the choice of the population of interest. A very common mistake is to give an answer, right or wrong, to the wrong research question in the first place.

    • Start with a broad base of variables and work toward parsimony. Well, this depends on what your goal is. Keep in mind that, depending on the problem type and the application, variables may naturally 'self-attrite' (i.e., think dimension reduction methods here) as they contain redundant or overlapping information with other variables and are very unlikely to add anything more meaningful to a model.
      • The moral here is this: are you treating a model as comprising variables that are i.i.d. (this is one way to maintain scientific soundness, right?), or are you allowing for variables that may demonstrate some degree of multicollinearity or heterogeneity - and further, do you have a good reason for doing so (in general, this kind of practice can make explaining effects quite the headache, but it can boost the accuracy/precision of certain 'black box' methods)?
    • Be prepared to transform variables: various degrees of nonlinear relationships between variables may become apparent when fitting a variable to a developing model. Often this arrives as a surprise in the data exploration phase. Transformation is particularly useful when discovering and accounting for interactive effects which should be incorporated in the model (if one is modeling risk, for example, this is a compulsory exercise).

    • Model quality: As stated, comparison of multiple models can be important here. In general, we can all think of things that typify "quality", like goodness-of-fit (where appropriate - beware that some testing methods don't have closed-form solutions to give you a good answer), accuracy/precision (depends on the type of model - and the problem of 'good enough', 'as good as it gets', and just 'good'), sensitivity/specificity, deviance (as in the case of residuals in various generalized models), intuitiveness (if not a 'black box' application), and some understanding of the information retained or lost during the modeling process (think AIC, BIC, and so on). In Bayesian outputs, reasonability of probability matrices given priors - lots of ways to skin that cat.
      • However, the most important rule of model development IMO is 'soundness'. Regardless of the output, was the journey in getting there defensible, and something which could be repeatable (training/testing/validation doesn't measure this, otherwise NO model would fail)? No one can guarantee predictions, only the soundness of the road trip getting there (was the car prepared correctly for the trip, could it make another one of similar magnitude, is it still intact, did it crash along the way, etc.).

    Frankly, one of the largest problems I have seen in this arena is watching the deluge of presentations farmed out to executives with all of the lingo, code, and graphical 'bling' one could possibly drool over. Except the presenters are often not prepared to defend the soundness of the work. It's amazing how one can simply say 'It just works', which, interestingly, is sometimes enough for stakeholders to squelch any skeptical inquiry to the presenter about what is happening 'under the hood' to drive a recommendation or conclusion.

    If anything, Caitlin's discussion does indeed deserve a series, as there are plenty more landmines than could be placed in this blog alone. Well done.

  • Caitlin Garrett


    I appreciate you taking the time to draw out some of these ideas. I agree that this idea could certainly be expanded into a series of mistakes to avoid. Thanks for a great list of additions!