
As I mentioned in our post on Data Preparation Mistakes, we've built many predictive models in the Rapid Insight office. During the predictive modeling process, there are many places where it's easy to make mistakes. Luckily, we've compiled a few here so you can learn from our mistakes and avoid them in your own analyses:

Failing to consider enough variables
When deciding which variables to audition for a model, you want to include anything you have on-hand that you think could possibly be predictive. Weeding out the extra variables is something that your modeling program will do, so don’t be afraid to throw the kitchen sink at it for your first pass.
Not hand-crafting some additional variables
Any guide-list of variables should be used as just that – a guide – enriched by other variables that may be unique to your institution.  If there are few unique variables to be had, consider creating some to augment your dataset. Try adding new fields like “distance from institution” or creating riffs and derivations of variables you already have.
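As a rough illustration of a hand-crafted variable, a "distance from institution" field could be derived from latitude/longitude pairs. This is only a sketch: the field names (`lat`, `lon`, `distance_miles`) and the helper names are hypothetical, and it assumes you already have coordinates for each record.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, in miles


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def add_distance_field(records, campus_lat, campus_lon):
    """Return copies of the records with a derived 'distance_miles' field."""
    return [
        dict(r, distance_miles=haversine_miles(r["lat"], r["lon"],
                                               campus_lat, campus_lon))
        for r in records
    ]
```

The same pattern (copy the record, add one derived field) works for other riffs on existing variables, such as ratios or date differences.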
Selecting the wrong Y-variable
When building your dataset for a logistic regression model, you’ll want to select the response with the smaller number of data points as your y-variable. A great example of this from the higher ed world would come from building a retention model. In most cases, you’ll actually want to model attrition, identifying those students who are likely to leave (hopefully the smaller group!) rather than those who are likely to stay.
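One way to make this choice mechanical is to count the two outcomes and code the rarer one as 1. The sketch below assumes a two-valued outcome column; the function name and the labels in the usage note are made up for illustration.

```python
from collections import Counter


def code_rarer_outcome_as_one(outcomes):
    """Return (rare_label, y) where y is 1 for the less frequent outcome.

    If the two outcomes are exactly tied, the one seen first is used.
    """
    counts = Counter(outcomes)
    if len(counts) != 2:
        raise ValueError("expected exactly two distinct outcomes")
    rare_label = min(counts, key=counts.get)
    y = [1 if o == rare_label else 0 for o in outcomes]
    return rare_label, y
```

For a retention dataset with outcomes such as "stayed" and "left", this would typically pick "left" as the modeled response, which matches the attrition framing above.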
Not enough Y-variable responses
Along with making sure that your model population is large enough (1,000 records minimum) and spans enough time (3 years is good), you’ll want to make sure that there are enough Y-variable responses to model. Generally, you’ll want to shoot for at least 100 instances of the response you’d like to model.
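These rules of thumb are easy to check before any modeling starts. The sketch below hard-codes the thresholds mentioned above (1,000 records, 100 responses, 3 years) as defaults; the function name and the idea of passing a parallel list of years are assumptions for illustration.

```python
def check_modeling_population(y, years, min_records=1000,
                              min_events=100, min_years=3):
    """Return a list of warnings; an empty list means every check passed.

    y     -- list of 0/1 responses, where 1 is the outcome being modeled
    years -- list of the same length giving each record's year
    """
    warnings = []
    if len(y) < min_records:
        warnings.append(f"only {len(y)} records (want at least {min_records})")
    n_events = sum(1 for v in y if v == 1)
    if n_events < min_events:
        warnings.append(f"only {n_events} responses (want at least {min_events})")
    n_years = len(set(years))
    if n_years < min_years:
        warnings.append(f"only {n_years} distinct years (want at least {min_years})")
    return warnings
```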
Building a model on the wrong population
To borrow an example from the world of fundraising, a model built to predict future giving will look a lot different for someone with a giving history than someone who has never given before. Consider which population you’d eventually like to use the model to score and build the model tailored to that population, or consider building two models, one for each sub-group.
Judging the quality of a model using one measure
It’s difficult to capture the quality of a model in a single number, which is why modeling outputs provide so many model fit measures. Beyond the numbers, graphic outputs like decile analysis and lift analysis can provide visual insight into how well the model is fitting your data and what the gains from using a model are likely to be.
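For readers who have not built one by hand, a decile/lift table can be sketched as follows: sort records by model score, cut them into ten equal-sized groups, and compare each group's response rate to the overall rate. The function and key names here are illustrative, not taken from any particular modeling package.

```python
def decile_lift(scores, actuals, n_bins=10):
    """Rank records by score (highest first) and report lift per bin.

    scores  -- model scores, higher meaning more likely to respond
    actuals -- 0/1 observed responses, same order as scores
    """
    n = len(scores)
    overall_rate = sum(actuals) / n
    if overall_rate == 0:
        raise ValueError("no responses in the data, so lift is undefined")
    paired = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    rows = []
    for b in range(n_bins):
        chunk = paired[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue  # fewer records than bins
        rate = sum(a for _, a in chunk) / len(chunk)
        rows.append({"decile": b + 1, "n": len(chunk),
                     "response_rate": rate, "lift": rate / overall_rate})
    return rows
```

A lift well above 1 in the top deciles that falls off steadily toward the bottom is the usual sign that the model is ranking records usefully.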
If you’re not sure which model measures to focus on, ask around. If you know someone building models similar to yours, see which ones they rely on and what ranges they shoot for. The take-home point is that with all of the information available on a model output, you’ll want to consider multiple gauges before deciding whether your model is worth moving forward with.  
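To make "multiple gauges" concrete, here is a minimal sketch that reports several common classification measures side by side from 0/1 actuals and 0/1 predictions. The choice of measures and the function name are illustrative; your modeling program will likely report more.

```python
def summarize_fit(actuals, predicted):
    """Return several fit measures computed from a 2x2 confusion matrix."""
    pairs = list(zip(actuals, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)

    def ratio(num, den):
        # NaN rather than an exception when a measure is undefined.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),  # share of actual 1s caught
        "specificity": ratio(tn, tn + fp),  # share of actual 0s caught
        "precision": ratio(tp, tp + fp),    # share of predicted 1s that are right
    }
```

When the modeled response is rare, accuracy alone can look excellent for a model that never predicts a 1, which is one reason to read these numbers together.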




Comment by Caitlin Garrett on July 3, 2013 at 4:16am


I appreciate you taking the time to draw out some of these ideas. I agree that this idea could certainly be expanded into a series of mistakes to avoid. Thanks for a great list of additions!


Comment by Phillip Middleton on July 2, 2013 at 11:06am

I like this discussion that Caitlin has brought out. The advice is salient. But aside from the general methodical rigor she lays out, the ensuing discussion has also brought out some of the questions surrounding the nature of what the world of 'Data Science' is, how it is perceived, how many people agree on what it is (or don't), and how it differs, say, from the realm of Computational and Graphical Statistics or other 'computational' sciences. And it appears that haggling over 'who does what and who has what title' still exists between those who come from a comp sci background vs. math/stats vs. multidisciplinarians (I like to use the term 'true synthesizers') with high proficiency in each area that makes up their skillset. Just for the sake of transparency, I look at DS not as a specific position but as an evolution of the convergence of multiple disciplines (math/stats/comp sci/phys/econ/etc.) to provide evidence-based 'things' (and I'm not quite sure the term requires a BLS review as previously mentioned, but that's outside the scope of this discussion).

At the heart of this discussion are the scientific requirements, constraints, assumptions, hypotheses, and uncertainty that work into any endeavor. When I look at the above I see, in slightly different terms:

  • Review sample size requirements (along with which I would make sure I understood something about the power of a particular test one is applying to a problem). Of course, this can be a mixed bag as well, with many trade-offs between using bayesian methods and frequentist methods to estimate adequate sample sizes from which you can make a defensible conclusion. And as one poster mentioned, yep - can't assume automatically that random assignment to training and test samples will optimize 'similarity' between groups you are testing a model on. This is where distributional analysis and clustering can help check those pitfalls, and IMO, are a must. The higher the variance in the data given by smaller population sizes, the less likely samples will be similar when considering all variables you are testing in the model you trained. 

  • Make sure you are not committing a type III error. This applies both to the response variable (or multiple response variables) and to the choice of the population of interest. A very common mistake is to give an answer, right or wrong, to the wrong research question in the first place.

  • Start with a broad base of variables and work toward parsimony. Well, this depends on what your goal is.   Keep in mind that, depending on the problem type and the application, variables may naturally 'self-attrite' (i.e. think dim reduction methods here) as they contain redundant or overlapping information with other variables and are very unlikely to add anything more meaningful to a model. 
    • The moral here is this: are you treating a model as comprising variables that are i.i.d. (this is one way to maintain scientific soundness, right?), or are you allowing for variables that may demonstrate some degree of multicollinearity or heterogeneity? Further, do you have a good reason for doing so? (In general, this kind of practice can make explaining effects quite the headache, but it can boost the accuracy/precision of certain 'black box' methods.)
  • Be prepared to transform variables: various degrees of nonlinear relationships between variables may be more reasonable when fitting a variable to a developing model. Often this occurs as a surprise in the data exploration phase. This is particularly useful when discovering and accounting for interactive effects which should be incorporated in the model (if one is modeling risk for example, this is a compulsory exercise). 

  • Model quality: As stated, comparison of multiple models can be important here. In general, we can all think of things that typify "quality", like goodness-of-fit (where appropriate - beware that some testing methods don't have closed form solutions to give you a good answer), accuracy/precision (depends on the type of model - and the problem of 'good enough', 'as good as it gets', and just 'good'),  sensitivity/specificity, deviance (as in the case of residuals in various generalized models), intuitiveness (if not a 'black box' application), and some understanding of the information retained or lost during the modeling process (think AIC, BIC, and so on). In Bayesian outputs, reasonability of probability matrices given priors - lots of ways to skin that cat. 
    • However, the most important rule of model dev IMO is 'soundness' . Regardless of the output, was the journey in getting there defensible, and something which could be repeatable (training/testing/validation doesn't measure this, otherwise - NO model would fail)? No one can guarantee predictions, only the soundness of the road trip getting there (was the car prepared correctly for the trip, could it make another one of similar magnitude, is it still intact, did it crash along the way, etc.).
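As a small worked example of the information criteria mentioned under model quality, AIC and BIC can be computed directly from a fitted model's log-likelihood using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L. The sketch below assumes a binary-response model that outputs predicted probabilities strictly between 0 and 1; the function names are illustrative.

```python
import math


def bernoulli_log_likelihood(y, p):
    """Log-likelihood of 0/1 outcomes y under predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1 - pi)
               for yi, pi in zip(y, p))


def aic(log_likelihood, k):
    """Akaike information criterion; k is the number of fitted parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion; n is the number of observations."""
    return k * math.log(n) - 2 * log_likelihood
```

Lower values are preferred for both, and they are only comparable between models fitted to the same data.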

Frankly, one of the largest problems I have seen in this arena is watching the deluge of presentations farmed out to executives with all of the lingo, code, and graphical 'bling' one could possibly drool over. Except the presenters are often not prepared to defend the soundness of the work. It's amazing how one can simply say 'It just works', which, interestingly, is sometimes enough for stakeholders to drop any skeptical inquiry into what is happening 'under the hood' to drive a recommendation or conclusion.

If anything, Caitlin's discussion does indeed deserve a series as there are plenty more landmines than what could be placed in this blog alone.  Well done.  

Comment by Matt Thompson on June 20, 2013 at 6:12am

Caitlin: thank you very much for your post!


By stating in your introduction, "...we've compiled a few," you informed readers this isn't the Superset. In fact, this encourages us all to discover our way through learning more of the "art" of this science. Point 6 inspired me to code up all my pet model performance measures (and now, more) and print them out together - both efficient and objective. Another key point is how it highlights that folks building predictive models come from diverse backgrounds. We frequently call the same thing by a different name or use various go-to examples to frame things up, but at the end of the day we can all learn something from one another.

Bottom line: the point of this post was not about the 6 as much as it highlighted inquiry, which is the nature of why we're here and why others listen to us.

Thanks again!

Comment by Cristian Vava on June 11, 2013 at 8:24am

@Ralph, @Caitlin, interesting exchange of ideas, and clear proof of the diversity of applications and perspectives data science encompasses these days. I’m tempted to say that each of you sounds like a clear representative of a distinct scientific discovery method.


Using Peter Medawar’s taxonomy we can see that:

Caitlin embraces the Baconian method, concerned with discovering patterns and building structure starting from near zero prior knowledge. She likes to try things out and she's not afraid of mistakes.

Ralph represents the Aristotelian method: some solid domain knowledge is available, and the data scientist’s task is to solidify the intuition and formalize the knowledge. He needs precision and efficiency.


You are both correct and readers of this post could learn a lot from your interventions since you have covered two essential stages of knowledge discovery. I hope we’ll also see interventions from representatives of the Kantian and Galilean methods.


Comment by Lynne Mysliwiec on June 8, 2013 at 9:15am

@Ralph. But you didn't add, you subtracted.  If you wished to add a #7 of your own - "Beware overfitting" with some guidelines as to how to diagnose the pitfall and avoid it, you'd be adding to the discussion & not trolling.  By treating Caitlin as if she made a mistake & that everyone else is wrong and therefore are "modelers" and not "data scientists" as defined by Ralph, this is where you err & why I am bothering to call you out.

Comment by Ralph Winters on June 8, 2013 at 9:01am


I think I did add to the discussion via my initial comment.  If I recall, it was the only comment at that time. My aim in all of this is to provide open dialogue regarding any topic in which I can share my knowledge and expertise. If a post titled "6 predictive modeling mistakes" contains what I feel is a major mistake, I will point that out. EVERYONE benefits from open discussion.  We do not want these blog posts to be one sided, where the blogger does all the talking and we just listen, do we?

Check out the name of this forum. It is called "Data Science Central". I believe that includes a lot more than JUST the predictive modeling part, and THAT is what I am trying to encourage.

Not questioning Caitlin's competency, and I apologize if I hurt her feelings in any way.


Comment by Lynne Mysliwiec on June 8, 2013 at 9:01am

@Caitlin - I think your list of six common pitfalls is a great one - I apologize if, by adding 2 of my favorite pitfalls I made you feel as if you didn't go far enough, because I think you were right on the money -- I look forward to seeing more of you out here in the discussion group.

Comment by Lynne Mysliwiec on June 8, 2013 at 8:25am

@Ralph... whatever, dude...the article is about common pitfalls in predictive modeling...can you add to the discussion or not? Caitlin doesn't deserve this and none of us need the lesson in data science.  My assumption is that Caitlin is a competent professional and practitioner in data science whose only desire is to provide useful information. My assumption is that you are a competent practitioner as well. You fired the opening shot with your "only a modeler" comments, but thanks for verifying that you're trolling.

Comment by Ralph Winters on June 8, 2013 at 7:48am

Data scientists rely heavily on domain knowledge, so by the time the predictive model is to be built, no kitchen sink approach is necessary. Perhaps the data science team has been performing a set of multidimensional visualizations to examine some of the intricate relationships among the variables. Think multidisciplinary, multi tool set. This is the point at which the initial variables of interest could be identified and placed in a predictive model, rather than starting with all the variables you could think of and then essentially dropping variables out of the model. How would you do this with the business user? Would you rely on just the pure stats? This approach is fine if you are purely in the predictive modeling domain, but I think that the process of data science can add a lot more value.

Comment by Lynne Mysliwiec on June 8, 2013 at 6:48am

@Ralph - Really?  Not sure what you're getting at with your reference to data science. Either you build models properly or you don't -- what you call yourself is immaterial. If you're saying that the business and math training of your "data scientists" is sub par, then maybe they should call themselves database administrators instead of data scientists.
