Summary:  Exceptions sometimes make the best rules.  Here’s an example of well-accepted variable reduction techniques resulting in an inferior model, and a case for dramatically expanding the number of variables we start with.


One of the things that keeps us data scientists on our toes is that the well-established rules of thumb don’t always work.  Certainly one of the most well-worn of these rules is the parsimonious model: always seek to create the best model with the fewest variables.  And woe to you who violate this rule.  Your model will overfit, include false random correlations, or at the very least be judged slow and clunky.

Certainly this is a rule I embrace when building models, so I was surprised and then delighted to find a well-conducted study by Lexis/Nexis that lays out a case where this clearly isn’t true.


A Little Background

In highly regulated industries like insurance and lending, both the variables allowed in a model and the modeling techniques are tightly restricted.  Techniques are generally limited to those that are highly explainable, mostly GLMs and simple decision trees.  Data can’t include anything that is overtly discriminatory under the law so, for example, race, sex, and age can’t be used, or at least not directly.  All of this works against model accuracy.

Traditionally what agencies could use to build risk models has been defined as ‘traditional data’, that which the consumer has submitted with their application and the data that can be added from the major credit rating agencies.  In this last case Experian and the others offer some 250 different variables and except for those that are specifically excluded by law, this seems like a pretty good sized inventory of predictive features.

But in the US and especially abroad the market contains many ‘thin-file’ or ‘no-file’ consumers who would like to borrow but for whom traditional data sources simply don’t exist.  Millennials feature in this group because their cohort is young and doesn’t yet have much borrowing or credit history.  But also in this group are the folks judged to be marginal credit risks, some of whom could be good customers if only we knew how to judge the risk.


Enter the World of Alternative Data

‘Alternative data’ is considered to be any data not directly related to the consumer’s credit behavior, basically anything other than the application data and consumer credit bureau data.  A variety of agencies are prepared to provide it and it can include:

  1. Transaction data (e.g. checking account data)
  2. Telecom/utility/rent data
  3. Social profile data
  4. Social network data
  5. Clickstream data
  6. Audio and text data
  7. Survey data
  8. Mobile app data

As it turns out, lenders have been embracing alternative data for the last several years and have seen real improvements in their credit models, particularly at the low end of the scores.  Even the CFPB has provisionally endorsed this as a way to bring credit to the underserved.


From a Data Science Perspective

From a data science perspective, in this example we started out with on the order of 250 candidate features from ‘traditional data’, and now, using ‘alternative data’ we can add an additional 1,050 features.  What’s the first thing you do when you have 1,300 candidate variables?  You go through the steps necessary to identify only the most predictive variables and discard the rest.


Here’s Where It Gets Interesting

Lexis/Nexis, the provider of the alternative data, set out to demonstrate that a credit model built on all 1,300 features was superior to one built on only 250 traditional features.  The data was drawn from a full-file auto lending portfolio of just under 11 million instances.  You and I might have concluded that even 250 was too many, but in order to keep the test rigorous they introduced these constraints:

  1. The technique was limited to forward stepwise logistic regression. This provided clear univariate feedback on the importance of each variable.
  2. Only two models would be compared, one with the top 250 most predictive attributes and the other with all 1,300 attributes. This eliminated any bias from variable selection that might be introduced by the modeler.
  3. The variables for the 250-variable model were selected by ranking predictive power, measured as each variable’s correlation with the dependent variable. As it happened, all of the alternative-data variables fell outside the top 250, with the highest ranking 296th.
  4. The models were created with the same overall data prep procedures such as binning rules.
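
The ranking step in constraint 3 can be sketched roughly as follows.  This is a minimal illustration, not the study's actual procedure: it assumes plain Pearson correlation as the measure of predictive power, and the feature names and toy values are made up for the example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)

def rank_univariate(features, target):
    """Feature names sorted by |correlation with target|, strongest first."""
    strength = {name: abs(pearson(col, target)) for name, col in features.items()}
    return sorted(strength, key=strength.get, reverse=True)

def top_k(features, target, k):
    """Keep only the k features with the strongest univariate correlation."""
    return rank_univariate(features, target)[:k]

# Hypothetical toy data: 1 = charged off, 0 = paid as agreed.
charged_off = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "utilization":     [0.1, 0.2, 0.3, 0.2, 0.7, 0.8, 0.9, 0.6],
    "inquiries":       [0, 1, 0, 2, 1, 3, 2, 3],
    "address_changes": [1, 0, 1, 0, 1, 0, 1, 0],
}

selected = top_k(features, charged_off, 2)
print(selected)
```

A filter like this looks at each variable in isolation, so anything that is only useful in combination with other variables is discarded before the model ever sees it.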


What Happened

As you might expect, the first and most important variable was the same for both models, but the models began to diverge at the second variable.  The second variable in the 1,300 model was actually ranked 296th in the earlier predictive power analysis.
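
One way to see how a variable ranked low on its own can enter a stepwise model early is the toy sketch below.  It is an assumption-laden stand-in, not the study's method: it uses least-squares forward selection on a continuous outcome rather than logistic regression, and the three features (`bureau_main`, `bureau_redundant`, `alt_signal`) are hypothetical, built from orthogonal patterns so the correlations are exact.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def corr(a, b):
    a, b = center(a), center(b)
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom > 1e-12 else 0.0

def residualize(v, chosen):
    """Remove from v the part explained by an already-chosen, centered variable."""
    norm = dot(chosen, chosen)
    if norm <= 1e-12:
        return list(v)
    scale = dot(v, chosen) / norm
    return [x - scale * c for x, c in zip(v, chosen)]

def univariate_rank(features, target):
    """Order features by |correlation with the target|, each taken alone."""
    return sorted(features, key=lambda n: abs(corr(features[n], target)), reverse=True)

def forward_stepwise(features, target):
    """At each step add the feature that best explains what is still unexplained."""
    resid = center(target)
    remaining = {n: center(col) for n, col in features.items()}
    order = []
    while remaining:
        best = max(remaining, key=lambda n: abs(corr(remaining[n], resid)))
        chosen = remaining.pop(best)
        order.append(best)
        resid = residualize(resid, chosen)
        remaining = {n: residualize(col, chosen) for n, col in remaining.items()}
    return order

# Four mutually orthogonal +/-1 patterns keep the toy correlations exact.
h1 = [1, -1, 1, -1, 1, -1, 1, -1]
h2 = [1, 1, -1, -1, 1, 1, -1, -1]
h3 = [1, -1, -1, 1, 1, -1, -1, 1]
h4 = [1, 1, 1, 1, -1, -1, -1, -1]

features = {
    "bureau_main": h1,
    "bureau_redundant": [a + 0.5 * d for a, d in zip(h1, h4)],  # mostly repeats bureau_main
    "alt_signal": h2,                                           # weak alone, but independent
}
outcome = [a + 0.5 * b + 0.3 * c for a, b, c in zip(h1, h2, h3)]

univariate_order = univariate_rank(features, outcome)
stepwise_order = forward_stepwise(features, outcome)
print(univariate_order)
print(stepwise_order)
```

Here `alt_signal` ranks last on univariate correlation but is picked second by the stepwise procedure, because `bureau_redundant` adds nothing once `bureau_main` is already in the model.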

When the model was completed, the alternative data accounted for 25% of the model’s accuracy, even though none of it would have been included had selection stopped at the top 250 predictive variables.

The KS (Kolmogorov-Smirnov) statistic was 4.3% better for the 1,300 model compared to the 250 model.
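
For readers less familiar with KS in credit scoring, here is a minimal sketch, assuming the usual sense of the statistic: the largest gap between the cumulative share of bad accounts and the cumulative share of good accounts as you move up the score range.  The scores and outcomes are invented for illustration.

```python
def ks_statistic(scores, is_bad):
    """Largest gap between cumulative shares of bads and goods, scanning scores low to high."""
    n_bad = sum(is_bad)
    n_good = len(is_bad) - n_bad
    pairs = sorted(zip(scores, is_bad))
    cum_bad = cum_good = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        j = i
        # Accounts with the same score are counted together before measuring the gap.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        best = max(best, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return best

# Hypothetical toy portfolio: 1 = charged off, 0 = paid as agreed.
scores = [300, 350, 400, 450, 500, 550, 600, 650, 700, 750]
is_bad = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]

ks = ks_statistic(scores, is_bad)
print(round(ks, 4))
```

A higher KS means the scores separate good and bad accounts more cleanly, so comparing the two models comes down to computing this on the same holdout accounts with each model's scores.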


The Business Importance

The distribution of scores and charge offs for each model was very similar, but in the bottom 5% of scores things changed.  There was a 6.4% increase in the number of predicted charge offs in this bottom group.

Since the distributions are essentially the same, this can be seen as applicants whose higher scores might have rated them creditworthy migrating into the lowest categories of creditworthiness, allowing better decisions about denial or risk-based pricing.  Conversely, it appears that some of the lowest-rated borrowers were given a boost by the additional data.

That also translates to a competitive advantage for those using the alternative data compared to those who don’t.  You can see the original study here.
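
The bottom-of-the-distribution comparison described above can be sketched as a simple count.  This is only an illustration with made-up accounts and scores; it uses the bottom 10% rather than 5% purely to keep the toy portfolio small, and the names `scores_250` and `scores_1300` are hypothetical.

```python
def chargeoffs_in_bottom(scores, charged_off, percent=5):
    """Count charge-offs among the lowest-scoring `percent` of accounts."""
    n_bottom = max(1, len(scores) * percent // 100)
    ranked = sorted(zip(scores, charged_off))  # lowest scores first
    return sum(flag for _, flag in ranked[:n_bottom])

# Hypothetical portfolio of 20 accounts: 1 = charged off.
charged_off = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores_250 = [500, 505, 560, 520, 530, 540, 600, 550, 570, 580,
              590, 610, 620, 630, 640, 650, 660, 670, 680, 690]
scores_1300 = [495, 525, 502, 520, 530, 540, 600, 550, 570, 580,
               590, 610, 620, 630, 640, 650, 660, 670, 680, 690]

base = chargeoffs_in_bottom(scores_250, charged_off, percent=10)
expanded = chargeoffs_in_bottom(scores_1300, charged_off, percent=10)
relative_change = (expanded - base) / base
print(base, expanded, relative_change)
```

The overall score distribution is nearly identical between the two toy models; what changes is which accounts land in the lowest band, which is the effect the study reports.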


There Are Four Lessons for Data Scientists Here

  1. Think outside the box and consider the value of a large number of variables when first developing or refining your model. It wasn’t until just a few years ago that the insurance industry started looking at alternative data and on the margin it has increased accuracy in important ways.  FICO published this chart showing the relative value of each category of alternative data strongly supporting using more variables.


  2. Be careful about using ‘tried and true’ variable selection techniques. In the Lexis/Nexis case starting the modeling process with variable selection based on univariate correlation with the dependent variable was misleading.  There are a variety of other techniques they could have tried.
  3. Depending on the amount of prep, it still may not be worthwhile expanding your variables so dramatically. More data always means more prep, which means more time, which in a commercial environment you may not have.  Still, be open to exploration.
  4. Adding ‘alternate source’ data to your decision making can be a two-edged sword. In India, measures as obscure as how often a user charges his cell phone or its average charge level have proven to be predictive.  In that credit-starved environment these innovative measures are welcomed when they provide greater access to credit.

On the other hand, just this week a major newspaper in England published an exposé of comparative auto insurance rates in which it discovered that individuals applying with a Hotmail account were paying as much as 7% more than those with Gmail accounts.  Apparently British insurers had found a legitimate correlation between risk and this alternative data.  It did not sit well with the public and the companies are now on the defensive.



About the author:  Bill Vorhies is Editorial Director for Data Science Central and has practiced as a data scientist since 2001.


Tags: FICO, Lexis/Nexis, alternative data, feature reduction, parsimonious, parsimony, variable reduction



Comment by Gerard Crispie on February 2, 2018 at 2:13am

Good article that makes the point that it is well worth considering 'new' and alternative data sources.  However it does not (as Syed has also mentioned) say much about the effort involved in testing so many variables and also (and importantly) the effort involved in building and incorporating these source feeds. Thanks, Gerard

Comment by charles d huett on February 1, 2018 at 12:19pm

counterintuitive. nonetheless, the results demand a rethink about dimension reduction.

Comment by William Vorhies on February 1, 2018 at 11:28am

Ashley:  I don't know enough about genetic research to comment and I expect there are some unique requirements.  There are some interesting things going on with deep learning which make CNNs very sensitive to anomalies which I suspect is similar to your situation.  You might also look into genetic programs (not related to your field at all) but able to handle a very large number of variables and identify those most important without prior steps to reduce variables.

Syed:  The caution I offer is mostly economic.  The time cost of examining 1300 variables is much greater than 250 variables.  You'll need to take manpower cost of model development into consideration when considering, in this case, that they got about a 5% improvement at the bottom end of their model.

Comment by Ashley Silver on February 1, 2018 at 6:16am

Would this be a similar problem addressed by genetic researchers wanting to determine impact of certain genes/alleles and their impact on our health/characteristics?  If so, the problem might be overwhelming to determine a complete set of variables that impact a given outcome?  In that case, where there are potentially millions of variables, how would you approach variable selection?  Would it make sense to prime the search?

Comment by SYED ADEEL HUSSAIN on February 1, 2018 at 3:39am

The article completely contradicts itself at the end. First, the author wants you to add as many explanatory random variables as possible to the regression model (or any other model) and later wants the reader to exercise caution!

Data Mining Methods can significantly increase specification and estimation risks because using improper variables having spurious correlations can distort the reality. Random Correlations alone tell you nothing about statistical significance. At times models suffer from Endogeneity and other problems which have not been discussed.

Comment by Ian Lo on January 31, 2018 at 3:27pm

Thanks for sharing the article! I am new to data science and would like to kindly seek clarification on the following:

What are the AIC scores for the model? More often than not, more variables would tend to result in a model that overfits - so why in this case is the situation different? I read somewhere that for the K-S test, 

  1. It only applies to continuous distributions.
  2. It tends to be more sensitive near the center of the distribution than at the tails.
  3. Perhaps the most serious limitation is that the distribution must be fully specified. That is, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid. It typically must be determined by simulation.


Would evaluating the model using another goodness-of-fit measure (AIC / -2LL) or even the typical precision / recall scoring based on hold-out data be more relevant to the business scenario? Because this is applied to credit scoring shouldn't the goodness of a model be based on cost / benefit analysis?

Finally, if they are using logistic regression, I am assuming that they are classifying the applicants into the individual break levels - so a multi-class AUC score would also be interesting to see if overfitting occurs.

Apologies for the very basic / noob questions - I would like to learn as much as possible from this.



Comment by Patrick Cardiff on January 31, 2018 at 10:24am

William Vorhies, Thanks!

Credit is one of those black box issues I never really understood. Your piece made the problem accessible, and I'm going to try to meld it into my economics of inequality. You got me going on how to approach the "embarrassment of information" when it comes to model refinement, as well as the overall context of usability. And there's not even problems with strengthening the coefficients in hedonic modeling. So I'm saying I appreciate you!

Pat Cardiff

Comment by Vincent Granville on January 30, 2018 at 12:08pm

Very interesting!
