Multilevel Modelling of U.S. Home Loan Data

The housing market has changed considerably over the past decade, with more stringent lending criteria now being enforced.

A key objective of financial institutions is to minimise the risk of mortgage lending by ensuring that the debtor is ultimately able to repay the loan.

In this example, multilevel modelling techniques are used to analyse data from the Federal Home Loan Bank System to determine the main influencing factors on loan-to-value ratios (LTVs) across the United States.

As a disclaimer – the below is not intended as any form of financial advice and is merely used as an example to illustrate the workings of the relevant models.

Feature Selection – ExtraTreesClassifier

The dataset in question is 55990 rows × 82 columns, and the first task is to conduct feature selection to ensure that the most relevant explanatory variables are selected for further analysis.

To this end, the ExtraTreesClassifier is run in Python – the script is available in the GitHub repository.
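As a rough illustration of this step (not the script from the repository), the sketch below ranks features by the impurity-based importance scores that scikit-learn's ExtraTreesClassifier exposes. The data is synthetic and every setting shown (sample size, number of trees, random seed) is an assumption made for illustration. The source does not say how the target was encoded, so a generic class label stands in for it here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption): 500 rows, 8 features,
# of which only the first 3 carry signal about the class label.
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,      # keep the informative features in columns 0-2
    random_state=0,
)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One impurity-based score per column; the scores sum to 1.
importances = model.feature_importances_
ranking = np.argsort(importances)[::-1]   # column indices, highest score first

for idx in ranking:
    print(f"x{idx}: {importances[idx]:.4f}")
```

The columns at the top of the printed ranking would be the candidates carried forward into the regression stage.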

The ExtraTreesClassifier identified Income (x9), UPB or Unpaid Principal Balance (x12), Loan Amount (x50), Back-End Ratio (x52), and PMI or percent of original loan balance covered by mortgage insurance (x55) as the variables with the highest scores.

Linear Regression

A linear regression was then run in R to analyse the impact of the above variables on LTV.

Because the variance inflation factor (VIF) detected multicollinearity for PMI and Back-End Ratio, these two variables were dropped from the model.

Given that LTV is expressed in percentage format, the logs of Income and UPB were used as the independent variables in the regression model.

The following results were generated:

> # Linear Regression
 > reg1 <- lm(LTV ~ log(Income) + log(UPB), train)
 > summary(reg1)
 
 Call:
 lm(formula = LTV ~ log(Income) + log(UPB), data = train)
 
 Residuals:
      Min       1Q   Median       3Q      Max 
 -0.78943 -0.06494  0.02240  0.08982  0.42595 
 
 Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
 (Intercept)  0.916374   0.015363   59.65   <2e-16 ***
 log(Income) -0.061727   0.001654  -37.32   <2e-16 ***
 log(UPB)     0.045761   0.001569   29.17   <2e-16 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
 Residual standard error: 0.1536 on 44789 degrees of freedom
 Multiple R-squared:  0.03078,    Adjusted R-squared:  0.03074 
 F-statistic: 711.3 on 2 and 44789 DF,  p-value: < 2.2e-16

From the above, the selected variables are highly significant at the 5% level. However, the R-squared of roughly 3% (0.03078) is very low. This indicates that while these variables matter in explaining the variation in LTV, either some important variables are missing or LTV itself is subject to a high degree of random variation.
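To make the regression output more concrete, the short sketch below works through two pieces of arithmetic using only the figures printed by summary(reg1). Because both predictors are logged, a coefficient multiplied by log(2) gives the change in LTV associated with a doubling of that predictor. The R-squared can also be recovered from the F-statistic as a consistency check. The helper name effect_of_scaling is made up for this illustration.

```python
import math

# Values copied from the summary(reg1) output.
coef_log_income = -0.061727
coef_log_upb = 0.045761
f_stat = 711.3
df_model = 2
df_resid = 44789


def effect_of_scaling(coef, factor):
    """Change in LTV when a logged predictor is multiplied by `factor`."""
    return coef * math.log(factor)


# R-squared implied by the F-statistic: F*k / (F*k + residual df).
r_squared = (f_stat * df_model) / (f_stat * df_model + df_resid)

income_doubling = effect_of_scaling(coef_log_income, 2)
upb_doubling = effect_of_scaling(coef_log_upb, 2)

print(f"Doubling Income changes LTV by {income_doubling:+.4f}")
print(f"Doubling UPB changes LTV by {upb_doubling:+.4f}")
print(f"R-squared implied by F-statistic: {r_squared:.4f}")
```

On these figures, a doubling of income is associated with a fall in LTV of about 0.04, holding UPB fixed, and the implied R-squared agrees with the Multiple R-squared line in the output.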

Given that this is the case, the eventual multilevel modelling will not be used for predicting LTV per se – rather to analyse whether significant differences exist across this variable by state.

The Breusch-Pagan test for heteroscedasticity was run, and the VIF test for multicollinearity was run once again:


> library(lmtest)
 > bptest(reg1)
 
     studentized Breusch-Pagan test
 
 data:  reg1
 BP = 1229.1, df = 2, p-value < 2.2e-16
 
 > library(car)
 > vif(reg1)
 log(Income)    log(UPB) 
    1.828859    1.828859

From the above, VIF values below 5 suggest that multicollinearity is not a concern in the model. However, a p-value below 0.05 for the Breusch-Pagan test indicates that heteroscedasticity is present. This is an issue across this dataset, given that states with very different income levels are represented. As a result, states with higher incomes and higher house prices are given more weight by the model, skewing the results.
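The two diagnostics can be unpacked with a little arithmetic, sketched below using only the numbers printed above. With exactly two predictors, the VIF equals 1 / (1 - r²), where r is the correlation between them, so the reported VIF implies how strongly log(Income) and log(UPB) are correlated. The studentized Breusch-Pagan statistic equals n times the R-squared of an auxiliary regression of the squared residuals on the predictors. The observation count is taken as the residual degrees of freedom plus the three estimated coefficients.

```python
import math

vif = 1.828859      # from vif(reg1)
bp_stat = 1229.1    # from bptest(reg1)
n_obs = 44789 + 3   # residual df plus three estimated coefficients

# VIF = 1 / (1 - R^2_j); with two predictors R^2_j is their squared correlation.
r2_between_predictors = 1 - 1 / vif
abs_corr = math.sqrt(r2_between_predictors)

# Studentized BP statistic = n * R^2 of the auxiliary regression.
aux_r2 = bp_stat / n_obs

# For 2 degrees of freedom the chi-square upper tail is exp(-x / 2).
p_value = math.exp(-bp_stat / 2)

print(f"|corr(log(Income), log(UPB))| = {abs_corr:.3f}")
print(f"Auxiliary-regression R-squared = {aux_r2:.4f}")
print(f"BP p-value below 2.2e-16: {p_value < 2.2e-16}")
```

The implied correlation of about 0.67 is sizeable but not severe, which fits a VIF well under 5. The auxiliary R-squared is small in absolute terms, yet with nearly 45,000 observations it is more than enough to reject homoscedasticity.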

Multilevel Modelling with lmer

To mitigate this issue, LTV variation is instead analysed by state, grouping on the FIPSStateCode variable in the dataset.

Random effects are tested first. It is hypothesised that the effects of Income, UPB and Loan Amount vary across states, so a model with a random intercept and a random slope is fitted for each of these variables.

> # Random effects
 > r1 <- lmer(LTV ~ 1 + (1 + log(Income) |FIPSStateCode), data = train, REML=FALSE)
 Warning message:
 In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
   Model failed to converge with max|grad| = 0.0179241 (tol = 0.002, component 1)
 > r2 <- lmer(LTV ~ 1 + (1 + log(UPB) |FIPSStateCode), data = train, REML=FALSE)
 > r3 <- lmer(LTV ~ 1 + (1 + log(Amount) |FIPSStateCode), data = train, REML=FALSE)
 Warning message:
 In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
   Model failed to converge with max|grad| = 0.0318744 (tol = 0.002, component 1)
 > anova(r1,r2,r3)
 Data: train
 Models:
 r1: LTV ~ 1 + (1 + log(Income) | FIPSStateCode)
 r2: LTV ~ 1 + (1 + log(UPB) | FIPSStateCode)
 r3: LTV ~ 1 + (1 + log(Amount) | FIPSStateCode)
    Df    AIC    BIC logLik deviance   Chisq Chi Df Pr(>Chisq)    
 r1  5 -43118 -43075  21564   -43128                              
 r2  5 -43906 -43862  21958   -43916 787.748      0  < 2.2e-16 ***
 r3  5 -43929 -43886  21970   -43939  23.235      0  < 2.2e-16 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

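For models r1 and r3, a “Model failed to converge” warning was returned. Moreover, the anova() comparison shows that the improvement in fit from r2 to r3 (AIC of -43906 versus -43929) is small relative to the gain from r1 to r2. Note that all three models have the same number of parameters, so the Chi Df is 0 and the comparison rests on AIC, BIC and log-likelihood rather than on the reported p-values. Therefore r2, the only model that converged without a warning, is used to represent the random effects.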

Next, an attempt is made to add Income as a fixed effect on LTV.

> # Fixed effects
 > m1=lmer(LTV ~ log(Income) + (1 + log(UPB) |FIPSStateCode), data = train, REML = FALSE)
 Warning message:
 In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
   Model failed to converge with max|grad| = 0.264243 (tol = 0.002, component 1)
 > summary(m1)
 Linear mixed model fit by maximum likelihood  ['lmerMod']
 Formula: LTV ~ log(Income) + (1 + log(UPB) | FIPSStateCode)
    Data: train
 
      AIC      BIC   logLik deviance df.resid 
 -45886.6 -45834.4  22949.3 -45898.6    44786 
 
 Scaled residuals: 
     Min      1Q  Median      3Q     Max 
 -5.5590 -0.4516  0.1039  0.6687  3.1559 
 
 Random effects:
  Groups        Name        Variance Std.Dev. Corr 
  FIPSStateCode (Intercept) 1.395926 1.1815        
                log(UPB)    0.008612 0.0928   -1.00
  Residual                  0.020841 0.1444        
 Number of obs: 44792, groups:  FIPSStateCode, 53
 
 Fixed effects:
              Estimate Std. Error t value
 (Intercept)  1.642752   0.019925   82.45
 log(Income) -0.071496   0.001586  -45.09
 
 Correlation of Fixed Effects:
             (Intr)
 log(Income) -0.935
 convergence code: 0
 Model failed to converge with max|grad| = 0.264243 (tol = 0.002, component 1)

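However, a “Model failed to converge” warning is obtained once again, and the estimated correlation of -1.00 between the random intercept and slope points to a degenerate fit. Consequently, the random-effects model fitted earlier (r2) is used for the remainder of the analysis.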

Here are the summary coefficients for the model:

> summary(r2)$coefficients
              Estimate  Std. Error  t value
 (Intercept) 0.8032787 0.007135987 112.5673

Here is a breakdown of the intercept and slope across states:

> coef(r2)
 $FIPSStateCode
        log(UPB)  (Intercept)
 1  -0.048725940  1.424839137
 2   0.071810557 -0.129437630
 4   0.048194732  0.144809114
 5   0.009605831  0.686029830
 6   0.079868545 -0.373252028
 8   0.066271417 -0.058015008
 9   0.002805582  0.825132568
 10  0.041174114  0.313942756
 11 -0.021533357  1.059306143
 12  0.048449155  0.193163918
 13  0.029536299  0.427508846
 15  0.033478378  0.274425498
 16  0.083027128 -0.230813954
 17 -0.037538462  1.267887084
 18  0.062788077  0.002831210
 19  0.012487451  0.632244118
 20  0.042927859  0.273238843
 21  0.084679686 -0.278537734
 22  0.013490445  0.638005090
 23  0.196958251 -1.714481216
 24 -0.064853734  1.666322971
 25  0.092872449 -0.434507155
 26  0.063390995 -0.008880011
 27  0.010592391  0.649027798
 28 -0.021592652  1.134230484
 29  0.046409632  0.220361661
 30 -0.015760240  1.003635842
 31  0.030503923  0.412809350
 32  0.128775010 -0.842500079
 33  0.042681993  0.239718731
 34  0.079823518 -0.273370374
 35  0.064508955 -0.037793158
 36 -0.026011440  1.070115946
 37 -0.017808095  1.028135790
 38 -0.028361889  1.180258974
 39  0.059861306  0.032634645
 40  0.084900070 -0.235779484
 41  0.134953722 -0.918560447
 42  0.009586766  0.690380057
 44  0.018198628  0.597527446
 45 -0.084074305  1.892673690
 46  0.024345909  0.479053971
 47  0.090823001 -0.344979523
 48 -0.003117153  0.856093955
 49  0.084008316 -0.283965571
 50  0.084902254 -0.165635531
 51 -0.006251239  0.886143183
 53  0.041150673  0.227720105
 54  0.019188765  0.541915533
 55  0.005422513  0.753196009
 56  0.085633968 -0.296857819
 66  0.118784559 -0.560868823
 72 -0.055580831  1.595271936
 
 attr(,"class")
 [1] "coef.mer"

The data is then aggregated by state (i.e. the average UPB, LTV, Income, and Loan Amount are calculated), and the following data frame is generated along with the random slope and intercept for each state:

When analysing the data frame as a whole, it is quite interesting to observe that there is generally an inverse relationship between the intercept and the slope, i.e. the lower a state's intercept, the higher its slope on log(UPB), and vice versa.
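The inverse relationship can be checked directly on the coefficients printed by coef(r2). The sketch below copies the slope and intercept for six FIPS state codes from that output, computes their Pearson correlation, and evaluates each state's fitted line, LTV = intercept + slope × log(UPB), at a single loan size. The choice of six states and the UPB of $200,000 are arbitrary assumptions for illustration, and the helper names are made up.

```python
import math

# FIPS code -> (slope on log(UPB), intercept), copied from the coef(r2) output.
state_coefs = {
    1: (-0.048725940, 1.424839137),
    6: (0.079868545, -0.373252028),
    23: (0.196958251, -1.714481216),
    24: (-0.064853734, 1.666322971),
    45: (-0.084074305, 1.892673690),
    48: (-0.003117153, 0.856093955),
}


def predicted_ltv(fips, upb):
    """Fitted LTV for one state at a given unpaid principal balance."""
    slope, intercept = state_coefs[fips]
    return intercept + slope * math.log(upb)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


slopes = [s for s, _ in state_coefs.values()]
intercepts = [i for _, i in state_coefs.values()]
corr = pearson(slopes, intercepts)
print(f"Correlation between slope and intercept: {corr:.3f}")

example_upb = 200_000  # hypothetical loan size
for fips in state_coefs:
    print(f"FIPS {fips}: fitted LTV at ${example_upb:,} = "
          f"{predicted_ltv(fips, example_upb):.3f}")
```

Although the intercepts range from well below zero to well above one, the fitted LTVs at a realistic loan size are far closer together, because a low intercept is offset by a steep slope.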

Here are some visualization plots:

UPB vs LTV

LTV vs Income

UPB vs Income

Key Findings

From the plots, we can see that:

There is a negative correlation between UPB and LTV, i.e. as the average LTV falls, the average UPB rises.
There is a negative correlation between LTV and Income, i.e. as the average income rises, average LTV falls.
There is a strong positive correlation between UPB and Income, i.e. as the average income rises, average UPB rises.

These findings are interesting in that they illustrate that higher incomes are not necessarily the main predictor of risk when extending home loans. In states with higher incomes, the Unpaid Principal Balance rises quite significantly, reflecting a combination of 1) higher house prices in that state, and 2) the propensity of higher-income earners to purchase homes priced significantly higher than those bought by lower-income earners.

Therefore, while states with higher incomes had slightly lower LTVs, the effect was not overly pronounced.

One possible implication is that an institution can set the parameters of its lending criteria according to how conservative it wishes to be.

For instance, an institution could choose to focus its marketing efforts on states with a UPB of less than $300,000 and an LTV of lower than 0.75.

Or, if the institution wanted to take on some risk in terms of loan size, another set of criteria could be a UPB of up to $600,000 but an LTV lower than 0.70 to compensate for the added risk.
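A screen of this kind is straightforward to express in code. The sketch below applies the two sets of thresholds mentioned above to a list of state-level averages. The records are hypothetical placeholders with made-up codes and values, not figures from the dataset, and the class and function names are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class StateAverages:
    code: int        # placeholder identifier, not a real FIPS code
    avg_upb: float   # average unpaid principal balance in dollars
    avg_ltv: float   # average loan-to-value ratio


def meets_criteria(state, max_upb, max_ltv):
    """True when both averages fall strictly below the chosen limits."""
    return state.avg_upb < max_upb and state.avg_ltv < max_ltv


# Hypothetical placeholder values, NOT taken from the dataset.
states = [
    StateAverages(code=901, avg_upb=180_000, avg_ltv=0.72),
    StateAverages(code=902, avg_upb=450_000, avg_ltv=0.68),
    StateAverages(code=903, avg_upb=250_000, avg_ltv=0.80),
    StateAverages(code=904, avg_upb=280_000, avg_ltv=0.65),
]

conservative = [s.code for s in states if meets_criteria(s, 300_000, 0.75)]
larger_loans = [s.code for s in states if meets_criteria(s, 600_000, 0.70)]

print("UPB < $300,000 and LTV < 0.75:", conservative)
print("UPB < $600,000 and LTV < 0.70:", larger_loans)
```

Raising the UPB ceiling while tightening the LTV ceiling changes which states qualify, which is the trade-off between loan size and leverage described above.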

Conclusion

In this example, we have seen how multilevel modelling is used to conduct regression analysis across different segments in a dataset, and specifically how this has been applied in assessing mortgage risk. The original post and GitHub repository can be found here.