Loan Prediction – Using PCA and Naive Bayes Classification with R

Overview

Bank loans carry risk for both the bank and the borrower. Analyzing that risk requires understanding both the kind of risk involved and its level. Banks therefore need to assess their customers' loan eligibility so that they can target eligible customers specifically.

Banks want to automate the loan eligibility process in real time, based on customer details such as gender, marital status, age, occupation, income, debts, and other fields provided in the online application form. As the number of transactions in the banking sector grows rapidly and large volumes of data become available, customer behavior can be analyzed more easily and the risks around loans can be reduced. So, it is very important to predict the loan type and loan amount based on the banks' data.

In this blog post, we will discuss how a Naive Bayes classification model in R can be used to predict loan decisions.

Data Description

The customer loan dataset contains 100+ unique customer records, with each customer represented by one row. The structure of the dataset is as follows:

Input Variables

These variables are called predictors or independent variables.

Customer Demographics (state, gender, age, race, marital status, occupation)
Customer Financials (income, debts, credit score)
Loan Product (loan type)

Data Preprocessing

Data preprocessing involves data cleansing and data preparation. As part of data cleansing, check for missing values.

From the above diagram, you can see that there are no missing values.
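The missing-value check can be sketched in base R as follows. The data frame and its column names are hypothetical stand-ins for the customer loan dataset, which is not reproduced in this post.

```r
# Hypothetical toy data standing in for the customer loan dataset.
loan_data <- data.frame(
  age    = c(34, 45, 29, 52),
  income = c(4200, 6100, 3500, 7800),
  debts  = c(1200, 900, 1600, 2000)
)

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column <- colSums(is.na(loan_data))
print(missing_per_column)

# TRUE if any value anywhere in the data frame is missing.
has_missing <- anyNA(loan_data)
print(has_missing)
```

If any count were non-zero, the affected rows would need to be imputed or dropped before modeling.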

Debt-To-Income (DTI) Ratio

The Debt-To-Income ratio is defined as the ratio of all your monthly debt payments to your gross monthly income. Lenders look at this ratio when deciding whether to lend money or extend credit.

A low DTI indicates that you have a good balance between debt and income. As you might guess, lenders want it to be low, generally below 36%; the lower it is, the better your chances of getting the loan or credit you seek.

DTI = (debts / income) * 100
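As a small worked example of the formula, the sketch below adds a DTI column to a data frame. The values and the column names (income, debts, dti) are assumptions chosen for illustration.

```r
# Hypothetical monthly income and debt figures for three applicants.
loan_data <- data.frame(
  income = c(4000, 6000, 3500),
  debts  = c(1000, 3000, 700)
)

# DTI as a percentage of gross monthly income.
loan_data$dti <- (loan_data$debts / loan_data$income) * 100

# Flag applicants under the 36% guideline mentioned above.
loan_data$dti_below_36 <- loan_data$dti < 36
print(loan_data)
```

Here the three applicants have DTI values of 25, 50, and 20, so only the second one exceeds the 36% guideline.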

Dependent Variables

The response (dependent) variable, loan_decision_status, is what the model predicts: loan approval or denial. It is derived from the loan_decision_type field.

Loan status falls into one of three categories: ‘Approved’, ‘Denied’, and ‘Withdrawn’. Here, ‘Withdrawn’ means that the customer withdrew the loan, for whatever reason, after the bank had approved it. So, treat ‘Approved’ and ‘Withdrawn’ as ‘1’ and ‘Denied’ as ‘0’.

Let us try to predict whether a loan will be approved (1) or denied (0) and classify it accordingly.

Convert the loan_decision_status field to a factor.
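A minimal sketch of this encoding, assuming loan_decision_type is stored as text with exactly the three values described above; the toy rows are hypothetical.

```r
# Hypothetical decision types for four applicants.
loan_data <- data.frame(
  loan_decision_type = c("Approved", "Denied", "Withdrawn", "Denied"),
  stringsAsFactors = FALSE
)

# 'Denied' becomes 0; 'Approved' and 'Withdrawn' both become 1.
loan_data$loan_decision_status <- ifelse(
  loan_data$loan_decision_type == "Denied", 0, 1
)

# Store the target as a factor so classifiers treat it as a class label.
loan_data$loan_decision_status <- factor(
  loan_data$loan_decision_status, levels = c(0, 1)
)
print(table(loan_data$loan_decision_status))
```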

Exclude applicantId, state, and race from further processing, as these fields are not expected to affect the prediction. Also exclude income, debts, and loan_decision_type, since DTI and loan_decision_status already capture them.
Encode the categorical variables (gender, marital status, occupation, loan type) as factors.
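The two steps above might look like this in base R. Apart from applicantId, state, and race, the column names and all values are assumptions, since the post does not list the exact field names.

```r
# Hypothetical three-row extract of the customer loan dataset.
loan_data <- data.frame(
  applicantId          = 1:3,
  state                = c("OH", "CA", "TX"),
  race                 = c("A", "B", "C"),
  gender               = c("Male", "Female", "Female"),
  marital_status       = c("Married", "Single", "Married"),
  occupation           = c("IT", "Business", "Accounting"),
  loan_type            = c("Auto", "Personal", "Home"),
  income               = c(4000, 6000, 3500),
  debts                = c(1000, 3000, 700),
  loan_decision_type   = c("Approved", "Denied", "Withdrawn"),
  dti                  = c(25, 50, 20),
  loan_decision_status = factor(c(1, 0, 1), levels = c(0, 1)),
  stringsAsFactors     = FALSE
)

# Drop identifiers and the fields already summarized by dti and the target.
drop_cols <- c("applicantId", "state", "race",
               "income", "debts", "loan_decision_type")
loan_data <- loan_data[, !(names(loan_data) %in% drop_cols)]

# Encode the remaining categorical predictors as factors.
cat_cols <- c("gender", "marital_status", "occupation", "loan_type")
loan_data[cat_cols] <- lapply(loan_data[cat_cols], factor)
str(loan_data)
```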

Partitioning the Data

In predictive modeling, the data needs to be partitioned into training and test sets. Here, 70% of the data is used for training and the remaining 30% for testing.
After splitting the data, apply feature scaling to standardize the range of the independent variables.
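One way to sketch the 70/30 split and the scaling step in base R is shown below. The simulated data, the seed, and the column names are assumptions; the post does not say which splitting function it used. The test set is scaled with the training set's mean and standard deviation so that no information from the test data leaks into training.

```r
set.seed(123)  # arbitrary seed, only for reproducibility of this sketch

# Simulated stand-in for the prepared customer data (100 rows).
n <- 100
loan_data <- data.frame(
  age          = round(runif(n, 21, 65)),
  credit_score = round(runif(n, 300, 850)),
  dti          = runif(n, 5, 60),
  loan_decision_status = factor(sample(c(0, 1), n, replace = TRUE),
                                levels = c(0, 1))
)

# Randomly assign 70% of the rows to training, the rest to testing.
train_idx    <- sample(seq_len(n), size = floor(0.7 * n))
training_set <- loan_data[train_idx, ]
test_set     <- loan_data[-train_idx, ]

# Standardize numeric predictors using training-set statistics only.
num_cols     <- c("age", "credit_score", "dti")
train_center <- colMeans(training_set[num_cols])
train_sd     <- sapply(training_set[num_cols], sd)

training_set[num_cols] <- as.data.frame(
  scale(training_set[num_cols], center = train_center, scale = train_sd)
)
test_set[num_cols] <- as.data.frame(
  scale(test_set[num_cols], center = train_center, scale = train_sd)
)
```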

Dimensionality Reduction using PCA

The customer data has more than two independent variables, which makes it difficult to plot: a chart needs just two dimensions to show clearly how a Machine Learning model separates the classes.

To reduce dimensions, perform the following:

Before applying PCA, install and load the caret package.
Apply Principal Component Analysis (PCA) to the customer dataset, excluding the dependent variable, and reduce it to two dimensions.
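The PCA step can be sketched with caret's preProcess function as follows. The simulated training and test sets and their column names are assumptions; the PCA is fitted on the training predictors only and then applied to both sets.

```r
# install.packages("caret")  # run once if the package is not installed
library(caret)

set.seed(123)  # arbitrary seed for this sketch

# Simulated, already-scaled stand-ins for the training and test sets.
make_set <- function(n) {
  data.frame(
    age          = rnorm(n),
    credit_score = rnorm(n),
    dti          = rnorm(n),
    loan_decision_status = factor(sample(c(0, 1), n, replace = TRUE),
                                  levels = c(0, 1))
  )
}
training_set <- make_set(70)
test_set     <- make_set(30)

# Fit PCA on every column except the dependent variable; keep 2 components.
predictors <- setdiff(names(training_set), "loan_decision_status")
pca <- preProcess(training_set[predictors], method = "pca", pcaComp = 2)

# Project both sets onto the two components and re-attach the target.
training_pca <- predict(pca, training_set[predictors])
training_pca$loan_decision_status <- training_set$loan_decision_status

test_pca <- predict(pca, test_set[predictors])
test_pca$loan_decision_status <- test_set$loan_decision_status

head(training_pca)
```

The resulting data frames have two predictor columns, PC1 and PC2, plus the target.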

Naive Bayes Classification

Multiple models can be run on the customer dataset and compared by accuracy and error rate in order to choose the best one. In this blog post, the Naive Bayes classification model in R is used.

To apply Naive Bayes classification model, perform the following:

Install and load the e1071 package before running Naive Bayes.
Evaluate the model built on the training dataset against the test dataset.
Use accuracy and error rate to understand how the model behaves on the test dataset.
When several models are compared, determine the best one using these measures.
Use a Confusion Matrix (Misclassification Table) to describe the performance of the classification model on the test data. This table cross-tabulates the actual values against the predicted values, giving the counts of correctly and wrongly classified customers.
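The steps above can be sketched with e1071's naiveBayes function. The simulated PC1/PC2 data below is an assumption standing in for the PCA output, so the accuracy it produces will not match the figure reported in this post.

```r
# install.packages("e1071")  # run once if the package is not installed
library(e1071)

set.seed(123)  # arbitrary seed for this sketch

# Simulated two-component data in which the classes partly overlap.
make_set <- function(n) {
  status <- sample(c(0, 1), n, replace = TRUE)
  data.frame(
    PC1 = rnorm(n, mean = status),
    PC2 = rnorm(n, mean = -status),
    loan_decision_status = factor(status, levels = c(0, 1))
  )
}
training_pca <- make_set(70)
test_pca     <- make_set(30)

# Fit Naive Bayes on the training set.
classifier <- naiveBayes(loan_decision_status ~ ., data = training_pca)

# Predict the class of each test-set customer.
y_pred <- predict(classifier, newdata = test_pca[c("PC1", "PC2")])

# Confusion matrix: actual values in rows, predicted values in columns.
cm <- table(actual = test_pca$loan_decision_status, predicted = y_pred)
print(cm)

# Accuracy is the share of the diagonal; error rate is its complement.
accuracy   <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy
cat("Accuracy:", accuracy, " Error rate:", error_rate, "\n")
```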

On the test dataset, the Naive Bayes classification model achieves an accuracy of 71% and an error rate of 29%. The accuracy may be improved by trying other classification models or by parameter tuning.

Visualizing Test Set Results

The chart explanation is available here
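The original chart is not reproduced here. One common way to draw such a chart, sketched below under the same simulated-data assumptions as before, is to predict the class for every point on a fine grid over the two principal components and overlay the actual test-set customers. The colors, grid step, and variable names are illustrative choices.

```r
library(e1071)

set.seed(123)  # arbitrary seed for this sketch

# Simulated stand-ins for the PCA-reduced training and test sets.
make_set <- function(n) {
  status <- sample(c(0, 1), n, replace = TRUE)
  data.frame(
    PC1 = rnorm(n, mean = status),
    PC2 = rnorm(n, mean = -status),
    loan_decision_status = factor(status, levels = c(0, 1))
  )
}
training_pca <- make_set(70)
test_pca     <- make_set(30)
classifier   <- naiveBayes(loan_decision_status ~ ., data = training_pca)

# Build a grid that covers the test set in the PC1/PC2 plane.
x1 <- seq(min(test_pca$PC1) - 1, max(test_pca$PC1) + 1, by = 0.05)
x2 <- seq(min(test_pca$PC2) - 1, max(test_pca$PC2) + 1, by = 0.05)
grid_set <- expand.grid(PC1 = x1, PC2 = x2)

# Predicted class for every grid point defines the decision regions.
y_grid <- predict(classifier, newdata = grid_set)

# Shade the regions, then overlay the actual test-set customers.
plot(test_pca$PC1, test_pca$PC2, type = "n",
     xlab = "PC1", ylab = "PC2", main = "Naive Bayes (test set)")
points(grid_set, pch = ".",
       col = ifelse(y_grid == 1, "springgreen3", "tomato"))
points(test_pca$PC1, test_pca$PC2, pch = 21,
       bg = ifelse(test_pca$loan_decision_status == 1, "green4", "red3"))
```

Points whose fill color differs from the shaded region they fall in are the misclassified customers counted in the confusion matrix.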