Bank loans carry risk for both the lending bank and the borrower. Analyzing that risk requires understanding both the kind of risk involved and its level. Banks therefore need to assess customers for loan eligibility so that they can target the customers who qualify.
Banks want to automate the loan eligibility process in real time, based on customer details provided in the online application form, such as gender, marital status, age, occupation, income, and debts. As the number of transactions in the banking sector grows rapidly and large volumes of data become available, customer behavior can be analyzed more easily and the risks around loans can be reduced. It is therefore important to predict the loan type and loan amount based on the banks' data.
In this blog post, we discuss how a Naive Bayes classification model in R can be used to predict loan decisions.
The customer loan dataset contains details of more than 100 unique customers, with each customer represented in its own row. The structure of the dataset is as follows:
These variables are called predictors or independent variables.
Data preprocessing involves data cleansing and data preparation. As part of data cleansing, check for missing values.
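A missing-value check can be sketched in base R as shown here. The data frame `customer_loan` and its column names are hypothetical stand-ins for the real dataset, which is not reproduced in this post.

```r
# Hypothetical stand-in for the customer loan data; names and values are assumptions.
customer_loan <- data.frame(
  gender = c("Male", "Female", "Female", "Male"),
  age    = c(34, 45, 29, 52),
  income = c(5200, 4100, 3900, 6100),
  debts  = c(1500, 1800, 900, 2500),
  stringsAsFactors = FALSE
)

# Count missing values per column; all zeros means nothing needs imputing.
missing_per_column <- colSums(is.na(customer_loan))
print(missing_per_column)

# TRUE when the whole data frame is free of NA values.
no_missing <- !anyNA(customer_loan)
print(no_missing)
```

If any count is non-zero, those rows need to be imputed or dropped before modeling.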
From the above diagram, you can clearly see that there are no missing values.
The debt-to-income (DTI) ratio is defined as the ratio of all your monthly debt payments to your gross monthly income. Lenders look at this ratio when deciding whether to lend money or extend credit.
A low DTI indicates a good balance between debt and income. As you might guess, lenders want it to be low, generally below 36%; the lower it is, the greater your chances of getting the loan or credit you seek.
DTI = (debts / income) * 100
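As a minimal sketch of the formula in R, assuming hypothetical monthly `debts` and `income` vectors with made-up values:

```r
# Hypothetical monthly figures for four applicants (illustrative values only).
income <- c(5200, 4100, 3900, 6100)
debts  <- c(1500, 1800, 900, 2500)

# DTI expressed as a percentage, exactly as in the formula above.
dti <- (debts / income) * 100
print(round(dti, 2))

# Flag applicants under the 36% guideline mentioned in the text.
below_threshold <- dti < 36
print(below_threshold)
```

In a data frame, the same expression can be assigned to a new column so that DTI becomes one more predictor.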
A response (dependent) variable, loan_decision_status, is required to predict loan approval or denial. It is created from the loan_decision_type field.
The loan status falls into one of three categories: 'Approved', 'Denied', and 'Withdrawn'. Here, 'Withdrawn' means that the customer withdrew the loan, for whatever reason, after the bank approved it. So, treat 'Approved' and 'Withdrawn' as '1' and 'Denied' as '0'.
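This encoding can be sketched in R as follows. The sample values of `loan_decision_type` are invented for illustration; only the field names come from the text.

```r
# Illustrative values of the loan_decision_type field (not real data).
loan_decision_type <- c("Approved", "Denied", "Withdrawn", "Approved", "Denied")

# 'Denied' maps to 0; 'Approved' and 'Withdrawn' both map to 1.
loan_decision_status <- ifelse(loan_decision_type == "Denied", 0, 1)

# Store the response as a factor so classifiers treat it as a class label.
loan_decision_status <- factor(loan_decision_status, levels = c(0, 1))
print(table(loan_decision_status))
```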
Let us try to predict whether a loan will be approved (1) or denied (0) and classify it accordingly.
Because the customer data has more than two independent variables, it is difficult to plot: a chart needs two dimensions to visualize how a Machine Learning model works.
To reduce dimensions, perform the following:
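One common way to get down to two dimensions is principal component analysis (PCA). The sketch below uses base R's `prcomp` on synthetic predictors; the choice of PCA, the column names, and the data are all assumptions made for illustration and may differ from the steps used on the real dataset.

```r
# Synthetic numeric predictors standing in for the customer data (assumed columns).
set.seed(42)
n <- 100
predictors <- data.frame(
  age    = round(runif(n, 21, 65)),
  income = round(runif(n, 2000, 9000)),
  debts  = round(runif(n, 200, 4000))
)
predictors$dti <- (predictors$debts / predictors$income) * 100

# Center and scale so that no variable dominates because of its units.
pca <- prcomp(predictors, center = TRUE, scale. = TRUE)

# Keep the first two principal components as the reduced feature set.
reduced <- as.data.frame(pca$x[, 1:2])
print(head(reduced))

# Share of total variance retained by the two components.
print(summary(pca)$importance["Cumulative Proportion", 2])
```

Categorical predictors such as gender would need to be converted to numeric form before a step like this.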
Multiple models can be run on the customer dataset and compared on performance and error rate to choose the best one. In this blog post, the Naive Bayes classification model in R is used.
To apply the Naive Bayes classification model, perform the following:
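A minimal sketch of fitting and applying a Naive Bayes classifier in R, assuming the `e1071` package is installed and using synthetic data in place of the customer dataset. The column names `PC1`, `PC2`, the 70/30 split, and the generated values are illustrative assumptions.

```r
library(e1071)  # assumed package providing naiveBayes()

# Synthetic two-feature data with a binary response (not the real dataset).
set.seed(42)
n <- 120
pc1 <- rnorm(n)
pc2 <- rnorm(n)
status <- factor(ifelse(pc1 + rnorm(n, sd = 0.8) > 0, 1, 0), levels = c(0, 1))
loans <- data.frame(PC1 = pc1, PC2 = pc2, loan_decision_status = status)

# Hold out 30% of the rows for testing.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- loans[train_idx, ]
test  <- loans[-train_idx, ]

# Fit the classifier on the training rows, then predict classes for the test rows.
model <- naiveBayes(loan_decision_status ~ ., data = train)
predicted <- predict(model, newdata = test)
print(table(Predicted = predicted, Actual = test$loan_decision_status))
```

For numeric predictors, `naiveBayes` models each feature within a class as a Gaussian, which is why the reduced numeric features can be passed in directly.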
With Naive Bayes classification, the model's accuracy is evaluated as 71% and its error rate as 29%. The accuracy may be improved by using other classification models with parameter tuning.
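Accuracy and error rate are typically read off a confusion matrix. The sketch below shows the arithmetic on a small hand-made example; the `actual` and `predicted` vectors are toy values and do not reproduce the 71% figure above.

```r
# Toy labels for ten loans (illustrative only, not model output).
actual    <- factor(c(1, 0, 1, 1, 0, 1, 0, 1, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 1, 1, 0), levels = c(0, 1))

# Rows are predictions, columns are true labels.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Accuracy is the share of correct predictions (the diagonal of the matrix).
accuracy   <- sum(diag(confusion)) / sum(confusion)
error_rate <- 1 - accuracy
print(c(accuracy = accuracy, error_rate = error_rate))
```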
The chart explanation is available here