
Lending Club evaluates client credit scores through machine learning (Part 2)

루이
5 Jun 2022

Let’s try DAVinCI LABS, Ailys’ AI data analysis system!

Seeing is believing! Follow along with our run-through of DAVinCI LABS.



Lending Club’s open data is actively used on data analysis platforms like Kaggle and GitHub, not just by data scientists but also by aspiring data experts. Let’s look at the results DAVinCI LABS produces with the Lending Club data.

One characteristic of DAVinCI LABS is that it makes predictions using supervised learning. When you upload a CSV file of tabular data, you designate the target variable (we call it a ‘target field’), and the system recommends the most appropriate algorithm and shows the correlations between the different variables. We’ll walk through DAVinCI LABS in three steps.


1.   Upload a Dataset

The first step to training the machine

To obtain a predicted value, the machine first has to be trained. There needs to be a learned rule such as ‘input A leads to B’, since it’s impossible to create something meaningful from nothing at all. We’ll start by uploading the Lending Club dataset (a CSV file) to DAVinCI LABS.

◆ We split the dataset at an 80:20 ratio. This split is standard in predictive modeling: 80% becomes the training set and the remaining 20% becomes the test set. It’s like dividing a workbook 8:2, studying the 80%, and saving the rest for the final check-up assessment. A further explanation follows in ‘2. Model Generation’.


With a 100-page workbook, we study 80 pages and leave the remaining 20 pages for testing.

 

We split the CSV file at an 80:20 ratio and save the two parts separately. We’ll use this ‘lendingclub_train’ dataset for training!
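For readers who’d like to reproduce this split outside the platform, below is a minimal sketch using pandas and scikit-learn. The file names and the stratify option are our own illustrative choices, not something DAVinCI LABS requires.

```python
# Minimal sketch: split a CSV 80:20 into training and test files.
# File names are illustrative.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("lendingclub.csv")

# Hold out 20% for testing; stratifying on the target keeps the
# ratio of repaid vs. charged-off loans the same in both parts.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["loan_status"]
)

train_df.to_csv("lendingclub_train.csv", index=False)
test_df.to_csv("lendingclub_test.csv", index=False)
```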


 

First, we upload the original data file (lendingclub_train). A preview of the chosen file is displayed so you can catch any upload mistakes.


Once the file has been uploaded, we set the dependent variable of the equation. What we want to predict is whether the client will faithfully carry out their duty of paying off the loan, so we set loan_status as the target field (dependent variable).


Our target field (dependent variable, the y value) is ‘loan_status’.


Next, we choose the discard (ignore) parameters. There are 26 input variables in total, ranging from important ones such as ‘dti’ (debt-to-income ratio) and ‘annual income’ to unhelpful ones like ‘member_id’ and ‘addr_state’ (state of address). We mark the meaningless ones to exclude them from the training process; these are the ‘discard (ignore) parameters’. For example, the input parameter ‘desc’ stands for ‘description of the client’ and merely lists clients’ requests in free text. It adds nothing to building the model, so it is taken out of the process.

 


desc (description), zip_code, and addr_state are completely unrelated to carrying out the loan repayment duty, so they are set to be ignored (discarded).
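If you were preparing the same data by hand, excluding the discard fields is a one-liner in pandas. A sketch (the column list is assumed from the fields named above):

```python
# Sketch: drop fields that carry no signal for loan repayment.
import pandas as pd

train_df = pd.read_csv("lendingclub_train.csv")
ignore_fields = ["member_id", "desc", "zip_code", "addr_state"]
train_df = train_df.drop(columns=ignore_fields)
```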




2.   Model Generation

Let’s build a predictive model from the training data!


DAVinCI LABS offers 11 algorithms in total, from Stabilized Deep Net to LightGBM, and you can pick the one with the highest accuracy. We’ll run all of the algorithms to see which works best.

DAVinCI LABS provides all of these analysis algorithms. We’ll see which one performs best after running the whole process!



◆ Things to keep in mind before officially starting the modeling! ◆


Generally, a single dataset is split into 80% training data and 20% test data; you may recall the partitioning step from above. In DAVinCI LABS, however, that’s not all. There’s more.

A step called validation typically sits in the middle of this data-splitting procedure in order to minimize overfitting.

Image source: https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets


DAVinCI LABS has a UX/UI designed so even non-experts can build a training model conveniently. Just follow the provided guidance and click along, and the Pareto ratio (8:2) is applied automatically! By randomly holding out part of the training data and running a validation step, you can generate an even more refined predictive model.



Data partitioning of ‘train : validation = 8:2’ is set as the optimal ratio!
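In code, carving the validation set out of the training data is the same split applied once more. A sketch continuing from the one above:

```python
# Sketch: split the training data again at 80:20 into train and validation.
from sklearn.model_selection import train_test_split

train_part, val_part = train_test_split(
    train_df, test_size=0.2, random_state=42,
    stratify=train_df["loan_status"],
)
# Net result: 64% train, 16% validation, 20% test of the original data.
```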




Training is rolling along busily.


Once training finishes, the machine recommends ‘Ridge Regression’ as the best-suited algorithm of them all, with the highest accuracy of 0.9996. Even if you can’t follow the mathematical algorithms and equations underneath, the machine handles everything when it comes to model implementation. (Imagine just how excruciatingly demanding it would be to compute each algorithm’s accuracy by hand…)


According to the training results, ridge regression is identified as the most appropriate algorithm, so that’s what we’ll carry forward to the validation set.
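DAVinCI LABS does this selection internally, but conceptually it works like the sketch below: fit several candidate models and keep the one with the best validation accuracy. The candidates here are illustrative scikit-learn stand-ins (RidgeClassifier is scikit-learn’s classification analogue of ridge regression), not the platform’s actual implementations, and the sketch assumes the input fields have already been numerically encoded.

```python
# Sketch: pick the algorithm with the highest validation accuracy.
# Assumes features are already numeric (raw Lending Club data has text fields).
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier

X_cols = [c for c in train_part.columns if c != "loan_status"]
candidates = {
    "ridge": RidgeClassifier(),
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=42),
}

best_name, best_model, best_acc = None, None, 0.0
for name, model in candidates.items():
    model.fit(train_part[X_cols], train_part["loan_status"])
    acc = model.score(val_part[X_cols], val_part["loan_status"])
    if acc > best_acc:
        best_name, best_model, best_acc = name, model, acc

print(f"best: {best_name} (validation accuracy {best_acc:.4f})")
```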


We can also discover the five parameters that correlate most strongly with the target field, ‘loan_status’. According to the ‘Field Importance’ chart, the parameter ‘is_bad’ matters most, with an importance of almost 100%.


‘Field Importance’ shows the five variables most highly correlated with the target field.
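We don’t know which importance method DAVinCI LABS uses internally, but one model-agnostic way to reproduce a chart like this is permutation importance: shuffle one field at a time and measure how much the accuracy drops. A sketch, continuing from the selection loop above:

```python
# Sketch: rank fields by permutation importance on the validation set.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    best_model, val_part[X_cols], val_part["loan_status"],
    n_repeats=10, random_state=42,
)
ranked = sorted(zip(X_cols, result.importances_mean),
                key=lambda pair: pair[1], reverse=True)
for field, score in ranked[:5]:  # the five most important fields
    print(f"{field}: {score:.3f}")
```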




3.   Test

Now that the machine has finished studying, it’s time for the official assessment!


Finally, we’ve reached the ultimate goal: predicting whether the client will pay back the loan! Do you remember the file ‘lendingclub_train’ from above? The 80% training portion has been taken care of, so the leftover 20% is now waiting to be used for testing the model. We’ll proceed with the recommended algorithm, ridge regression. First, we upload the 20% of test data to the model as shown below!



This time it’s not the train file. Now the test file is up!



Now that the test is rolling, we can retrieve the results of applying the various algorithms. The chart below shows the results of applying ridge regression, the algorithm chosen out of the eleven.



The confusion matrix shows how well the predicted values agree with the actual outcomes. The result here is spotless, with no prediction errors at all!
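Computing the confusion matrix yourself is straightforward once predictions are in hand. A sketch continuing from the earlier ones, assuming the test file is preprocessed the same way as the training data:

```python
# Sketch: compare test-set predictions with the actual outcomes.
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

test_df = pd.read_csv("lendingclub_test.csv").drop(columns=ignore_fields)
y_true = test_df["loan_status"]
y_pred = best_model.predict(test_df[X_cols])

print(confusion_matrix(y_true, y_pred))
print("accuracy:", accuracy_score(y_true, y_pred))
```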


Shall we look at the model’s predictions for each client?


The prediction column (labeled target_prediction) shows a unanimous result: every client pays back their loan and completes their duty!

Accuracy on the training set measures 0.9998 and on the validation set it comes out to 1, which in fact indicates overfitting.



Applying ridge regression yields an accuracy very close to 1. This is a case of overfitting, caused here by the unavoidable alteration of the confidential client information belonging to Lending Club. Overfitting is what happens when a model is ‘over-trained’ on its data. Even so, this walkthrough is meaningful as an example of choosing the recommended algorithm for modeling and reaching the highest accuracy.



◆ One of DAVinCI LABS’ signature features, the Correlation Chart! ◆

 

loan_status and is_bad are very highly correlated; their colors in the chart look almost like twins, so overlapping features can be spotted at a glance, without any extra data processing.


The ‘Correlation Chart’ organizes and lists the parameters that are linked to one another.

The target field (loan_status) and the parameter ‘is_bad’ show a correlation of 1. In other words, whenever loan_status turns out to be “Fully Paid”, is_bad appears as “No”. It’s common sense that a client who fully pays off a loan would be labeled a “good client” rather than a “bad” one. Conversely, if a client does the opposite and is “Charged Off”, the value of is_bad will of course be “Yes”. Such a clean, exceptionless relationship is what produces a variable correlation of 1.
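You can verify such a one-to-one relationship yourself by cross-tabulating the two fields: if all counts fall on one diagonal, the fields encode the same information, and a leaked duplicate of the target like ‘is_bad’ should be discarded before training. A sketch:

```python
# Sketch: detect a leaked field by cross-tabulating it with the target.
import pandas as pd

df = pd.read_csv("lendingclub_train.csv")
print(pd.crosstab(df["loan_status"], df["is_bad"]))
# All counts on a single diagonal means the two fields move in
# lockstep, i.e. a correlation of 1.
```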


So, that was our run-through of feeding open data into the DAVinCI LABS machine learning model. To recap the roles of DAVinCI LABS:


1) Automates machine learning to generate a predicted value

- Performs supervised learning on tabular data



The machine trains on the tabular data and creates predicted values.


2) Shows how strongly the variables are linked via the correlation chart

- Thereby reveals the common characteristics of ‘good clients’ in advance


3) Recommends the most suitable algorithm for the data at hand

- Comes up with the best analysis method out of a total of eleven algorithms

 

That’s been our journey with DAVinCI LABS: using Lending Club’s open data to figure out the x and y variables, seeing how the data is used at another company, and putting it to work in our machine learning model. We hope this gave you an idea of how machine learning actually works!