Lending Club evaluates client credit scores through machine learning (Part 1)

루이
5 Apr 2022

“Technology that uses machines to study humans”

Whether someone is trustworthy is a matter of personal judgment. When it comes to financing, however, everyone would agree on the need for a reasonable principle applied equally to all. Not just from a statistical point of view but also from common sense, there should be a set of conditions for lending money to someone. Those conditions should be based strictly on financial information, such as the person’s spending habits and whether they are on the verge of bankruptcy, rather than on everyday details like personality or frequently visited places. After all, you can tell the people you would trust with money apart from the people you are close to but would not trust with even the smallest financial transaction, right?

‘Lending Club’, a pioneering P2P (Peer-to-Peer) fin-tech company that brokers loans between individuals, has succeeded in evaluating the credit of each client using around 150 variables. This kind of system is commonly known as a CSS (Credit Scoring System). Once AutoML is used to build a ‘machine learning predictive model’ on that data, it returns a simple result: whether the applicant is creditworthy or not.


How can machines be trusted to evaluate human credibility?

The key is that machines are in fact much more rational than we humans are… In the end, we tend to follow our hearts rather than trust our logic, right?


The Lending Club project grew out of the idea “why not try reducing financial loss by predicting the customer’s likelihood of loan repayment?” and consequently saved $8,300 million worth of loss (I mean, I would invest in this as well. Who wouldn’t?). Unless you already know a fair amount about machine learning, you may be in awe of how such algorithms work, so let’s dive into how exactly this project was woven into a complete system.


“Once you enter your financial information, I shall grant you wisdom: whether you can be trusted with money or not.”


‘It’s now possible to predict whether a client will pay back their loan via machine learning techniques.’

This is made possible by supervised learning.

Machine learning methods fall into three types: ‘supervised learning’, ‘unsupervised learning’, and ‘reinforcement learning’.


Types of machine learning methods.


Supervised learning is a learning method that amounts to “predicting the future based on current knowledge”. It requires input features and a target variable, which together form a training set. The computer is trained on this set, where the correct answers are already known, so that it learns to generate the correct answer for new inputs. A good analogy would be a smart and studious student using his or her accumulated knowledge to solve a familiar problem. Supervised learning typically splits into two threads: classification and regression. Today we will focus on classification for the Lending Club case.
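To make the idea concrete, here is a minimal sketch of classification in Python. It is not the method used in the Lending Club project: it swaps in a simple k-nearest-neighbors vote, and the numbers, labels, and the `predict` helper are all made up for illustration.

```python
from collections import Counter
import math

# Toy training set (made-up numbers, for illustration only).
# Input features: (annual income in $1000s, interest rate in %).
TRAIN_X = [(85, 7.5), (92, 6.9), (60, 9.0), (30, 21.0), (28, 19.5), (35, 23.0)]
# Target variable: the known answer for each row above.
TRAIN_Y = ["repaid", "repaid", "repaid", "not repaid", "not repaid", "not repaid"]


def predict(x, k=3):
    """Classify x by majority vote among its k nearest training rows."""
    by_distance = sorted(
        zip(TRAIN_X, TRAIN_Y),
        key=lambda pair: math.dist(pair[0], x),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


print(predict((80, 8.0)))   # a new client who resembles the "repaid" rows
print(predict((32, 20.0)))  # a new client who resembles the "not repaid" rows
```

The training rows play the role of the student’s accumulated knowledge, and `predict` is the familiar problem being solved for a client the model has not seen before.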


Source: Picture edited (translated) from https://blogsabo.ahnlab.com/2605

Unsupervised learning refers to finding structure in data without being given the answers. Its main characteristic is that there is no target variable, only input variables. It’s a learning method that tracks discernible patterns in the data to reach an answer. Another analogy: there are sometimes (in fact, very rarely) extremely bright pupils who can make amazing inferences from a very limited amount of prior learning, aren’t there? It’s similar to those cases. The commonly known unsupervised learning methods are cluster analysis, PCA (Principal Component Analysis), and association rules. These are specialized techniques, and thankfully we don’t need them for today’s session, so let’s leave it here.

Source: Identical to the above

Reinforcement learning trains the machine to obtain the greatest reward from the actions it takes in response to the current state. It’s more difficult than the two learning methods mentioned above. You can think of it as a game in which the player improves through constant training made up of repeated reward and punishment. Do you recall the ‘go’ match of the century between AlphaGo and Lee Sedol in 2016? Reinforcement learning is the machine learning method used in AlphaGo, the AI created by Google DeepMind.

Source: Identical to the above

I’m sure all of you have had the excruciatingly tiresome experience of manually entering data into an Excel sheet, number by number. By feeding such a file to a machine learning predictive model, you can find out which variables are most highly correlated with loan repayment out of a vast set such as ‘address’, ‘spending habits over the recent 6 months’, ‘annual income’, and so on. Now let’s examine together the kinds of variables that went into the model, as well as how the results turned out. In the following post, we will continue the discussion with data utilization examples through a DAVinCI LABS demonstration.


Figuring out a client’s credit score


What variables should be used and what target should be set as an output to create the best evaluation system?

A diagram intuitively capturing how supervised learning works

Source:  https://www.enjoyalgorithms.com/blog/classification-of-machine-learning-models

Understanding the basics of ‘supervised learning’ is important for following the flow of credit evaluation. This type of learning is shown in the intuitive diagram above. To put it in a sentence, you train the machine with known data and then challenge it with a familiar task to check its performance. The machine interprets new data with the same model it learned from the training set, applying it to a different set of data to generate a new set of outputs. Would it help to say it’s similar to a student who is diligent and clever yet passive?

Once a model has been trained through supervised learning on the Excel file data, it can repeat its work over and over on any similar task, as long as the data follows the same structure. Here, a renowned open-source software company in Silicon Valley appears. Its name is H2O.ai, and it runs AutoML modeling on the data provided by Lending Club.


H2O.ai: Let’s see how this open-source software company utilized the existing data here.


Lending Club records each user’s repayment behavior in ‘loan_status’, a variable that indicates whether the user has fully paid the loan or not. These past records are the source data used to predict the outcome of similar loans.

In other words, the target variable of the predictive model (also called the dependent variable or the output) is ‘loan_status’.

The target variable takes one of two values: ‘Fully Paid’ or ‘Charged Off’. That is, it’s classified into two kinds of answers, loan repaid and loan not repaid. To be ‘charged off’ means that the party owed the money (the creditor) has given up on recovering the lent amount from the person who needs to pay it back (the debtor). The loss falls entirely on the creditor. Thus, it’s safe to say that this whole project was conducted to minimize the size of this loss.

 

A and B both got a loan from Lending Club: A for his surgery, and B for refurbishing her home interior.

   

A diligently pays back the amount he has borrowed, while B doesn’t, despite constant reminders from Lending Club.

Afterward, Lending Club records the outcomes for A and B. A is recorded as “Fully Paid”, and B, on the contrary, is recorded as “Charged Off”.


There are record types other than “Fully Paid” and “Charged Off”, such as “Current” and “Late”, but we kept the pool simple with just these two values. You could say we trimmed the input data to make it easier for readers to follow.

Moving on, which variables were taken into consideration? There were too many (approximately 150), so we’ll list a representative few.
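As a minimal sketch, the filtering described above could look like the following in Python. The sample rows are made up, and while ‘loan_status’ and ‘int_rate’ appear in this post, the column name `annual_inc` and the helper `load_finished_loans` are assumptions for illustration.

```python
import csv
import io

# Made-up sample in the shape of a loan CSV (illustration only).
# `annual_inc` is an assumed column name.
SAMPLE_CSV = """annual_inc,int_rate,loan_status
85000,7.5,Fully Paid
30000,21.0,Charged Off
52000,12.3,Current
41000,15.8,Late
67000,9.9,Fully Paid
"""

# The two final outcomes kept as target values.
KEEP = {"Fully Paid", "Charged Off"}


def load_finished_loans(text):
    """Return only the rows whose loan_status is one of the two final outcomes."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row["loan_status"] in KEEP]


rows = load_finished_loans(SAMPLE_CSV)
print(len(rows), [row["loan_status"] for row in rows])
```

Rows marked “Current” or “Late” are dropped because their final outcome is not yet known, which leaves a clean two-class problem.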

One major input variable is ‘annual income’. Beyond that, there are also ‘(number of) open accounts’, ‘(number of accounts with) loan delinquency over the recent two years’, ‘total balance’, ‘interest’, ‘home ownership (rent / own / mortgage)’ and so on. Can you also spot personal information such as zip code and member ID? A CSV file holding all this information is used for modeling.


Finance data is arranged in columns in an Excel file. ‘loan_status’ is the name of the target variable we are working to predict.


A notable point of Lending Club’s open data is that it sorts loans into good and bad, defining the situation before any prediction is made. When a client’s credit is good or the loan has been paid off, the loan is recorded as a ‘good loan (is_bad= “No”)’, while the opposite case is recorded as a ‘bad loan (is_bad= “Yes”)’. Labeling the loan status this way further simplifies the training process. This label comes up again in Part 2 of this series (Lending Club Part 2).

Once you put this file into the machine learning predictive model, the machine learns how the loan status relates to annual income, loan interest, and average expenditure. In other words, based on the data it was trained on, the model can now evaluate the variable values of new data and predict whether a new customer will repay in full. How intuitive!
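Here is a small sketch of that labeling step in Python. It assumes the ‘is_bad’ label is derived purely from the final ‘loan_status’, which is a simplification; the helper name `to_is_bad` and the sample statuses are hypothetical.

```python
from collections import Counter


def to_is_bad(loan_status):
    """Map a final loan_status to the is_bad label ("Yes" means a bad loan)."""
    mapping = {"Fully Paid": "No", "Charged Off": "Yes"}
    if loan_status not in mapping:
        # Statuses such as "Current" have no final outcome yet.
        raise ValueError(f"no is_bad label defined for {loan_status!r}")
    return mapping[loan_status]


# Made-up statuses for illustration.
statuses = ["Fully Paid", "Charged Off", "Fully Paid", "Fully Paid"]
labels = [to_is_bad(status) for status in statuses]
print(Counter(labels))  # how many good and bad loans there are
```

Counting the labels is worth doing early: when bad loans are much rarer than good ones, plain accuracy can look high even if the model misses most of them.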


Results of Machine Learning:

You get not just the predicted target values but also the correlations across variables!


H2O.ai uses H2O-3 AutoML, which trains a variety of algorithms such as XGBoost, GLM, Deep Learning, and GBM. Which one turns out to be the most accurate?

 

https://www.h2o.ai/blog/building-ai-ml-models-on-lending-club-data-with-h2o-ai-part-2/


Since the ‘Stacked Ensemble Model’ shows the highest AUC, it seems to be the most suitable one for this data. AUC stands for ‘Area Under the (ROC) Curve’, a statistical measure that serves as a performance index for binary classification. The closer its value is to 1, the better the model separates the two classes.

With this Stacked Ensemble algorithm, we can predict the target variable for new input data.

The predicted value from the stacked ensemble algorithm: Bravo! It seems like all ten clients will repay their loans in full!

Do you recall that machine learning models can not only recommend algorithms but also tell us which input variables are most strongly related to the target?
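To get a feel for what AUC measures, here is a minimal sketch in Python. It uses the pairwise definition: AUC is the share of (bad loan, good loan) pairs in which the bad loan received the higher predicted score, with ties counted as half. The function name `auc` and the sample scores are made up for illustration, and this is not how H2O computes it internally.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class (e.g. charged off), 0 otherwise.
    scores: predicted score for the positive class, higher means more likely.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
print(auc(labels, [0.9, 0.8, 0.3, 0.1]))  # every bad loan ranked above every good one
print(auc(labels, [0.9, 0.2, 0.3, 0.1]))  # one pair ranked the wrong way round
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))  # scores carry no information
```

A model that ranks every pair correctly scores 1, while one that cannot tell the classes apart lands at 0.5, the same as guessing.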


“Machine learning models also tell us which input features are most strongly connected to the target variable!”


Blue bars indicate a positive correlation with the target variable, and orange bars indicate a negative one.

Source: Same as the above.

The graph above shows the variable correlations analyzed by the H2O GLM model. From it we can gather that a client is more likely to be marked as charged off when the interest rate is higher, the credit grade is lower, the number of delinquent accounts is larger, and the home is rented rather than mortgaged. On the contrary, a loan is more likely to be fully paid off when the client has a mortgage, a high credit grade, and a high credit limit. Of the variables listed in the graph, the one with the highest correlation with a loan being ‘charged off’ is ‘int_rate’, the interest rate. How helpful that you can figure out the correlated variables as well!

In this session, we mainly focused on how the Lending Club data was put together and how H2O.ai’s machine learning model analyzed it with its choice of algorithms. Coming up next: running this exact data through DAVinCI LABS, Ailys’ supervised learning predictive model, and of course its final results!
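For the curious, here is a rough sketch of where such signed bars can come from. It assumes the GLM is a logistic regression on standardized inputs and fits one with plain gradient descent in Python. The eight rows are made up, the helper names are hypothetical, and the resulting numbers say nothing about the real Lending Club data.

```python
import math

# Made-up rows for illustration: (int_rate in %, annual income in $1000s, is_bad)
ROWS = [
    (7.5, 85, 0), (6.9, 92, 0), (9.0, 60, 0), (12.0, 70, 0),
    (21.0, 30, 1), (19.5, 28, 1), (23.0, 35, 1), (16.0, 40, 1),
]


def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def fit_logistic(features, labels, steps=500, lr=0.5):
    """Plain gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(labels), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of is_bad
            for j in range(d):
                grad_w[j] += (p - y) * x[j] / n
            grad_b += (p - y) / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


int_rate = standardize([row[0] for row in ROWS])
income = standardize([row[1] for row in ROWS])
X = list(zip(int_rate, income))
y = [row[2] for row in ROWS]

(w_int_rate, w_income), _ = fit_logistic(X, y)
print("int_rate coefficient:", round(w_int_rate, 2))
print("annual income coefficient:", round(w_income, 2))
```

On this toy data the interest rate coefficient comes out positive (pushing toward a bad loan) and the income coefficient negative (pushing toward a good loan). Standardizing first puts both features on the same scale, so the sizes of the coefficients can be compared.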