1  Artificial Intelligence vs Machine Learning vs Deep Learning 

2  Algorithm and Model 

3  Type: Supervised Learning and Unsupervised Learning 

4  ML workflow 

5  DAVinCI LABS Workflow

1  Problem Definition 

2  Data Preparation   

3  Variable Processing   

4  Modeling   

5  Model Evaluation  

  • Evaluation Index  
  • Overfitting vs Underfitting 

6  Result Interpretation 

  • Explanation

7  Model Deployment and Operations

  • Manual Combined Variable 
1
Why do you need data analysis?
With the start of the 4th industrial revolution, many companies are trying to make better business decisions using the data they have accumulated over the years.

For any organization to make the right decisions, it must rule out subjective bias and derive insights from objective correlations, in other words, from data. The rapid growth of many industries is also driven by the prompt analysis and management of diverse customer information, which leads to customer satisfaction and the benefits that follow, making data analytics ever more important.


Today, data analytics in various industries has progressed rapidly, from traditional statistical analysis to the use of artificial intelligence to gain insights. So what exactly are AI, machine learning, and deep learning, terms everyone has heard at least once?

2
What is Machine Learning?


Artificial Intelligence vs Machine Learning vs Deep Learning

“Artificial intelligence”, “machine learning”, and “deep learning” are concepts we inevitably encounter today. At first glance they seem like separate concepts, but in reality they nest as “artificial intelligence ⊃ machine learning ⊃ deep learning”: artificial intelligence, the highest-level concept, refers to realizing human thought and learning abilities in computer programs, while machine learning is a lower-level concept that refers to a concrete approach to implementing artificial intelligence.

Above all, machine learning has become a hot topic and is the most widely used approach, because the data that companies hold is best suited to analysis with machine learning.

To be more specific, machine learning is an artificial intelligence technique that analyzes data using algorithms, learns through that analysis, and judges and predicts information it has not been given, based on what it has learned. Also, unlike traditional programming, where rules are given to find the answers, machine learning finds the rules from the answers it is given.

Traditional Programming vs Machine Learning 

The biggest difference between traditional statistical analysis and machine learning lies in the goal rather than in the algorithm or method used. For machine learning methods, the goal is to increase the probability of a successful prediction; the reliability of the model, sophisticated assumptions, and which factors are important and why all matter relatively less than in traditional statistical analysis. Traditional statistical analysis, on the other hand, aims to reduce the probability of failure through an established distribution or assumption; it pursues simplicity rather than complexity in the model, and the interpretability of parameters is also important. For example, such an analysis is done to explain why customers are spending their money.
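To make the "rules given" versus "rules learned" contrast concrete, here is a minimal sketch. It uses scikit-learn as a stand-in learner and invented loan data; the thresholds and column meanings are assumptions for illustration only.

```python
from sklearn.tree import DecisionTreeClassifier

# Traditional programming: the rule is written by hand to produce answers.
def approve_loan(income, debt):
    return income > 50_000 and debt / income < 0.4  # hypothetical rule

# Machine learning: the answers (labels) are given, and the rule is learned.
X = [[60_000, 10_000], [30_000, 20_000], [80_000, 5_000], [40_000, 30_000]]
y = [1, 0, 1, 0]  # 1 = repaid, 0 = defaulted (hypothetical history)

model = DecisionTreeClassifier().fit(X, y)

print(approve_loan(55_000, 12_000))       # answer from the hand-written rule
print(model.predict([[55_000, 12_000]]))  # answer from the learned rule
```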


Deep learning is a sub-concept of machine learning and one of its representative methodologies. Rather than learning along a path set by a hand-specified algorithm, it is characterized by independently learning from data and making decisions through a neural network composed of multiple layers: it solves problems by classifying large amounts of data by itself and then identifying patterns. In addition, deep learning as used in image recognition and voice recognition requires a lot of data for training, and the more data, the higher the accuracy.


Here, let's look at the algorithms that are repeatedly mentioned, and at the concept of a model, which is used just as often.

2


Algorithm and Model

Machine learning is a series of processes that analyze and predict data using algorithms. By definition, an algorithm is a combination of calculations or rules that specifies the order of steps required to solve a problem, and a model is an algorithm that has completed training on data, a concept built on top of the algorithm.


To give an example for a more intuitive understanding: the electronic parts are the data, the assembly manual is the algorithm, and the assembled electronic product is the model. The product cannot operate with only the parts or only the manual.


Also, the biggest difference is that an algorithm alone cannot make a decision, whereas a model returns output based on input, which can be used for decision making. Applied to DAVinCI LABS: the user uploads the prepared data, trains it with the various algorithms DAVinCI LABS provides, selects a champion model, and uses it for decision making.
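In code terms, the distinction looks roughly like this. This is a scikit-learn illustration on toy data, not DAVinCI LABS internals:

```python
from sklearn.ensemble import RandomForestClassifier

X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]   # toy training data
y_train = [0, 1, 1, 0]

algorithm = RandomForestClassifier(random_state=0)  # the "manual": cannot decide anything yet
model = algorithm.fit(X_train, y_train)             # the "assembled product": trained on data
print(model.predict([[1, 0]]))                      # only the model returns a decision
```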

3


Type: Supervised Learning and Unsupervised Learning

In the previous chapter, we learned that machine learning is the act of a computer learning from data. Just as humans learn knowledge in various ways, how a computer learns is also subdivided in various ways. Typically, it is divided into supervised learning and unsupervised learning.


If we compare each concept to human learning: humans learn knowledge in various ways, but most learning starts from records of previous experience, or, when there are no prior examples, knowledge is acquired directly through repeated attempts. Knowledge that has already been experienced is recorded and passed on together with its outcomes; if you learn from the record, you can make inferences when you face a situation similar to a recorded case. However, not all knowledge has been experienced before. For completely new knowledge there is no right answer, so learning means working the knowledge out piece by piece.

The same holds when a computer learns; let's assume that 'knowledge = data'. If the data has a history, the computer can learn that history and draw appropriate conclusions in similar situations. This is called supervised learning. In other words, it can predict future possibilities by learning from records of the past.


On the other hand, what about data with no history? Since the computer does not know the correct answers, it aims to understand patterns from the data itself and to group similar kinds of knowledge (data samples). This is what we call unsupervised learning.

Supervised learning is the most used in real life, because it is usually of more interest to predict what is going to happen than to simply spot trends. In this context, DAVinCI LABS provides automated supervised learning. Therefore, the data you upload to DAVinCI LABS must include the historical record of the target you want to predict. For example, if you are developing a customer churn prediction model, the target is "churn or not".
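A minimal sketch of the two learning types with scikit-learn, on invented customer data (the two features and labels are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = [[25, 1], [60, 4], [35, 2], [55, 3]]   # hypothetical customer features
y = [0, 1, 0, 1]                            # history: churned (1) or not (0)

# Supervised: the history (y) is the "record" the computer learns from.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[50, 3]]))               # infer the outcome for a similar case

# Unsupervised: no history; group similar samples by pattern alone.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                           # cluster assignments, no "answer"
```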

4


ML workflow

In general, creating a predictive model using machine learning involves a series of processes: problem definition, data preparation, model development, model evaluation, model deployment, and operation (prediction). (A minimal code sketch of these steps follows the numbered list below.)


You must approach the problem definition correctly, so that you can prepare the data properly and achieve more valuable results. Consider the case of creating a loan approval model for new customers:


1

Problem Definition  

Train on the loan repayment history of existing customers to create a better decision-making model. In other words, among customers who would have been rejected under the existing internal rules, approve those who can afford to repay, and among customers who would have been approved, reject those who cannot. In problem definition, it is important to understand your business and data well and to set the task you ultimately want to solve.


2

Data Preparation  

Data is rarely all in one place, so you need to find the data relevant to the problem definition. For example, create a large dataset that characterizes the customer, including variables that reflect the customer's financial information and appraised social credit rating scores. Of course, it must also include the target values that label the correct answers, i.e., the repayment history, which is essential for training. You need correctly labeled, valuable data to reach a valuable conclusion.


3

Model Development  

Develop a model that determines whether a new loan applicant should be approved, by training the prepared data with various algorithms.


4

Model Evaluation  

The developed model is evaluated before deployment and operation to determine whether to use it. The evaluation metrics differ depending on the target type (regression or categorical). Loan approval is a binary classification problem, so the final model is selected using the corresponding evaluation metrics.


5

Model Deployment  

No matter how powerful a predictive model is, it is meaningless if practitioners cannot use it. Therefore, once model evaluation is completed, the model must be deployed for operation. Typically, a trained model is applied to the operating environment through separate software, a prediction server.


6

Operation (apply prediction result)   

The deployed model shows in real time whether new applicants are approved during the operation phase. However, such models cannot be used forever. Why? Because models age, just like humans. As time goes by, the data of new applicants becomes historical data that can be trained on, and the trends in the data change. Therefore, to prevent a decline in predictive performance, you need to retrain the model according to an internal standard, or, if no such standard exists, establish one for when to retrain.
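As referenced above, here is a minimal end-to-end sketch of steps 2 through 6 with scikit-learn. The data is synthetic, the column semantics are assumptions, and deployment is reduced to serving a single prediction:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                          # step 2: prepared features
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # labeled "repayment history"

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier().fit(X_train, y_train)   # step 3: model development
print("F1:", f1_score(y_test, model.predict(X_test)))        # step 4: model evaluation

new_applicant = rng.normal(size=(1, 5))                      # steps 5-6: the deployed model
print("approved:", model.predict(new_applicant))             # serves real-time predictions
```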

5


DAVinCI LABS Workflow

As mentioned above, a typical machine learning workflow consists of complex and repetitive tasks, from data collection to data preparation, variable processing, model development, model evaluation, model deployment, and operation. DAVinCI LABS automates the entire process from data preprocessing to model deployment, improving practitioners' work efficiency and minimizing the potential errors of manual work. In addition, whereas high-level expertise used to be required to use machine learning, with DAVinCI LABS anyone in the field can easily proceed from model development to operation.


However, although AI can make judgments based on data, it is up to humans to interpret and evaluate the results of those judgments.

3

Predictive Model Using DAVinCI LABS

1


Problem Definition

"The key to developing a machine learning prediction model

lies in the design of clear objectives."  



In defining a supervised learning-based problem, the user will mainly consider two cases: classification and regression.

Defining the problem and setting the goal are both the most difficult and the most important challenges. To create a meaningful model, the problem definition stage must clearly define which data to use, how to train on it, and what kind of result should be produced. This step is always emphasized because the dataset that must be prepared for the next step differs depending on the problem you want to solve. For example, what you are trying to predict might be whether a new customer's loan application will be approved, or it might be a price. In this guidebook, we will create a model that predicts churn, using the customer churn data provided by Kaggle as an example.

DAVinCI LABS can also solve binary classification problems by regression. When a binary classification problem is solved by regression, the prediction result can be read as a probability. Let's model the churn problem below both as regression and as classification and compare the results.
2


Data Preparation

Collecting and preparing data takes up most of the time in a machine learning project. You need to prepare and train on "quality" data to get good-quality, reliable results.


In DAVinCI LABS, you can consider the following preparation steps:

Join data collected from multiple sources

In the financial sector, for example, customer information data and contract data are merged to gather the necessary information. If the two datasets exist separately, within DAVinCI LABS you can create a new dataset by merging two or more datasets using one of four methods (INNER, LEFT, RIGHT, OUTER), setting a sample identification variable (e.g., ID) as the key.
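As an illustration outside DAVinCI LABS, the same join logic in pandas, with hypothetical customer and contract tables keyed on ID:

```python
import pandas as pd

customers = pd.DataFrame({"ID": [1, 2, 3], "Age": [34, 51, 45]})
contracts = pd.DataFrame({"ID": [1, 2, 4], "Balance": [1000.0, 250.0, 80.0]})

# INNER keeps only IDs present in both sets; LEFT, RIGHT, and OUTER
# behave as the same-named SQL joins.
merged = customers.merge(contracts, on="ID", how="inner")
print(merged)
```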

Data Set Split 

For machine learning training, data is mainly divided into training, validation, and test sets. When data is added periodically, training on the older data and validating on the latest data lets the model adapt to changes and better reflect actual prediction conditions. The test set is used only to evaluate the final performance of the trained model after all training is complete; it is uploaded separately, typically sized according to a 7:3 or 8:2 split of the data. The training and validation sets are split by the user in a 7:3 or 8:2 ratio on the modeling page in DAVinCI LABS. (If you have enough training data, a 9:1 split is also an option.)
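A minimal sketch of the same split scheme with scikit-learn, using placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)                 # placeholder features
y = np.random.default_rng(0).integers(0, 2, 100)   # placeholder target

# Carve out the held-out test set first (here 8:2), then split the
# remainder into training and validation sets (here 7:3).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.3, random_state=0)
print(len(X_train), len(X_val), len(X_test))       # 56, 24, 20
```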

Data Exploration    

This process lets you intuitively look at the collected data from various angles in order to understand it. Unprocessed data is by no means optimized for human understanding. DAVinCI LABS lets you explore data quickly and repeatedly, providing visual feedback through data visualization, statistics, and more.



For example, on the Dataset page of the churn data:

First, the target-average graph of the variable representing regions shows that Germany has a significantly higher churn rate (approximately 20%) than the other regions, even though its number of samples is relatively smaller than that of France.

Second, the target-average graph of the variable indicating age shows that the churn rate increases rapidly from age 45, exceeding the average churn rate by age 65.


Other things to consider:


1

Delete variable   

If the provided statistics show that most of a variable's values are missing, and domain knowledge indicates the variable has no significant impact on predicting the target, delete it.


2

Ignore variable   

Personal identification variables such as CustomerID, RowNumber, and Surname are excluded before modeling, because the goal is to predict for customers with given characteristics rather than for specific individual customers.


3

Correlation analysis   

Correlation analysis also lets you check the relationship between each variable and the target, to help decide whether to use the variable.


4

Change type of variable 

Change the variables that require a type change before modeling. A tip for judging whether to change a variable's type: variables whose values are not comparable as quantities should become categories. Manually change the card ownership (HasCrCard) and customer activity (IsActiveMember) variables, which the churn data records as numeric types, to categorical types (see the sketch below). Contrary to this example, changing a categorical type to a numeric type is the more common case.
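For reference, the equivalent manual type change in pandas on the two churn columns named above (the tiny frame here is invented):

```python
import pandas as pd

df = pd.DataFrame({"HasCrCard": [1, 0, 1],
                   "IsActiveMember": [0, 1, 1],
                   "Age": [34, 51, 45]})

# 0/1 flags are labels, not comparable quantities: treat them as categories.
df["HasCrCard"] = df["HasCrCard"].astype("category")
df["IsActiveMember"] = df["IsActiveMember"].astype("category")
print(df.dtypes)
```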


Target Definition     

The target is the variable you want to predict, in this example "Exited", where 0 means retained and 1 means churned. If "Exited" is set as a numeric type before target selection, the problem is recognized as regression and the target is displayed in red after selection (refer to the figure). Conversely, if "Exited" is set as a categorical type, it is displayed in blue.

Pre-processing      

In data analysis, pre-processing is the most important step and the way to create high-quality data. Pre-processing, which has a significant impact on final prediction performance, includes missing-value handling, categorical variable encoding, data scaling, and more, and it is performed automatically before modeling in DAVinCI LABS.
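Outside DAVinCI LABS, those same three steps (imputation, encoding, scaling) can be sketched with scikit-learn. The column names come from the Kaggle churn data; the tiny frame is invented:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"Age": [34, None, 45],
                   "Balance": [1000.0, 250.0, None],
                   "Geography": ["France", "Germany", None]})

preprocess = ColumnTransformer([
    # numeric columns: fill missing values, then scale
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["Age", "Balance"]),
    # categorical columns: fill missing values, then one-hot encode
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), ["Geography"]),
])
print(preprocess.fit_transform(df))
```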

3


Variable Processing

If the data is ready, proceed with baseline modeling. For more advanced modeling, however, there are two main variable-processing options. One is generating meaningful derived variables to surface the latent value of the data, and the other is the feature selection function, which automatically selects the variables required for modeling.

Feature Engineering      

Feature engineering not only helps you find meaningful derived variables, it can also improve the performance of your model. In DAVinCI LABS, users can create derived variables in two ways: the first automatically searches combinations of linear combinations, multiplications, and divisions between variables with a single click, and the second is to enter formulas manually.
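A hand-rolled pandas sketch of the same idea, deriving division and multiplication combinations from churn-data columns (the values and the derived column names are invented):

```python
import pandas as pd

df = pd.DataFrame({"Balance": [1000.0, 250.0, 0.0],
                   "EstimatedSalary": [50_000.0, 40_000.0, 60_000.0],
                   "NumOfProducts": [1, 2, 3]})

# Division and multiplication between variables, of the kind an
# automated search would try.
df["BalancePerSalary"] = df["Balance"] / df["EstimatedSalary"]
df["BalanceXProducts"] = df["Balance"] * df["NumOfProducts"]
print(df)
```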

Feature Selection

When dealing with real data, it is common to see hundreds to thousands of variables. The model then becomes heavy and training takes long. An efficient way to reduce model complexity and improve explanatory power is to select only the relevant variables via feature selection. Using all variables to maximize performance is not always best; you should also consider how quickly the model can be updated and utilized when the performance difference is small. If the data contains a lot of noise, reducing the number of variables often yields better performance.


For example, in the figure below, with 149 variables the Gini value is 0.638 and the MSE is 0.865, while with 119 variables the Gini value is 0.668 and the MSE is 0.860. In this case, you can save modeling time and get better performance by using the dataset consisting of 119 variables.
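As a rough open-source analogue of this step (not DAVinCI LABS's own method), scikit-learn's SelectFromModel keeps only variables above average importance; the 149-variable setup here is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 149 variables, only 20 of which are informative.
X, y = make_classification(n_samples=500, n_features=149,
                           n_informative=20, random_state=0)

selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X, y)
X_reduced = selector.transform(X)    # keeps variables above mean importance
print(X.shape[1], "->", X_reduced.shape[1])
```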

4


Modeling

If you have prepared the data to be trained through feature engineering or feature selection, proceed to the next step: modeling. The purpose of modeling is to find the optimal model that thoroughly reflects the data through training, and finding the optimal model means maximizing its accuracy. In the past, it was common for expert data scientists to repeatedly adjust hyperparameters to find the optimal model. In contrast, DAVinCI LABS provides Auto Modeling, which automates the repetitive hyperparameter tuning task, runs algorithms from the linear, tree, and neural network families, and presents the resulting models in order of performance.

Cross-validation

This step measures how well the model predicts new data (data not used for training) before the final performance is verified; in other words, it validates on the training set. When training and evaluating with one fixed training set and test set, the model can overfit that split and mispredict new data, so cross-validation is performed to prevent this. Since cross-validation uses the entire dataset for both evaluation and training, it prevents data bias, yielding a more generalized model and improved accuracy. In DAVinCI LABS, users can set the related conditions (number of k-folds, data split) before modeling.
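A minimal k-fold cross-validation sketch with scikit-learn on synthetic data; the number of folds corresponds to the k-fold condition set in the DAVinCI LABS UI:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# 5-fold CV: every sample is used for both training and validation
# across the folds, so no single split dominates the estimate.
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=5, scoring="f1")
print(scores, scores.mean())
```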


If you have set the conditions for modeling, including the validation dataset, you can proceed with Auto Modeling and check the results. For this churn data set, the Random Forest algorithm received the most stars based on Gini performance; on simple data, a model trained with a single algorithm can perform better than an ensemble model that combines multiple algorithms. It would be ideal to test all the models, but time is also a cost, so in general the top 5 models are selected and tested. If you value explainability or usability over performance, you can choose a single algorithm rather than an ensemble.

5


Model Evaluation

At the end of modeling, an evaluation of the model determines which model to select as the final model, i.e., whether to use it or not.

Evaluation metrics are essential in determining whether a model can be put into operation. For example, how much of its receivables would a bank need to collect from individuals for collection to be considered successful? How many customers must purchase the product through a marketing campaign? How well can churners be identified? Indicators that quantitatively represent such criteria differ between regression and classification problems, as shown in the figure.

The figure above shows the result of a model that classifies churning customers based on Gini; with no significant difference in performance across the training, validation, and test sets, it can be considered a stable model.

For classification problems, the performance of binary classification is evaluated on two criteria: simply put, how well you find what you want and how well you filter out what you don't want. DAVinCI LABS visualizes these two aspects as a 2×2 confusion matrix, and the user can select Accuracy or F1 score, derived from that matrix, as the evaluation criterion (other criteria such as precision and recall also exist). In general, if the target's composition ratio is very small, i.e., the classes are imbalanced, the F1 score, which is the harmonic mean of precision and recall, should be selected rather than Accuracy. For example, in fraud detection, if there are 2 cases of fraud in 1,000 transactions, Accuracy shows a high performance of 99.8% even if every transaction is predicted as normal. The figure below shows the result of a model that classifies churning customers based on the F1 score; with no significant difference in performance across the training, validation, and test sets, it can be considered a stable model.
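To make the imbalance point concrete, here is a minimal sketch with scikit-learn's metrics on the 2-in-1,000 fraud example quoted above:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 2 + [0] * 998   # 2 fraud cases in 1,000 transactions
y_pred = [0] * 1000            # a "model" that predicts everything as normal

print(accuracy_score(y_true, y_pred))             # 0.998: looks excellent
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0: catches no fraud at all
```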

One of the biggest concerns when using machine learning is overfitting. An overfitted model fits the training data well but generalizes poorly, so its prediction performance on new data is relatively poor and the model becomes meaningless. As an extreme example, if a classification model memorizes the full name and history of every churned customer, it predicts the training data perfectly but is of no use at all for new customers. Underfitting, in contrast, is when the model is too simple to properly reflect the characteristics of the data.

Various methods exist to prevent these problems, and one of them is to find a more stable model through the cross-validation method described above. In the churn data example, the training, validation, and test sets of the champion model, Random Forest, score 0.57, 0.59, and 0.57 on the F1 score, respectively; with no significant difference in performance, it can be considered a stable model.

6


Result Interpretation

Machine learning's explanatory power lags behind its excellent predictive power, so it is easy to run into difficulties in decision-making and communication within the company. DAVinCI LABS provides an explanation function to improve explanatory power, as well as sorting and presenting the important variables for each model. Below is an image showing the explanatory power of the random forest, the champion model for the example data set, when the problem is solved as regression; each variable's positive or negative impact on the target is visualized on a scale. As the figure shows, the number of products and age are the variables with the most impact on whether a customer churns.

You can also check in more detail how much each variable impacts the target prediction through the per-customer (per-sample) prediction explanation. For the first customer (sample), the predicted value is 0.16, closer to 0 than to 1; for the second customer (sample), it is 0.88, closer to 1, indicating that the second customer is much more likely to churn than the first. You can then check, through the explanation and its scores, which variables have the most impact on the predicted churn rate for these two customers.
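The explanation function itself is DAVinCI LABS's own. As a rough open-source analogue, permutation importance in scikit-learn ranks variables by how much shuffling each one hurts the model, sketched here on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Shuffling one variable at a time shows how much each contributes
# to the prediction; larger drops mean more impactful variables.
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f}")
```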

7


Model Deployment and Operation

When model training and evaluation are completed, the results, downloadable in Excel format, can be used in practice. For example, upload new customer data as a test dataset to get prediction results from the pre-trained model. Furthermore, you can deploy the model to the operations team to operate and utilize it in real time. This cumbersome task can be carried out easily in two ways within DAVinCI LABS. The first is to download the prediction function and deliver it to the operations team: the prediction function, in the form of a Java library, can easily be integrated into the local system, and once applied, the model can be updated just by replacing the model file. The second is to use the prediction server: an evaluated model can be deployed through a one-click server connection within DAVinCI LABS.
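The Java prediction function and the prediction server are DAVinCI LABS-specific mechanisms. As a generic analogue of the "model as a replaceable file" idea, here is a joblib sketch with an invented file name and a stand-in model:

```python
import joblib
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])   # stand-in trained model

joblib.dump(model, "churn_model.joblib")     # artifact handed to the operations side
loaded = joblib.load("churn_model.joblib")   # updating the model = replacing this file
print(loaded.predict([[0.8]]))
```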

4

Rule Optimization

If supervised learning is for prediction, the rule optimization function automatically or manually finds patterns (clusters) with respect to the target you want to predict. Meaningful clusters that have a significantly positive or negative impact relative to the target mean are visualized and displayed together with statistical values, and new attributes (variables) can be added to automatically found clusters to check immediately how the result changes.


In the result of running rule optimization on the churn data example, when an age variable is added to a cluster formed on the product-count variable, the range of the newly optimized variable is shown, along with the total count and target value of the cluster. It is interpreted as: the 111 customers who are over 44.5 and under 63.5 years of age and have more than 2.5 products are nearly four times as likely (391%) to churn as customers overall.
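As a toy illustration of how such a cluster's lift can be read, here is a pandas sketch applying the rule boundaries quoted above to invented data (the 391% figure comes from the actual example, not from this toy):

```python
import pandas as pd

df = pd.DataFrame({"Age": [50, 30, 60, 45, 25],
                   "NumOfProducts": [3, 1, 3, 4, 1],
                   "Exited": [1, 0, 1, 1, 0]})   # toy stand-in for the churn data

# The cluster rule: 44.5 < Age < 63.5 and more than 2.5 products.
rule = (df["Age"] > 44.5) & (df["Age"] < 63.5) & (df["NumOfProducts"] > 2.5)

# Lift: churn rate inside the cluster relative to the overall churn rate.
lift = df.loc[rule, "Exited"].mean() / df["Exited"].mean()
print(f"cluster size: {rule.sum()}, lift: {lift:.0%}")
```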



Manual Combined Variable

Variables grouped through the rule optimization module can also be used for variable processing during supervised learning in DAVinCI LABS.



For example,

5

Summary

Machine learning predictive models can be used in most industries that utilize tabular (structured) data.

DAVinCI LABS has successfully improved models in many fields: insurance underwriting review, new and renewal review for personal credit loans, credit card issuance and renewal review, marketing (account opening, etc.), and more. DAVinCI LABS uncovers hidden value in data, improving predictive performance and development speed to deliver optimized results.