
Pitney Bowes Detects Fraudulent Orders through Machine Learning Predictive Models (Part 2)

루이
6 Jun 2022


In the previous post we briefly reviewed how Pitney Bowes detects fraudulent foreign orders. This time we will experiment with DAVinCI LABS to see which algorithms are used and walk through the detailed procedure.

Using the same data that was used for Google AutoML Tables, let’s reproduce the fraud detection system with Ailys’s AutoML solution, DAVinCI LABS!

The model implementation process in DAVinCI LABS consists of three broad steps: Data Upload, Model Generation, and Test.




1.   Data Upload and Target Feature Selection

Remember to split the data into train and test sets at a ratio of 8:2!

The two files holding the ‘Transaction’ and ‘Identity’ information have been combined into a single dataset. We then split this combined file into two parts: 80% for training and 20% for testing.
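
For readers who want to see the idea outside of the tool, here is a minimal sketch of the combine-then-split step in plain Python. It is not how DAVinCI LABS or the original data preparation works internally. The join key `TransactionID`, the other column names, and the toy values are all assumptions made for illustration.

```python
import random

def merge_on_key(transactions, identities, key="TransactionID"):
    """Left-join identity rows onto transaction rows by a shared key."""
    identity_by_key = {row[key]: row for row in identities}
    merged = []
    for row in transactions:
        extra = identity_by_key.get(row[key], {})
        # Transaction values win if both rows carry the same column.
        merged.append({**extra, **row})
    return merged

def split_train_test(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and cut it at the requested fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy stand-ins for the two source files (all values are made up).
transactions = [
    {"TransactionID": i, "TransactionAmt": 10.0 * i, "isFraud": int(i % 5 == 0)}
    for i in range(1, 11)
]
identities = [{"TransactionID": i, "DeviceType": "mobile"} for i in (2, 4, 6)]

merged = merge_on_key(transactions, identities)
train_rows, test_rows = split_train_test(merged)
print(len(train_rows), len(test_rows))  # 8 2
```

Fixing the seed keeps the split reproducible, so the same rows land in the test set every time the script runs.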


※ The ‘train:test = 8:2’ ratio was mentioned previously in Lending Club (Part 2). The next segment will go into further detail about this proportion.


The train:test = 80:20 split is complete!


Now these two CSV files are uploaded to DAVinCI LABS. This is the first step in fraud detection.


We use the ‘preview’ option to check that the right data has been selected, and then create a new project.


DAVinCI LABS is designed so that, given good-quality data, the target value can be obtained quickly and easily. As soon as you attach the data file and click the button to add a new project, DAVinCI LABS leads you to the next step.



Now you choose the target field (target variable). The variable labeled ‘isFraud’ is the one whose value we want to predict. The number of input fields (input variables) is 433! That is A LOT. Too many, in fact.

If the input variables are the x values, then the variable to be predicted is the y value, right? We choose ‘isFraud’ as the target field because our goal is to determine whether an order is fraudulent or normal.


▶▶Set the discard field (ignored variable)


The transaction ID carries no predictive information, so we set it to ‘ignore’.


The transaction ID is only an identifier, so we remove it from the list of variables to consider.
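
Outside of the tool, ignoring a field amounts to leaving that column out before training. The snippet below is a small sketch of that idea; the column names and values are hypothetical, and it does not reflect how DAVinCI LABS handles ignored fields internally.

```python
def drop_ignored(rows, ignored_fields):
    """Return copies of the rows without the ignored fields."""
    return [
        {name: value for name, value in row.items() if name not in ignored_fields}
        for row in rows
    ]

# Made-up example orders; only the identifier column is dropped.
orders = [
    {"TransactionID": 1001, "TransactionAmt": 59.0, "isFraud": 0},
    {"TransactionID": 1002, "TransactionAmt": 870.5, "isFraud": 1},
]
model_rows = drop_ignored(orders, {"TransactionID"})
print(sorted(model_rows[0]))  # ['TransactionAmt', 'isFraud']
```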



▶▶Check the correlation chart!


A correlation chart shows the relationships between variables as numbers.


DAVinCI LABS provides a ‘correlation chart’ as a reference for understanding the relationships between variables. Revealing these correlations helps you decide how to use the data further.

You can spot a vivid red rectangle in the top left of the diagram. C1o, C4, C12 and C10, C8, C4, C7, C12 are all almost perfectly positively correlated with one another. This means they move in the same direction: when one rises, the others tend to rise too.
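
To make the numbers behind such a chart concrete, here is a minimal sketch that computes Pearson correlation coefficients for a few toy columns. Whether DAVinCI LABS uses the Pearson coefficient for its chart is an assumption, and the column values below are invented (the `C4` column is deliberately built as twice `C1`).

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented columns: C4 is exactly 2 * C1, "amount" is unrelated noise.
columns = {
    "C1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "C4": [2.0, 4.0, 6.0, 8.0, 10.0],
    "amount": [5.0, 1.0, 4.0, 2.0, 3.0],
}

names = list(columns)
matrix = {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

for a in names:
    print(a, [round(matrix[a][b], 2) for b in names])
```

A coefficient near 1 corresponds to the red cells described above, a value near 0 means no linear relationship, and a value near -1 means the two columns move in opposite directions.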


Now that we have completed data upload and the creation of a new project, shall we dive into the training process?



2. Model Generation

With just a few clicks you get a model right away


DAVinCI LABS offers 13 analysis algorithms this time. We’ll try all thirteen and see which one is the most appropriate.



The thirteen algorithms listed above are tried out simultaneously to see which one shows the highest accuracy (an auto-tuning option is also provided, but the default settings are fine).
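
The "try everything and keep the best" idea can be sketched in a few lines. The candidates below are hypothetical hand-written rules standing in for trained models, and the validation rows are invented, so this only illustrates the selection loop, not the thirteen algorithms or the scoring that DAVinCI LABS actually uses.

```python
def accuracy(predict, rows):
    """Share of rows where the prediction matches the label."""
    hits = sum(1 for features, label in rows if predict(features) == label)
    return hits / len(rows)

# Hypothetical stand-ins for trained models: each maps features to 0 or 1.
candidates = {
    "always_normal": lambda f: 0,
    "amount_rule": lambda f: int(f["amount"] > 500),
    "night_rule": lambda f: int(f["hour"] < 6),
}

# Invented validation rows: (features, isFraud label).
validation = [
    ({"amount": 900, "hour": 3}, 1),
    ({"amount": 40, "hour": 14}, 0),
    ({"amount": 700, "hour": 15}, 1),
    ({"amount": 20, "hour": 2}, 0),
    ({"amount": 60, "hour": 11}, 0),
]

scores = {name: accuracy(model, validation) for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(best_name, scores[best_name])  # amount_rule 1.0
```

Note that the do-nothing candidate still scores 0.6 here, which is a reminder that accuracy alone can flatter a model when fraudulent orders are rare.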




Data options and data output options are also available, but to keep things simple we’ll proceed with the defaults.


You can see the train set and the validation set divided at an 80:20 ratio, right?

Now comes the data training procedure. 




It takes a little more time because of the much larger amount of data and the larger number of algorithms. Get to work, DAVinCI LABS!


3.   Test

Let’s test the recommended model with a few clicks




DAVinCI LABS recommends ‘Gradient Boosting Machine’ as the most appropriate model!


After training, DAVinCI LABS recommends ‘Gradient Boosting Machine’ as the best of the thirteen algorithms. The next recommendations are the linear model, ridge regression, and logistic regression. We’ll choose ‘Gradient Boosting Machine’ for testing the model.

One thing to keep in mind is to test the model with the data set aside as ‘test data’, not the training data we have already used! The previous post explained why we split the data into training and test sets. There is also the potentially confusing concept of a ‘validation set’, which will be covered in the next post.




The data used so far was purely for training, so now we need a new set of data! We click the box marked in red and upload test.csv.



New dataset upload complete! You can see that it’s the test data, not the training data.

Now that Pitney Bowes’s test data has been attached, DAVinCI LABS runs a test with this new set of data. How will the results turn out?



It seems that the training and test results aren’t identical. This happens when the composition ratio of the same categorical field differs between the training data and the new data. Because the rows were randomly extracted from one file at an 8:2 ratio, the two sets are not guaranteed to have the same distribution. The difference is not a serious problem here, so we can move on.
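
If you want to check such a difference yourself, one simple approach is to compare the share of each category in the two sets. The sketch below assumes a hypothetical categorical field with invented values; it is not the comparison DAVinCI LABS performs.

```python
from collections import Counter

def category_shares(values):
    """Fraction of rows that fall into each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}

def share_gaps(train_values, test_values):
    """Test share minus train share for every category seen in either set."""
    train_shares = category_shares(train_values)
    test_shares = category_shares(test_values)
    categories = sorted(set(train_shares) | set(test_shares))
    return {c: test_shares.get(c, 0.0) - train_shares.get(c, 0.0) for c in categories}

# Invented values for a hypothetical card-type field after an 8:2 random split.
train_card_type = ["credit"] * 6 + ["debit"] * 2
test_card_type = ["credit"] * 1 + ["debit"] * 1

gaps = share_gaps(train_card_type, test_card_type)
for category, gap in gaps.items():
    print(f"{category}: {gap:+.2f}")
# credit: -0.25
# debit: +0.25
```

With small test sets, gaps like these are expected from random sampling alone; a stratified split is one common way to reduce them.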




How convenient to get a chart comparing the statistics of the new data with those of the initial data!


After the test, how has the target field turned out? The predicted values of ‘Is_Fraud’ are listed below.


Number five calls for a fraud alert!


Can you spot the 5th result, where the predicted fraud value is 1? In this case, the service team will immediately look into this particular order to find out what is behind it. This is precisely how the predictive model detects a fraudulent order.


The fraudster weeps after his plan fails…
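
Turning predictions into alerts can be as simple as picking out the orders whose score reaches a cut-off. The scores and the 0.5 threshold below are assumptions chosen for illustration, not values taken from the DAVinCI LABS output.

```python
def flag_orders(scores, threshold=0.5):
    """Return 1-based positions of orders whose fraud score reaches the threshold."""
    return [
        position
        for position, score in enumerate(scores, start=1)
        if score >= threshold
    ]

# Made-up predicted fraud scores for six orders.
predicted_scores = [0.02, 0.10, 0.01, 0.07, 0.97, 0.03]

flagged = flag_orders(predicted_scores)
print(flagged)  # [5]
```

Lowering the threshold sends more orders to the service team for review, while raising it reduces false alarms at the risk of missing real fraud.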


Here we used Ailys’s machine learning solution, DAVinCI LABS, as an example of how Pitney Bowes could have fed its vast amount of data into the machine and retrieved the results it wanted. The output is very intuitive: isFraud = 0 means a normal order, and 1 means a fraudulent one.


As long as the input and output are clearly defined and the data quality is good, you should have no problem getting the values you want!




Wrapping up, how was it to see the implementation process of the DAVinCI LABS AutoML solution?

Even without understanding the analysis algorithms and their statistical concepts, and even without professional expertise, DAVinCI LABS lets you easily obtain (1) the correlations between variables through the correlation chart, (2) new predicted values, and (3) the best analysis algorithm. Of course, there is one condition: you must prepare high-quality data with clearly defined x and y values.


I hope this has helped you get closer to understanding AutoML and fraud detection through the Pitney Bowes case. The next segment will clarify the definitions of train, validation, and test data!