
‘Train, Validation, and Test’, Very Simply Explained

루이
6 Jun 2022


Fitness is a major trend across all generations.


 

 Feat. “Training!”


Training is part of mainstream culture now. A large portion of social media posts is taken up by the young and the old eagerly investing in building a fit physique. (In fact, I believe this movement stems from the urge to make even the slightest change whose result is guaranteed, in an extremely competitive, demanding society.)

I figure the importance of training isn't limited to the gym; it matters just as much in the machine learning world.

When first encountering the concepts of the ‘train’ and ‘test’ sets, it's easy to misunderstand what ‘train’ means (as a newbie in the machine learning world, I initially mistook the word ‘train’ for the vehicle…)

To complete a model, a machine must go through a training procedure and improve its performance along the way.

A train set and a test set are needed in this process, because evaluating the training is an essential routine.

So let's see what ‘train’ and ‘test’ mean through examples. We'll also examine how the validation process works, along with a key concept, ‘k-fold’.




Train set: Must-have data to start with

80% is for training!


As already mentioned in the posts about Lending Club and Pitney Bowes, data is divided into a train set and a test set for modeling. The default ratio is train:test = 80:20, with the training set taking up most of the data. It's common sense: we prepare for exams over a long period with textbooks and study materials, but the test itself happens in a matter of hours over just a few pages.

The train set gives the model a guide to certain relationships, showing it the form ‘a leads to b’. It's the equivalent of a textbook that explains basic concepts and rules, acting as the basis for model development.
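If you'd like to see what the 80:20 split looks like in code, here is a minimal sketch using scikit-learn's train_test_split. The data and variable names are placeholders for illustration, not part of this post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows of five 0/1 symptom columns and a 0/1 target
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
y = rng.integers(0, 2, size=100)

# test_size=0.2 keeps 20% of the rows aside for the final test (the 80:20 default)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```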



Here is an example drawn from the current COVID-19 situation.

You've gone out and come back home only to notice mild cold symptoms. You also feel pretty chilly, so you head straight to the clinic to get tested for COVID-19. The next day, the result brings the relieving news that you are ‘negative’ for the virus.

Symptoms of COVID-19 vary: stomachache, chills, fever, loss of taste, headache, cough, and many others, while some cases show no symptoms at all. Feeling sick doesn't directly translate to having COVID-19, and vice versa.

This is where the train dataset comes in: it captures the correlation between the physical symptoms and the virus.

Here's an example of COVID-19-related data. You are ‘B’.



COVID-19 symptoms dataset.csv



The values in this chart are only 0s and 1s. A value of 1 indicates that the subject shows the corresponding symptom.

The blue column indicates whether or not the subject has COVID-19. It's the ultimate piece of information we aim to predict.

In other words, the target feature is ‘is_Positive’: these are our y values.

If the subject has COVID-19, the value is 1 (Positive); if not, it's 0 (Negative).

Assuming this is the entire dataset, if the machine learns the result for every subject, there will be no data left for the test. It would basically be absorbing and memorizing the test questions along with their answers.

Therefore, we clearly need to split the dataset into two chunks and set aside the test portion before training begins.



Train Set.csv



80% of the set is used as train data, to study the inputs and outputs of four subjects, and the remaining 20% is saved separately for the final test. The machine forms its knowledge from these four subjects (A, B, C, and D) and leaves out E's data so that it can later guess on its own whether E tests positive for COVID-19.
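As a rough sketch, this split could be written out by hand like the snippet below. E's row and the labels for B, D, and E follow the story above, while the values for A and C are made-up placeholders.

```python
import pandas as pd

# Toy version of the symptoms table; A's and C's values are invented for illustration
data = pd.DataFrame(
    {
        "cough":         [1, 0, 1, 0, 1],
        "fever":         [1, 1, 0, 0, 1],
        "loss_of_taste": [0, 0, 1, 0, 1],
        "stomachache":   [0, 1, 0, 0, 1],
        "headache":      [1, 0, 0, 1, 1],
        "is_Positive":   [1, 0, 1, 0, 1],  # B and D negative, E positive
    },
    index=["A", "B", "C", "D", "E"],
)

train_set = data.loc[["A", "B", "C", "D"]]  # 80% of the subjects, used for learning
test_set = data.loc[["E"]]                  # 20% held back for the final test
```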


 

Test Set: Saved Data for Performance Evaluation

20% is for testing the machine




Test Set.csv



This small set is the remaining 20% of the data, soon to be used for the test.

E shows all the given symptoms: cough, fever, loss of taste, stomachache, and headache. The question posed by the test set is whether E would test positive for COVID-19.

The machine faces the problem with the is_Positive value left blank and has to answer with either 0 or 1. Its answer is then compared against the real one. (The test was way too easy, wasn't it? It's ‘is_Positive = 1’; E has every symptom the illness could produce.)
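Continuing the toy table sketched above (a hypothetical example, not this post's actual model), fitting a simple classifier on A-D and asking it about E might look like this:

```python
from sklearn.tree import DecisionTreeClassifier

# Separate the symptom columns from the is_Positive target
X_train = train_set.drop(columns=["is_Positive"])
y_train = train_set["is_Positive"]
X_test = test_set.drop(columns=["is_Positive"])

# A decision tree is just one possible choice of model here
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

print(model.predict(X_test))  # hopefully [1]: E shows every symptom
```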

Moving on, here is a question worth raising: what are the real chances of getting high marks on the SAT by relying entirely on the textbook?

Surely additional study with mock tests is needed to ensure that the student has fully understood the material. Given how rare it is for someone to achieve great results after studying only the textbook, mock tests are needed to track how well the student is doing and where improvements can be made.

This is where the ‘validation set’ makes its appearance.



Validation Set

Provides a chance to evaluate performance ahead of the actual test



This is a more polished procedure: it doesn't interfere with the training process itself.

The validation set bridges the gap between the train set and the test set. Rather than teaching the machine new things, it checks the learned content and evaluates the model's performance.

If the final big test awaits at the end of training without a single assessment along the way, you have no more problems to practice on unless an extra set of exam problems luckily appears out of nowhere. The validation set exists to prevent this issue and enrich the training procedure.



Ailys's #DAVinCI LABS automatically separates the development data from the validation data in an 8:2 ratio. There is no need to split the train data yourself.



Train and validation data separated in an 80:20 ratio @ DAVinCI LABS


Validation data and test data should never overlap, which is why the data for A, B, C, and D is split once more to keep them apart. Usually, the same 80:20 ratio is applied between train and validation data, but for convenience on this particular dataset, we'll use 60:20.


We will carve out a piece of validation data from this Train Set.csv



Validation Set.csv

Train Set(new).csv



The amount of train data is reduced for the moment, but through validation we are able to build a more refined model, which is ultimately the better outcome. This is called data segmentation, or a data split.
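In code, this second split is just another hold-out taken from the train portion. Continuing the toy sketch from earlier (where, following the post, D ends up as the validation subject):

```python
# Hold D out for validation; A, B, and C remain as the new, smaller train set,
# giving roughly train:validation:test = 60:20:20 of the whole table
validation_set = train_set.loc[["D"]]
train_set_new = train_set.drop(index="D")

# On a real dataset you would usually let train_test_split pick the rows at random:
# train_new, validation = train_test_split(train_set, test_size=0.25, random_state=42)
```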



K-fold Cross-Validation is a representative validation method


Fold a piece of paper K-1 times and you get K sections, which is where the name ‘K-fold’ comes from.

K-fold cross-validation is the most commonly used validation method.

It splits the data into ‘k’ chunks (folds), holds one fold out for evaluation while the model trains on the remaining folds, and repeats this so every fold takes a turn. This keeps the evaluation uniform across the data and makes for effective validation.

The DAVinCI LABS solution also uses K-fold cross-validation.



 

The K-fold setting is featured in a small box; the default number of folds is 3.


How k-fold works: the data is split into ‘k’ chunks and the model is trained k times. It's a method with balance and uniformity.

Source: https://vvnn.tistory.com/m/5


This example sets k to 5, so the data is split into five folds. One of the five folds is held out for testing while the rest are used for training. The process is repeated five times, with the held-out fold switching each time. In other words, every sample in the dataset gets tested exactly once.


Finally, the mean of the model's performance across the folds is calculated, giving the overall performance value.
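Here is a minimal sketch of what k=5 cross-validation looks like with scikit-learn's cross_val_score, using made-up placeholder data (the five-subject toy table is too small to cut into five folds):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder training data: 80 rows of five 0/1 symptom columns and a 0/1 target
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(80, 5))
y_train = rng.integers(0, 2, size=80)

model = DecisionTreeClassifier(random_state=0)

# cv=5 splits the training data into five folds; each fold takes one turn as the
# held-out evaluation fold while the model trains on the other four
scores = cross_val_score(model, X_train, y_train, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # the averaged, overall performance value
```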

There are two main benefits to this process.

First, it prevents overfitting, where the model is trained too closely to its training data.

If only the train data is used for studying and the next step jumps straight to the official test, the model is likely to know the train data well while being clueless on the actual test, causing errors.

After all, the train data is only part of the entire dataset, so knowing that portion too well can lower the model's overall accuracy and performance. It's like falling into the trap of focusing so much on the details that you miss the whole picture. Therefore, we add validation data to run an assessment before the test.

Second, you can estimate the true generalization performance of your model. It offers a preliminary step for verifying performance before jumping right into testing, and the model becomes more refined as a result. We already know that D's result in the validation set is 0 (negative), but we hide it so that we can measure the model's accuracy.



Validation Set.csv
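As a small continuation of the toy sketch from earlier (all variable names come from those sketches, not from this post), grading the model on the held-out validation row might look like this:

```python
# Fit on the reduced train set (A, B, C) and grade the prediction for D;
# D's true label is only used to score the answer, never for learning
X_val = validation_set.drop(columns=["is_Positive"])
y_val = validation_set["is_Positive"]

model = DecisionTreeClassifier(random_state=0)
model.fit(train_set_new.drop(columns=["is_Positive"]), train_set_new["is_Positive"])

print(model.score(X_val, y_val))  # 1.0 if the model got D right, 0.0 otherwise
```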




The best scenario is undergoing the entire process of Train, Validation, and Test

It’s just that it’s slightly more tedious



 

Source: https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets


Both methods A and B are used for machine learning modeling. However, method B, which includes the validation process, is better suited to improving model performance.

Normally we would have to manually sort the dataset into train and test data. Assuming the entire dataset is worth 100, 80 is secured for training while 20 is kept for the test, and this task would have to be done by hand.

On top of this, a separate sorting step for validation would be even more exhausting. When you upload the train data, DAVinCI LABS automatically divides it 80:20 into development data, which is used purely for learning, and validation data. In other words, you only need to partition the train-test data once.

 


Once you upload the train data, DAVinCI LABS handles the rest of model training and validation for you.





Has this post helped with your understanding of the train, validation, and test sets?


Training is for pure learning,

Validation is for reviewing the learned content and checking performance,

Tests exist to assess the final capabilities of the developed and verified model.



Training and validation are often mixed up, so this time we focused on the basics of data splitting. I hope this helps you build a basic knowledge of machine learning.