What to Keep In Mind to Successfully Build a Good Machine Learning Model

22 Mar 2022

In the previous post, “Basics of Machine Learning”, we went over what machine learning means, what types of models it can produce, and what learning methods exist. In this post, we focus on supervised modeling and look at the points to keep in mind in order to develop a good model.

1.   Definition of model development

As we learned earlier, #DAVinCI LABS is a solution that automates and optimizes the development of machine learning models. In that case, would we get a high-performing prediction model simply by uploading any data to DAVinCI LABS?

To give the answer right away: not quite. More precisely, the answer is half right and half wrong. How so?

Major crisis: which data should we use for the best modeling?

To see why, we should look closely at the phrase ‘development of a machine learning model’. DAVinCI LABS can complete the training process by running algorithms on data, but two premises must hold before the result is truly “applicable for prediction”: first, the data must be prepared to suit its purpose; second, the model must be evaluated and chosen in a form that fits that purpose.

Ultimately, model development must be preceded by determining the purpose and laying the groundwork. DAVinCI LABS can automate the development itself, but setting the purpose and the target is left in the user’s hands. Even so, the automation is highly appealing, considering that conventional machine learning requires data experts to code the algorithms from start to finish.

2.  Process of model development

Target determination → Data preparation → Data feature-engineering → Optimization of algorithm → Model evaluation and interpretation → Model management

The development process of machine learning ranges from target determination to model management. DAVinCI LABS is involved in the steps from data feature engineering through model extraction and management.

In other words, target determination, data preparation, and management decisions rely on human judgment; they lie outside the domain of automation. AI can handle decisions based on the given data, but only humans can assess the results and shape them into business strategies.

This process resembles the relationship between a teacher and a student. The student can train by solving problems (data) and grasp the best way of solving them (algorithm), thus improving in skill (performance). However, it is entirely up to the teacher to set goals for what to teach, decide what the guiding principles should be, evaluate the student’s academic achievement, and so on. These are the roles that DAVinCI LABS users take charge of.


We should think of machine learning as a student who is brilliant at learning from data and making predictions.

In other words, we could confidently say the following:

The key to developing a predictive machine-learning model lies in the construction of a clear target

In this context, the first step to developing an excellent predictive model is to clearly set the ‘training objective’ and the ‘definition of the subject’. We are likely to end up with a poorly trained model unless it is told precisely what it should learn and what specific result it should produce.

Before we develop the model, we must clarify our subject. It could be customer churn or price prediction. The subject is directly tied to the user’s understanding of the domain and the task. Only with background knowledge drawn from business experience, together with an understanding of the data, can we work out which subjects the available data can actually support. Once the subject and the prediction target are clearly set, you can confidently say you have crossed the first line toward a great prediction model.

Worthy results derive from worthy data

After determining the subject, we should decide which data to put into DAVinCI LABS. Would we be able to develop a brilliant prediction model right away just by throwing in all the data we have? Of course, the answer is no. It would be like training a student in Korean, English, and Math when the student needs to focus on Math only.

There’s a well-known saying in machine learning: “Garbage in, garbage out”. It means that if you put in useless data, you will get only useless results. That’s precisely why sound decisions by the business user (the “teacher/director”) are required even when the data set is being put together.

3.  Model Performance Evaluation

After DAVinCI LABS trains on the given data and develops predictive models, we need to decide which model is appropriate and whether it aligns with the current objective.

DAVinCI LABS provides a suitable performance index according to the type of the given subject (target field). Performance is expressed as a number, but a higher or lower value on one index is not an absolute measure of how good the model is, because the appropriate index depends on the objective of the model. The user should therefore assess the model with the index recommended for the subject that was set at the start, so that the prediction goal is actually achieved.

Let’s take an example. Suppose DAVinCI LABS has developed models that predict customer churn, and two finished models show almost the same overall performance.

Confusion matrix of Model 1(left) & Model 2(right)

Above are the two models’ #ConfusionMatrix tables. A confusion matrix is a table that compares a model’s predicted values with the actual values, counting how many predictions were correct and how many were wrong.
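To make the idea concrete, here is a minimal sketch of how such a table could be tallied from pairs of actual and predicted labels. The function name, the label strings, and the toy data are all hypothetical and are not taken from DAVinCI LABS or from the two churn models discussed in this post.

```python
from collections import Counter


def confusion_matrix(actual, predicted, positive="churn"):
    """Count the four outcomes of a binary prediction.

    tp: churners predicted as churning    fn: churners predicted as staying
    fp: stayers predicted as churning     tn: stayers predicted as staying
    """
    counts = Counter()
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            counts["tp"] += 1
        elif a != positive and p == positive:
            counts["fp"] += 1
        elif a == positive and p != positive:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return {key: counts[key] for key in ("tn", "fp", "fn", "tp")}


# Hypothetical toy labels for six customers.
actual = ["stay", "stay", "churn", "churn", "stay", "churn"]
predicted = ["stay", "churn", "churn", "stay", "stay", "churn"]

matrix = confusion_matrix(actual, predicted)
print(matrix)  # {'tn': 2, 'fp': 1, 'fn': 1, 'tp': 2}
```

Each of the four counts corresponds to one cell of the table: two kinds of correct prediction and two kinds of mistake.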

Model 1.

-Retained customers predicted correctly (as staying): 1437

-Retained customers predicted incorrectly (as churning): 147

-Churning customers predicted incorrectly (as staying): 135

-Churning customers predicted correctly (as churning): 250


Model 2.

-Retained customers predicted correctly (as staying): 1545

-Retained customers predicted incorrectly (as churning): 39

-Churning customers predicted incorrectly (as staying): 207

-Churning customers predicted correctly (as churning): 178

If you were the one in charge, which model would you select? If your focus is customer retention and churn prevention, Model 1 on the left would be the apt choice: it identifies churning customers on a larger scale, correctly catching 250 of the 385 customers who actually churned, compared with 178 for Model 2.

But what if the priority is efficiency? Suppose the marketing budget is limited, so the campaign must focus on a small group of customers. Model 1 flagged 397 customers (147+250) as “alert customers (those likely to churn)”, of whom 250 actually churned: a precision, or marketing efficiency, of about 63.0% (250/397). Model 2, on the other hand, flagged 217 customers (39+178) as “alert customers” and reached a precision of about 82.0% (178/217). With a limited budget, Model 2 would be the suitable choice.

As this shows, model assessment must draw on the user’s business experience and the goal of the subject in order to arrive at a good prediction model. Soon we’ll cover the meaning of the various performance indexes and how to use them.
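The arithmetic behind this comparison can be reproduced with a short sketch. The counts come from the two confusion matrices above; the function name and the metric labels are our own and are not DAVinCI LABS output. Here, precision means the share of flagged customers who actually churn, and recall means the share of actual churners that the model catches.

```python
def churn_metrics(tn, fp, fn, tp):
    """Derive summary figures from the four confusion-matrix counts."""
    flagged = tp + fp          # customers the model marks as likely to churn
    actual_churners = tp + fn  # customers who really churned
    total = tn + fp + fn + tp
    return {
        "flagged": flagged,
        "precision": tp / flagged,
        "recall": tp / actual_churners,
        "accuracy": (tp + tn) / total,
    }


model_1 = churn_metrics(tn=1437, fp=147, fn=135, tp=250)
model_2 = churn_metrics(tn=1545, fp=39, fn=207, tp=178)

for name, m in (("Model 1", model_1), ("Model 2", model_2)):
    print(
        f"{name}: flagged={m['flagged']}, "
        f"precision={m['precision']:.1%}, "
        f"recall={m['recall']:.1%}, "
        f"accuracy={m['accuracy']:.1%}"
    )
```

Overall accuracy is close for the two models (roughly 86% and 88%), which is why a single number does not settle the choice. Model 1 catches more of the churners, while Model 2 wastes less of a limited budget on customers who would have stayed anyway.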

4.  Models age as well!

Now for the final step of model development: managing the completed model. It may seem puzzling that a model that has already been generated still needs the user’s decisions. Human intervention is needed because models and data grow old, just like us humans.


What has gotten into your mind, bringing ancient data in a monument?

Time passes equally for everyone, and that applies to machine learning models as well. As time passes during model development and operation, the characteristics and trends of the data drift away from what they were when the model was first developed. Some models maintain their initial prediction performance, but most gradually deteriorate over time.

Therefore, the business experts, in other words the users, should decide when to retrain in order to keep the model’s performance at its best. Retraining means training the model again on updated data, with recent data added to the existing training set.
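One simple way to support that decision is to track a performance index over time and flag the model once it has slipped too far from its starting level. The sketch below is only an illustration under assumed settings: the function name, the 0.05 tolerance, the three-period window, and the monthly figures are all hypothetical, and this is not how DAVinCI LABS decides on updates.

```python
def needs_retraining(baseline, recent_scores, max_drop=0.05, window=3):
    """Return True when the average of the latest `window` scores
    has fallen more than `max_drop` below the baseline score."""
    if len(recent_scores) < window:
        return False  # not enough history yet to judge
    latest = recent_scores[-window:]
    average = sum(latest) / window
    return baseline - average > max_drop


# Hypothetical monthly precision of a churn model after deployment.
baseline_precision = 0.82
monthly_precision = [0.82, 0.81, 0.80, 0.78, 0.76, 0.74]

early = needs_retraining(baseline_precision, monthly_precision[:3])
later = needs_retraining(baseline_precision, monthly_precision)
print(early, later)  # False True
```

Averaging over a window rather than reacting to a single month keeps one noisy measurement from triggering a retrain. The right tolerance and window remain a business judgment, which is exactly the kind of decision left to the user.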


Model performance drops over time

Frequent updates refine the model and so the model performance shoots back up!

Fortunately, DAVinCI LABS contains a model update function that enables regular performance management. By setting the update cycle and time point, you receive update alerts so that you can train the model with new data.


That’s all for today. We’ve gone over the core points of developing a good model. We hope you look forward to our future posts, where we will discuss data feature engineering and algorithm optimization!