Predicting Customer Churn

14 Mar 2022

What is Kaggle?

Kaggle is a competition platform for predictive modeling and data analysis, founded in 2010 and acquired by Google in March 2017. Companies and organizations post data and problems on Kaggle, and data scientists from all over the world compete to build the best models for them.

The cost of acquiring a new customer depends on the industry, company and service, but it is usually around 5 to 25 times the cost of retaining an existing one. That sounds plausible once you consider the time and resources invested in searching for new users. So is acquiring new customers the only important task? Many companies today are as keen on keeping their existing customers as on bringing in new ones. According to research by Bain & Company, a 5% rise in customer retention leads to a 25% to 95% increase in profits.

What is customer churn rate?

Keeping existing customers is just as important as acquiring new ones. The common index used to check how well a company retains its customers is the ‘churn rate’. Churn rate quantifies the relationship between the customers and the company over a certain period. It is usually measured quarterly or annually, while companies that are especially sensitive to customer churn, or whose churn rates change quickly, check it monthly.

Renewals, repeat purchases and the process of developing loyal customers are important parts of a company’s marketing strategy. However, it’s hard to run marketing events or promotions frequently just to keep customers satisfied. It’s therefore crucial to identify the customers who are likely to stop doing business with the company, as well as those who are likely to turn to competitors once advantages in cost and service are removed (in other words, those who are highly likely to churn), and then apply the right marketing strategies to them. It’s similar to diagnosing a disease from its many symptoms and then curing it with the minimum amount of medicine.

Churn modeling starts right here: by analyzing the churn rate across different customer segments, you can figure out which types of customers are likely to leave. You can also go beyond understanding current customer behavior and anticipate how customers will behave next, that is, identify in advance who will leave and when. With this, companies can judge whether the ongoing marketing strategy is suitable, decide how to change their approach to customers, and act on those decisions.

Predicting the future is powerful and appealing. However, you should think carefully before completely trusting the predictions of a churn model. Churn rate is the ratio of customers who have churned to all customers within a set period. In other words, a certain amount of time must pass before the churn rate, and whether a given customer churned, can be known. This makes it a ‘lagging indicator’, and we’ll elaborate later on the points to watch out for when using it.
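The ratio described above can be written as a small function. This is only a sketch: the function name, its arguments and the quarterly figures are made up for illustration, and "all customers" is taken here to mean the customer count for the period being measured.

```python
def churn_rate(total_customers: int, churned_customers: int) -> float:
    """Ratio of customers who churned within a period to all customers in that period."""
    if total_customers <= 0:
        raise ValueError("total_customers must be positive")
    if not 0 <= churned_customers <= total_customers:
        raise ValueError("churned_customers must be between 0 and total_customers")
    return churned_customers / total_customers


# Hypothetical quarter: 1,200 customers, 90 of whom left during the quarter.
quarterly_rate = churn_rate(1200, 90)
print(f"Quarterly churn rate: {quarterly_rate:.1%}")
```

Note that the numerator is only known once the period has ended, which is exactly why the figure lags behind the behavior it describes.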

Bank customer churn modeling with DAVinCI LABS


The data contains information on bank customers and their churn status within a certain time period, with 14 variables and 10,000 samples in total. The ‘Target’, the variable to be predicted, is “Exited”: 0 means ‘retention’, while 1 means ‘churn’.

  • Numeric Variables (7): Age, Balance, CreditScore, EstimatedSalary, NumOfProducts, Tenure, RowNumber
  • Categorical Variables (6): Gender, Geography, HasCrCard, IsActiveMember, CustomerId, Exited
  • Text Variables (1): Surname


A. Excluded Variable

The core of the model training process is finding the intrinsic patterns within the data, that is, the tendencies and common patterns linking the ‘Target’ to the ‘Features’ (the rest of the variables). This also means that features carrying a different value for every sample should be excluded. Otherwise, the model would be trained on a 1:1 mapping between feature and target values, producing misleading results. Therefore, we exclude the ‘Unique Variables’ (CustomerId, RowNumber, Surname), which hold a value specific to each sample, and move on to modeling.
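One simple way to spot such variables is to compare the number of distinct values in a column with the number of rows. The sketch below is an illustration and not how DAVinCI LABS does it: the helper name, the threshold parameter and the four toy rows are all made up.

```python
def find_identifier_columns(rows, threshold=1.0):
    """Return the columns whose share of distinct values is at least `threshold`."""
    if not rows:
        return []
    n_rows = len(rows)
    flagged = []
    for column in rows[0]:
        distinct_values = {row[column] for row in rows}
        if len(distinct_values) / n_rows >= threshold:
            flagged.append(column)
    return flagged


# Toy rows with made-up values; only the column names come from the dataset.
rows = [
    {"RowNumber": 1, "CustomerId": 1001, "Surname": "Kim", "Geography": "France", "Exited": 1},
    {"RowNumber": 2, "CustomerId": 1002, "Surname": "Lee", "Geography": "Spain", "Exited": 0},
    {"RowNumber": 3, "CustomerId": 1003, "Surname": "Park", "Geography": "France", "Exited": 0},
    {"RowNumber": 4, "CustomerId": 1004, "Surname": "Choi", "Geography": "Germany", "Exited": 1},
]
print(find_identifier_columns(rows))
```

A continuous variable such as a balance can also be unique per row, so a check like this is best used to propose candidates that a person then reviews.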

B. Check data distribution

The target (Exited) rate equals 0.2037, letting us know that out of 10,000 customers, 2037 have churned in the following period.

Figure 1. Target histogram

The correlations among the ‘Features’ are almost non-existent. Among the numeric variables, ‘NumOfProducts’ and ‘Balance’ show the strongest correlation, but it amounts to only about -0.3.
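For readers who want to see what a pairwise correlation measures, here is a minimal sketch of the Pearson coefficient. It assumes the matrix in Figure 2 uses Pearson correlation for the numeric variables, which the post does not state, and the six customer values are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / sqrt(spread_x * spread_y)


# Made-up values for six customers, only to exercise the function.
num_of_products = [1, 1, 2, 2, 3, 1]
balance = [120000.0, 95000.0, 0.0, 30000.0, 0.0, 80000.0]
print(f"corr(NumOfProducts, Balance) = {pearson(num_of_products, balance):.2f}")
```

A value near 0 means no linear relationship, while values near -1 or 1 mean a strong one.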

Figure 2. Correlation matrix of numeric variables(left) & categorical variables(right)

‘Age’, the variable with the highest correlation with the ‘Target (Exited)’, shows that the 40~70 age group has an above-average churn rate. (The red dotted line indicates the overall average of the ‘Target’, and the red lines are the average of the ‘Target’ within each section. The histogram shows the count of the ‘Age’ variable per section.)

Figure 3. Target vs. Age graph


Before modeling, the whole dataset is randomly split into a train set and a validation set at a ratio of 8:2, and training is run with 5-fold cross-validation. The train and validation sets contain 8031 and 1969 samples, with target rates of 20.57% and 19.55% respectively, showing that the split is reasonably even.
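The split and the folds can be sketched in plain Python as follows. This is an illustration under assumptions: the function names and the seed are made up, the cut here is an exact 8:2 (the 8031/1969 sizes reported above suggest the tool splits slightly differently), and the 5 folds are assumed to be drawn from the train set.

```python
import random


def train_validation_split(n_samples, validation_share=0.2, seed=0):
    """Shuffle sample indices and cut them into (train, validation) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_validation = round(n_samples * validation_share)
    return indices[n_validation:], indices[:n_validation]


def k_fold_indices(indices, k=5):
    """Yield (fit, holdout) index lists; every index is held out exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, holdout in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, holdout


train, validation = train_validation_split(10_000)
print(len(train), len(validation))
for fold_number, (fit, holdout) in enumerate(k_fold_indices(train, k=5), start=1):
    print(f"fold {fold_number}: fit on {len(fit)}, evaluate on {len(holdout)}")
```

After splitting, comparing the target rate of each part, as done above, is a quick check that the churners are spread evenly.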

Figure 4. Data split setting

Modeling is run against two criteria, ‘Accuracy’ and ‘F1-score’, training 9 algorithms with parameter search. The picture below shows the modeling results under each criterion.

  • Model 1. Gradient Boosting Machine (F1-score : 0.6394)
  • Model 2. Random Forest (Accuracy : 0.8751)

Figure 5. Top 3 models trained by f1-score(left) & by accuracy(right)

There is no right or wrong in selecting the training criterion and the final model. The choice is mostly up to those in charge of the prediction model for the given industry and service, and it must be grounded in background knowledge of the business at hand. The following picture shows how to interpret the modeling results and make sound decisions.

Figure 6. Confusion matrix of Model 1(left) & Model 2(right)

The first picture shows the Confusion Matrix of Model 1 (GBM) and the second that of Model 2 (RF). A Confusion Matrix is a table that compares the model’s predictions with the actual results, each taking the value 0 or 1, and counts how often they agree or disagree.

Model 1.

- Predicts a retained customer (target=0) correctly (pred=0): 1437

- Predicts a retained customer (target=0) incorrectly (pred=1): 147

- Predicts a churning customer (target=1) incorrectly (pred=0): 135

- Predicts a churning customer (target=1) correctly (pred=1): 250

Model 2.

- Predicts a retained customer (target=0) correctly (pred=0): 1545

- Predicts a retained customer (target=0) incorrectly (pred=1): 39

- Predicts a churning customer (target=1) incorrectly (pred=0): 207

- Predicts a churning customer (target=1) correctly (pred=1): 178

If you were the one in charge, which model would you select? If the focus is on retaining customers and preventing churn, Model 1 (Gradient Boosting Machine) would be the apt choice, as it correctly identifies the largest number of churning customers (250 of the 385 who actually churned, against 178 for Model 2). However, to remind you once again, there is not always a set answer in these decision-making processes. For example, let’s assume the marketing budget is limited, so the promotion has to be as efficient as possible and target a rather small group of customers. Model 1 reached a precision of about 63%, with 250 actual churners among the 397 customers it flagged, while Model 2 reached a higher 82%, with 178 among 217. In this case, Model 2 (Random Forest) looks like the better fit for the promotion. As shown, model decisions should be made flexibly, taking into account the experience and circumstances of the people involved.
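The figures in this comparison can be reproduced from the confusion matrix counts listed above. The sketch below uses the standard definitions of accuracy, precision, recall and F1; the function and variable names are illustrative. The share of flagged customers who really churned (250 of 397, 178 of 217) is what is usually called precision.

```python
def classification_metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion matrix counts."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)  # of the customers flagged as churn, how many churned
    recall = tp / (tp + fn)     # of the customers who churned, how many were flagged
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the two confusion matrices above.
models = {
    "Model 1 (GBM)": dict(tn=1437, fp=147, fn=135, tp=250),
    "Model 2 (RF)": dict(tn=1545, fp=39, fn=207, tp=178),
}
for name, counts in models.items():
    metrics = classification_metrics(**counts)
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Running this gives an F1-score of 0.6394 for Model 1 and an accuracy of 0.8751 for Model 2, matching the scores reported earlier. It also shows the trade-off behind the decision: Model 1 has the higher recall (0.6494 against 0.4623), Model 2 the higher precision (0.8203 against 0.6297).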

Understanding what our model says

It’s crucial to understand how the completed model works and how it generates its predictions. This is even more critical in the financial industry, which assesses customers’ risk and likelihood of bankruptcy, for example in the form of churn modeling, and evaluates whether to accept or reject contracts. The algorithm most widely and longest used in financial assessment models is the ‘Regression Model (Linear, Logistic, GLM etc)’. The reason is that it gives clear information on each input variable (X_p) and its importance value (beta_p). Those who use the models, such as credit rating agencies, can draw on this information when making decisions based on the model’s predictions.
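To make the role of X_p and beta_p concrete, here is a minimal logistic regression sketch. The coefficient values and the intercept are hypothetical, chosen only for illustration, and are not fitted to the bank dataset; the function names are made up as well.

```python
from math import exp


def churn_probability(features, coefficients, intercept):
    """Logistic regression: sigmoid of the intercept plus the sum of beta_p * X_p."""
    linear_score = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + exp(-linear_score))


def contributions(features, coefficients):
    """Per-variable share of the linear score, beta_p * X_p."""
    return {name: coefficients[name] * value for name, value in features.items()}


# Hypothetical coefficients (beta_p) and intercept, not fitted values.
coefficients = {"Age": 0.07, "IsActiveMember": -1.0, "NumOfProducts": -0.1}
intercept = -4.0
customer = {"Age": 50, "IsActiveMember": 0, "NumOfProducts": 1}

print(f"P(churn) = {churn_probability(customer, coefficients, intercept):.3f}")
print(contributions(customer, coefficients))
```

Because each variable enters the score through a single coefficient, anyone reviewing the model can read off how much a variable pushes the prediction up or down.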

However, it’s a different story for machine learning (deep learning) models, which differ from traditional statistical models. (Of course, the regression model itself falls within the machine learning category, and a considerable number of machine learning models have been developed from it.) The difference lies in accuracy, which is the very reason people seek out machine learning models: they want to solve problems better and with higher accuracy. To that end, it is desirable to use not just 5 variables but 10, 100 or even more. Technological development has made this possible by allowing massive data to be accumulated and repetitive tasks to be automated at higher speed. As a result, prediction models have started to take in more data than was ever possible in the past and to search it for inner patterns. That is, models have become drastically more complex.

There are both pros and cons to these complex models. You gain higher prediction accuracy at the expense of explainability: it becomes hard to say why and how the model made a given prediction. It’s easy to see that the relationships among 100 variables are harder to untangle than those among 5. A model trained on such vast data solves problems beyond the limits of human cognition. Despite this, demand for machine learning keeps growing because the advantage in accuracy outweighs the drawbacks. Nevertheless, there are domains like finance, where credit scoring and assessments require predictions to be understood and explained. For this reason, a lot of effort goes into making models explainable so that these domains can adopt machine learning without too much difficulty.

(Personal comment by the writer: People sometimes don’t think a decision through and just proceed with it. That is intuition. Following one’s intuition doesn’t exactly mean skipping the process of contemplation; rather, it is an extremely fast analysis and inference carried out unconsciously, based on accumulated experience and knowledge. From that perspective, a prediction made by a machine learning model could be seen as the computer’s intuition.)


DAVinCI LABS’ clustering function differs from a typical clustering model. A general clustering model works on data that contains no target value, only explanatory variables, and groups the data by similarity (Unsupervised Learning; MECE, Mutually Exclusive and Collectively Exhaustive). DAVinCI LABS’ clustering function, however, includes the target in the data and identifies groups that show no specific correlation with the target value but still carry a certain importance (Supervised Learning; not MECE).

Cluster 1. (+220% compared to the target average, no. of samples: 631)

  • Geography variable: ‘Germany’
  • Age variable: 44.5 or above, under 66.5

Cluster 2. (-72.1% compared to the target average, no. of samples: 599)

  • Geography variable: ‘Germany’
  • NumOfProducts variable: 1.5 or above, under 2.5
  • Age variable: under 38.5

Figure 7. Cluster result

Cluster 1 shows the following: the ‘Geography’ variable, indicating location, is ‘Germany’ and the ‘Age’ variable ranges approximately from 45 to 66, giving a churn rate (0.65) about 220% higher than the average (0.2). Cluster 2 also has ‘Germany’ as its ‘Geography’ value, but the age range is significantly lower (under 38.5) and the customers hold 2 products; its churn rate (0.057) is about 72% lower than the average.
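The percentages attached to each cluster are relative differences from the average target rate. A minimal sketch of that arithmetic, with an illustrative function name, using the average of 0.2037 reported earlier and the rounded cluster rates quoted above:

```python
def relative_lift(segment_rate, average_rate):
    """Relative difference of a segment's churn rate from the overall average."""
    if average_rate <= 0:
        raise ValueError("average_rate must be positive")
    return (segment_rate - average_rate) / average_rate


average_churn_rate = 0.2037
for name, rate in [("Cluster 1", 0.65), ("Cluster 2", 0.057)]:
    print(f"{name}: {relative_lift(rate, average_churn_rate):+.0%}")
```

This prints +219% and -72%, in line with the reported +220% and -72.1%; the small gaps come from the cluster rates being rounded.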


A. Variable Importance

Through ‘Variable importance’, you can see which input variables the model relied on most during training. For Model 1 (GBM), the top variables are Age (25%), Balance (23%), and NumOfProducts (23%).

Figure 8. Variable importance result

B. Prediction Explanation

Through ‘Explanation’, you can check how much impact a certain variable has on the model’s prediction. Among the top 2000 samples ranked by prediction value (that is, the samples the model is most likely to classify as 1, a churning customer), the most impactful variables are ‘Age (57%)’, ‘Balance (31%)’, and ‘NumOfProducts (1%)’.

Figure 9. Prediction explanation of class 1

For more detail, let’s take a separate look at the ‘Age’ variable. Picture (a) shows the relationship between ‘Age’ and the model prediction. The prediction graph rises for roughly the 40 to 70 age group, meaning the model assigns these customers a particularly high probability of churning. Picture (a) also moves similarly to the graph of the target (actual value) against the Age variable (Figure 3) from the DATA INSIGHT part of this post. The more similar the two graphs are, the stronger the indication that the model has been trained well.

Through Picture (b) you can gauge the impact of the ‘Age’ variable on the model prediction, indirectly if not directly. The larger the deviation of the predictions across the Age range, the stronger the impact. (The impact is proportional to the area of the colored part, weighted by the age distribution. The green part indicates a high probability of churning, while the red part means the opposite.)
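One plausible reading of this description is a distribution-weighted average of how far the prediction in each value range sits from the overall average prediction. The sketch below implements that reading only; it is not DAVinCI LABS’ actual formula, and the function name, bin counts and mean predictions are all made up.

```python
def weighted_deviation_impact(bins):
    """bins: list of (sample_count, mean_prediction), one entry per value range.

    Returns the sample-weighted mean absolute deviation of each range's
    mean prediction from the overall mean prediction.
    """
    total = sum(count for count, _ in bins)
    if total == 0:
        raise ValueError("bins must contain at least one sample")
    overall = sum(count * prediction for count, prediction in bins) / total
    return sum(count / total * abs(prediction - overall) for count, prediction in bins)


# Made-up bins: predictions vary a lot across Age, hardly at all across Balance.
age_bins = [(400, 0.08), (350, 0.15), (200, 0.45), (50, 0.30)]
balance_bins = [(400, 0.19), (350, 0.21), (200, 0.20), (50, 0.22)]
print(f"Age impact:     {weighted_deviation_impact(age_bins):.3f}")
print(f"Balance impact: {weighted_deviation_impact(balance_bins):.3f}")
```

Under this reading, a variable whose prediction curve is flat across its range scores close to zero, while one with large swings in densely populated ranges scores high.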

Figure 10. a) Prediction vs. Age graph(left), b) Visualizing deviation of Age variable(right)

The pictures below show intuitively that the ‘Age’ variable has a strong impact on the model prediction. The prediction distribution (a) of ‘Age’ clearly separates the target classes (Retention=0, Churn=1). For comparison, look at the prediction distribution of ‘Balance’ (c): its distribution is almost identical across the classes. This indicates that ‘Age’ has a higher impact on the target prediction than ‘Balance’ does. Also, comparing the distribution by predicted values (a) with the distribution by actual values (b), you can see that they match almost perfectly. Such similarity implies that the model has been trained well, in line with the actual data distribution (the same principle applies to c and d).

Figure 11. a) Distribution of Age by prediction classes(top left), b) distribution of Age by gt classes(top right), c) distribution of Balance by prediction classes(bottom left), d) distribution of Balance by gt classes(bottom right)

Things to watch out for when using a churn model

‘Churn rate’ is a figure obtained by dividing the number of customers who churned within a set period by the total number of customers. It’s a ‘lagging indicator’, meaning that it can be retrieved and measured only after a phenomenon or trend has already played out. What exactly does this mean, and what should we keep in mind when using churn rates?

BUSINESS SIDE: An index is not just a number but a means of understanding past activities and future opportunities

Many companies consider ‘churn rate’ a mere number and suppose that there is a set optimum value. However, as the calculation method shows, the value can only be checked at least six months to one year after the customer churn has occurred. Even if churn rate were the only index measured for customer management and marketing strategy, it could not be reflected in the next model straight away, as it takes at least six months to become available.

Therefore, the questions managers in charge should ask themselves are: What is causing the rise in customer churn? Why are clients leaving the company? What measures should we take in customer management to prevent churn? Once we analyze and closely examine the index to gain insight, we will be able to come up with a list of things to do to improve it.

MODELING SIDE: Preventing look-ahead bias

The most common mistake in churn modeling is look-ahead bias. It’s a problem shared by all models used to predict future events, not just churn models. It occurs when the dataset includes variables that would not be available at the time the prediction is actually made. For example, let’s say a model predicting September’s customer churn is developed in July (both in 2018). If “Customer’s Monthly Average Expenditure” is included as a variable, we must take care not to calculate it over the customer’s entire lifetime. What we need is only the expenditure data available before the point of prediction. If we miss this, the model ends up predicting from what it already knows (“look-ahead”), and the resulting performance is inflated by a flaw introduced at the data collection stage. This is precisely why we have to watch for look-ahead bias during data collection; once it is dealt with, the completed model will perform at the actual point of prediction as it did in development.
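A common guard against this is to compute every feature with an explicit cutoff date. The sketch below illustrates the idea for the monthly average expenditure; the function name, the window boundaries and the transactions are hypothetical, and both dates are assumed to fall on the first day of a month.

```python
from datetime import date


def monthly_average_expenditure(transactions, window_start, cutoff):
    """Average monthly spend using only transactions in [window_start, cutoff)."""
    months = (cutoff.year - window_start.year) * 12 + (cutoff.month - window_start.month)
    if months <= 0:
        raise ValueError("cutoff must be at least one month after window_start")
    visible_total = sum(
        amount for day, amount in transactions if window_start <= day < cutoff
    )
    return visible_total / months


# Made-up transactions for one customer as (date, amount) pairs.
transactions = [
    (date(2018, 5, 10), 120.0),
    (date(2018, 6, 3), 80.0),
    (date(2018, 8, 21), 400.0),  # happens after the model is built; must not leak in
]

# The model is built in July 2018, so only data before 2018-07-01 may be used.
feature = monthly_average_expenditure(transactions, date(2018, 5, 1), date(2018, 7, 1))
print(feature)
```

Averaging over the customer’s whole history instead would pull the August purchase into the feature, which is information nobody had in July.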


  • The Value of Keeping the Right Customers (Harvard Business Review, 2014)
  • Prescription for cutting costs (BAIN & COMPANY, 2001)
  • Predicting Churn: How Data Can Help with Customer Retention (DataRobot, 2019)


  • Bank Customer Churn Modeling (Kaggle, 2018)
