A novel cost-sensitive framework for customer churn predictive modeling
Alejandro Correa Bahnsen^{1},
Djamila Aouada^{1} and
Björn Ottersten^{1}
Received: 26 February 2015
Accepted: 13 May 2015
Published: 12 June 2015
Abstract
Customer churn predictive modeling deals with predicting the probability of a customer defecting using historical, behavioral and socio-economical information. This tool is of great benefit to subscription-based companies, allowing them to maximize the results of retention campaigns. The problem of churn predictive modeling has been widely studied by the data mining and machine learning communities. It is usually tackled by using classification algorithms in order to learn the different patterns of both churners and non-churners. Nevertheless, current state-of-the-art classification algorithms are not well aligned with commercial goals, in the sense that the models fail to include the real financial costs and benefits during the training and evaluation phases. In the case of churn, evaluating a model based on a traditional measure such as accuracy or predictive power does not yield the best results when measured by the actual financial cost, i.e., the investment per subscriber in a loyalty campaign and the financial impact of failing to detect a real churner versus wrongly predicting a non-churner as a churner.
In this paper, we present a new cost-sensitive framework for customer churn predictive modeling. First, we propose a new financially based measure for evaluating the effectiveness of a churn campaign, taking into account the available portfolio of offers, their individual financial cost and the probability of offer acceptance depending on the customer profile. Then, using a real-world churn dataset, we compare different cost-insensitive and cost-sensitive classification algorithms and measure their effectiveness based on their predictive power and also on cost optimization. The results show that using a cost-sensitive approach yields an increase in cost savings of up to 26.4 %.
Background
The two main objectives of subscription-based companies are to acquire new subscribers and to retain those they already have, mainly because profits are directly linked to the number of subscribers. In order to maximize profit, companies must increase the customer base by incrementing sales while decreasing the number of churners. Furthermore, it is common knowledge that retaining a customer is about five times less expensive than acquiring a new one (Farris et al. 2010); this creates pressure to run better and more effective churn campaigns.
A typical churn campaign consists in identifying, from the current customer base, which customers are more likely to leave the company, and making them an offer in order to avoid that behavior.
The typical churn campaign process starts with the sales that increase the customer base every month; however, every month there is also a group of customers that decide to leave the company for many reasons. The objective of a churn model is then to identify those customers before they take the decision of defecting.
Using a churn model, the customers more likely to leave are predicted as churners, and an offer is made in order to retain them. However, it is known that not all customers will accept the offer: in the case when a customer is planning to defect, it is possible that the offer is not good enough to retain him, or that the reason for defecting cannot be influenced by an offer. Using historical information, it is estimated that a customer will accept the offer with probability γ. On the other hand, there is the case in which the churn model misclassifies a non-churner as a churner, also known as a false positive. In that case the customer will always accept the offer, which means an additional cost to the company, since those misclassified customers have no intention of leaving.
In the case where the churn model predicts customers as non-churners, there is also the possibility of misclassification: an actual churner is predicted as a non-churner. Since these customers do not receive an offer, they will leave the company; these cases are known as false negatives. Lastly, there is the case where the customers are actually non-churners; there is no need to make a retention offer to these customers, since they will continue to be part of the customer base.
It can be seen that a churn campaign (or churn model) has three main objectives. First, to avoid false positives, since there is a financial cost of making an offer where it is not needed. Second, for the true positives, to give the right offer, one that maximizes γ while maximizing the profit of the company. And lastly, to decrease the number of false negatives.
From a machine learning perspective, a churn model is a classification algorithm, in the sense that, using historical information, a prediction is made of which current customers are more likely to defect. This model is normally created using one of a number of well-established algorithms (logistic regression, neural networks, random forests, among others) (KhakAbi et al. 2010; Ngai et al. 2009). Then, the model is evaluated using measures such as the misclassification error, the receiver operating characteristic (ROC), the Kolmogorov–Smirnov (KS) statistic or the F1-Score (Verbeke et al. 2012). However, these measures assume that misclassification errors carry the same cost, which is not the case in churn modeling, since failing to identify a profitable or an unprofitable churner has significantly different financial costs (Glady et al. 2009).
In this paper we propose a new financially based measure for evaluating the effectiveness of a voluntary churn campaign, taking into account the available portfolio of offers, their individual financial cost and the probability of acceptance depending on the customer profile. Moreover, we compare state-of-the-art classification algorithms against recently proposed cost-sensitive algorithms such as Bayes minimum risk (Correa Bahnsen et al. 2014b), cost-sensitive logistic regression (Correa Bahnsen et al. 2014a), and cost-sensitive decision trees (Correa Bahnsen et al. 2015). Then, using a real-world churn dataset, we compare different cost-insensitive and cost-sensitive predictive analytics models, using the traditional and the proposed statistics. The results show that using a cost-sensitive approach results in an increase in profitability of up to 26.4 %. Furthermore, the source code used for the experiments is publicly available as part of the CostSensitiveClassification (Correa Bahnsen 2015) library.
The remainder of the paper is organized as follows: In the first section, we propose a new financially based measure for evaluating the effectiveness of a churn campaign. Then, we describe the different cost-insensitive and cost-sensitive predictive analytics models. Afterwards, the experimental setup is given; here the dataset and its partitioning are presented. Finally, the results and the conclusions of the paper are presented in the last two sections.
Evaluation of a churn campaign
Classification confusion matrix
                            Actual positive         Actual negative
                            y = 1                   y = 0
Predicted Positive (c = 1)  True Positive (TP)      False Positive (FP)
Predicted Negative (c = 0)  False Negative (FN)     True Negative (TN)
Accuracy = \(\frac {TP+TN}{TP+TN+FP+FN}\)

Recall = \(\frac {TP}{TP+FN}\)

Precision = \(\frac {TP}{TP+FP}\)

\(F_{1}Score = 2\frac {Precision \cdot Recall}{Precision + Recall}\)
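The four measures above can be computed directly from the confusion-matrix counts. A minimal illustration (the function name is ours, not from the paper's library):

```python
# Confusion-matrix based measures: accuracy, recall, precision and F1-Score.
def classification_measures(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # fraction of actual positives detected
    precision = tp / (tp + fp)     # fraction of predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1
```

For example, a model with 50 true positives, 850 true negatives and 50 errors of each kind achieves an accuracy of 0.9 but a recall, precision and F1-Score of only 0.5.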
However, these measures may not be the most appropriate evaluation criteria for a churn model, because they tacitly assume that misclassification errors carry the same cost, and similarly for the correctly classified examples. This assumption does not hold in many real-world applications such as churn modeling, since the financial losses of misidentifying a churner are quite different from those of misclassifying a non-churner as a churner (Glady et al. 2009). Furthermore, the accuracy measure also assumes that the class distribution among examples is constant and balanced (Provost et al. 1998), while the distributions of a churn data set are typically skewed (Verbeke et al. 2012).
When a customer is predicted to be a churner, an offer is made with the objective of avoiding the customer's defection. However, if a customer is actually a churner, he may or may not accept the offer, with a probability γ _{ i }. If the customer accepts the offer, the financial impact is equal to the cost of the offer (\(C_{o_{i}}\)) plus the administrative cost of contacting the customer (C _{ a }). On the other hand, if the customer declines the offer, the cost is the expected income that the client would otherwise generate, also called the customer lifetime value (C L V _{ i }), plus C _{ a }. Lastly, if the customer is not actually a churner, he will be happy to accept the offer and the cost will be \(C_{o_{i}}\) plus C _{ a }.
Proposed churn modeling example-dependent cost matrix
                                Actual positive (y _{ i } = 1)                                        Actual negative (y _{ i } = 0)
Predicted Positive (c _{ i }=1)  \(C_{{TP}_{i}}=\gamma _{i}C_{o_{i}}+(1-\gamma _{i})({CLV}_{i}+C_{a})\)  \(C_{{FP}_{i}}=C_{o_{i}}+C_{a}\)
Predicted Negative (c _{ i }=0)  \(C_{{FN}_{i}}={CLV}_{i}\)                                              \(C_{{TN}_{i}}=0\)
This is consistent with the notion that if no model is used, the total cost would be the sum of the customer lifetime values of the actual churners. This gives the insight that the Savings measure compares the financial impact of the campaign when using a classification model against not using a model at all.
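The total campaign cost implied by the cost matrix above, and the Savings relative to the no-model baseline, can be sketched as follows (variable names such as `gamma`, `clv` and `c_offer` are illustrative, not from the paper's code):

```python
import numpy as np

def campaign_cost(y, c, gamma, clv, c_offer, c_a):
    """Total example-dependent cost of a campaign.
    y: actual labels (1 = churner), c: predicted labels,
    gamma/clv/c_offer: per-customer arrays, c_a: administrative cost."""
    c_tp = gamma * c_offer + (1 - gamma) * (clv + c_a)  # offered, actual churner
    c_fp = c_offer + c_a                                # offered, non-churner
    c_fn = clv                                          # missed churner
    cost = c * (y * c_tp + (1 - y) * c_fp) + (1 - c) * y * c_fn
    return cost.sum()

def savings(y, c, gamma, clv, c_offer, c_a):
    # Baseline: no model, i.e. every customer predicted as a non-churner,
    # so the cost is the sum of the actual churners' lifetime values.
    cost_base = (y * clv).sum()
    return (cost_base - campaign_cost(y, c, gamma, clv, c_offer, c_a)) / cost_base
```

A Savings of 0 means the model is no better than contacting nobody; a negative Savings means the campaign destroys value.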
Customer lifetime value
where s _{ i,t } refers to the consumption of customer i during time period t, and μ refers to the average marginal profit per unit of product usage.
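The CLV display equation itself did not survive extraction; a common discounted form consistent with the definitions of s _{ i,t } and μ is CLV_i = Σ_t μ · s_{i,t} / (1 + r)^t, with a per-period discount rate r. The sketch below assumes that form (an assumption, not necessarily the paper's exact formula):

```python
# Hedged sketch of a discounted customer lifetime value:
# CLV_i = sum over t of mu * s_{i,t} / (1 + r)^t  (assumed form).
def customer_lifetime_value(consumption, mu, r):
    """consumption: list of s_{i,t} values for t = 1..T; mu: marginal
    profit per unit of usage; r: per-period discount rate."""
    return sum(mu * s / (1 + r) ** t
               for t, s in enumerate(consumption, start=1))
```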
Cost-sensitive classification
Classification, in the context of machine learning, deals with the problem of predicting the class y _{ i } of a set of examples S, given their k variables, i.e. \(X_{i}=\left [{x_{i}^{1}}, {x_{i}^{2}},\ldots,{x_{i}^{k}}\right ]\). The objective is to construct a function f(S) that makes a prediction c _{ i } of the class of each example using its variables X _{ i }. Traditionally, predictive modeling methods are designed to minimize some sort of misclassification measure such as the F1-Score (Hastie et al. 2009). However, this means assuming that the different misclassification errors carry the same cost, and as discussed before this is not the case in many real-world applications, specifically in churn modeling.
On the other hand, the oversampling method consists in creating a new set S _{ o } by making w _{ i } copies of each example i. However, cost-proportionate oversampling increases the size of the training set, since |S _{ o }| >> |S|, and it may also result in overfitting (Drummond and Holte 2003). Furthermore, none of these methods uses the full cost matrix but only the misclassification costs, which, as described in the previous section, is not sufficient in churn modeling.
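The companion preprocessing method, cost-proportionate rejection sampling (Zadrozny et al. 2003), can be sketched in a few lines: each example i is kept with probability w_i / max(w), so costly examples are over-represented without copying anything (names here are illustrative):

```python
import random

# Cost-proportionate rejection sampling: keep example i with
# probability w_i / max(w), where w_i is its misclassification cost.
def rejection_sample(examples, weights, seed=0):
    rng = random.Random(seed)          # seeded for reproducibility
    w_max = max(weights)
    return [x for x, w in zip(examples, weights)
            if rng.random() < w / w_max]
```

Examples carrying the maximum weight are always retained, while cheap examples are discarded with high probability, which is why the resulting set (r in the tables below) is much smaller than the original training set.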
However, the aforementioned methods only introduce the cost by modifying the training set. In (Correa Bahnsen et al. 2013, 2014b), a cost-sensitive model called the Bayes minimum risk classifier (BMR) was proposed.
then the example i is classified as negative. This means that the risk associated with the decision c _{ i } is lower than the risk associated with classifying it as positive. However, when using the output of a binary classifier as a basis for decision making, there is a need for a probability that not only separates well between positive and negative examples, but that also assesses the real probability of the event (Cohen and Goldszmidt 2004). Given that, the estimated probabilities are usually calibrated, either by isotonic regression, Platt scaling, or the ROC convex hull methodology (Hernandez-Orallo et al. 2012).
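The BMR decision rule can be sketched directly from the example-dependent cost matrix: given a calibrated churn probability p_i, predict churner only when the expected cost (risk) of doing so is not higher than that of predicting non-churner (variable names are illustrative):

```python
# Bayes minimum risk decision for one example, given its calibrated
# churn probability p and its example-dependent costs.
def bmr_predict(p, c_tp, c_fp, c_fn, c_tn=0.0):
    risk_pos = p * c_tp + (1 - p) * c_fp   # risk of predicting churner
    risk_neg = p * c_fn + (1 - p) * c_tn   # risk of predicting non-churner
    return 1 if risk_pos <= risk_neg else 0
```

Note that the threshold is no longer a fixed 0.5: a customer with a large CLV_i may be contacted even at a low churn probability, because the false-negative risk dominates.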
where \(h_{\theta }(X_{i})=g\biggl (\sum _{j=1}^{k}{\theta ^{j}{x_{i}^{j}}}\biggr)\) refers to the hypothesis of i given the parameters θ, and g(·) is the logistic sigmoid function, defined as g(z)=1/(1+e ^{−z }). To find the coefficients of the regression θ, the cost function is minimized using binary genetic algorithms (Haupt and Haupt 2004).
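The cost function referenced above (its display equation did not survive extraction) replaces the log-likelihood with the mean example-dependent expected cost of the predicted probabilities. The formulation below is our reconstruction from the cost matrix, with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Reconstructed cost-sensitive logistic regression objective: each example
# contributes its expected cost under the predicted probability h_i.
def cs_logistic_cost(theta, X, y, c_tp, c_fp, c_fn, c_tn):
    h = sigmoid(X @ theta)
    cost = (y * (h * c_tp + (1 - h) * c_fn)
            + (1 - y) * (h * c_fp + (1 - h) * c_tn))
    return cost.mean()
```

Because this objective is not convex in θ, a derivative-free optimizer such as the binary genetic algorithm mentioned above is a natural fit.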
Following the same objective of modifying an existing algorithm by introducing the different costs into its calculation, a cost-sensitive decision tree algorithm (Correa Bahnsen et al. 2015) was recently proposed. In this method a new splitting criterion is used during the tree construction. In particular, instead of using a traditional splitting criterion such as Gini, entropy or misclassification, the Cost as defined in (4) of each tree node is calculated, and the gain of using each split is evaluated as the resulting decrease in the total Cost of the algorithm.
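The cost-based splitting criterion can be sketched as follows: a node is labeled with whichever blanket prediction (all churners or all non-churners) gives the lower total example-dependent cost, and a candidate split is scored by the decrease in cost it achieves (names are illustrative, not the library's API):

```python
import numpy as np

def node_cost(y, c_tp, c_fp, c_fn):
    """Cost of a node under its cheaper blanket prediction."""
    cost_pos = (y * c_tp + (1 - y) * c_fp).sum()  # predict all as churners
    cost_neg = (y * c_fn).sum()                   # predict all as non-churners
    return min(cost_pos, cost_neg)

def split_gain(mask, y, c_tp, c_fp, c_fn):
    """Decrease in total cost obtained by splitting on a boolean mask."""
    parent = node_cost(y, c_tp, c_fp, c_fn)
    left = node_cost(y[mask], c_tp[mask], c_fp[mask], c_fn[mask])
    right = node_cost(y[~mask], c_tp[~mask], c_fp[~mask], c_fn[~mask])
    return parent - (left + right)
```

The tree greedily chooses, at each node, the split with the largest such gain, so the costs shape the tree during training rather than only at prediction time.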
Experimental setup
In this section we describe the dataset used to evaluate the different cost-insensitive and cost-sensitive classification algorithms. Afterwards, we show the procedure used to estimate the probability of acceptance (γ _{ i }) of each customer. Lastly, the partitioning of the dataset is shown.
Database
For this paper we used a dataset provided by a TV cable provider. The dataset consists of active customers during the first semester of 2014. The total dataset contains 9,410 individual registries, each one with 45 attributes, including a churn label indicating whether a customer is a churner. This label was created internally by the company, and can be regarded as highly accurate. In the dataset only 455 customers are churners, leading to a churn ratio of 4.83 %.
Offer acceptance calculation
In practice, companies have a set of offers to make to a customer as part of a retention campaign; they vary from discounts to upgrades, among others. In the particular case of a TV cable provider, the offers include adding a new set of channels, changing the TV receiver to one with newer technology (i.e., high definition, video recording, 4K), or offering a discount on the monthly bill. Unsurprisingly, not all offers apply to all clients. For instance, a customer that already has all the channels cannot be offered a new set of channels. Moreover, an offer usually means an additional cost to the company, and not all offers have the same cost or the same impact in reducing churn.
Taking into account the cost and the implications of the offers, the problem can be summarized as making each customer the offer that will maximize the acceptance rate and, more importantly, reduce the overall cost.
Database partitioning
Description of datasets
Set                        N      π _{1}   C _{0}
Total                      9,410  .0483   580,884
Training (t)               3,758  .0505   244,542
Validation                 2,824  .0477   174,171
Testing                    2,825  .0442   162,171
Under-sampling (u)         374    .5080   244,542
CS Rejection-sampling (r)  428    .4135   431,428
CS Over-sampling (o)       5,767  .03124  2,350,285
Results
Results of the decision tree (DT), logistic regression (LR) and random forest (RF) algorithms, estimated using the different training sets: training (t), under-sampling (u), cost-proportionate rejection-sampling (r) and cost-proportionate over-sampling (o)

Algorithm  Set  Savings          F1-Score
DT         t    0.0001 ± 0.0193  0.0750 ± 0.0199
           u    0.0370 ± 0.0603  0.1177 ± 0.0108
           r    0.0018 ± 0.0549  0.1200 ± 0.0129
           o    0.0249 ± 0.0203  0.1019 ± 0.0189
LR         t    0.0001 ± 0.0002  0.0000 ± 0.0000
           u    0.0062 ± 0.0487  0.1227 ± 0.0097
           r    0.0500 ± 0.0372  0.1260 ± 0.0112
           o    0.0320 ± 0.0225  0.1088 ± 0.0199
RF         t    0.0026 ± 0.0081  0.0245 ± 0.0148
           u    0.0424 ± 0.0547  0.1342 ± 0.0113
           r    0.1033 ± 0.0402  0.1443 ± 0.0127
           o    0.0205 ± 0.0161  0.0845 ± 0.0204
Results of the decision tree (DT), logistic regression (LR) and random forest (RF) algorithms combined with the Bayes minimum risk (BMR) classifier, estimated using the different training sets

Algorithm  Set  Savings          F1-Score
DT-BMR     t    0.0303 ± 0.0148  0.0946 ± 0.0158
           u    0.0574 ± 0.0387  0.1095 ± 0.0203
           r    0.0652 ± 0.0365  0.1151 ± 0.0169
           o    0.0306 ± 0.0149  0.0924 ± 0.0184
LR-BMR     t    0.1058 ± 0.0319  0.1361 ± 0.0154
           u    0.0963 ± 0.0388  0.1319 ± 0.0166
           r    0.0823 ± 0.0364  0.1240 ± 0.0153
           o    0.0986 ± 0.0287  0.1333 ± 0.0149
RF-BMR     t    0.0835 ± 0.0349  0.1252 ± 0.0151
           u    0.1300 ± 0.0368  0.1429 ± 0.0127
           r    0.1336 ± 0.0348  0.1429 ± 0.0132
           o    0.0907 ± 0.0359  0.1275 ± 0.0136
Results of the cost-sensitive logistic regression (CSLR) algorithm, estimated using the different training sets

Algorithm  Set  Savings          F1-Score
CSLR       t    0.2418 ± 0.0859  0.1079 ± 0.0318
           u    0.1933 ± 0.0879  0.0908 ± 0.0055
           r    0.1971 ± 0.0897  0.0911 ± 0.0057
           o    0.2042 ± 0.0914  0.0917 ± 0.0060
Results of the cost-sensitive decision tree (CSDT) algorithm, estimated using the different training sets

Algorithm  Set  Savings          F1-Score
CSDT       t    0.3062 ± 0.0338  0.1254 ± 0.0210
           u    0.1674 ± 0.0942  0.0922 ± 0.0063
           r    0.1931 ± 0.1002  0.0935 ± 0.0071
           o    0.2716 ± 0.1157  0.1002 ± 0.0102
Conclusions
In this paper a new framework for cost-sensitive churn predictive modeling was presented. First, we showed the importance of using the actual financial costs of the churn modeling process, since there are significant differences in the results when evaluating a churn campaign using a traditional measure such as the F1-Score versus using a measure that incorporates the actual financial costs, such as the savings. Moreover, we also showed the importance of having a measure that differentiates the costs within customers, since different customers have quite different financial impacts as measured by their lifetime value. Also, this framework can be expanded by using an additional classifier to predict the offer response probability per customer.
Furthermore, our evaluations confirmed that including the costs of each example and using example-dependent cost-sensitive methods leads to better results in the sense of higher savings. In particular, by using the cost-sensitive decision tree algorithm, the financial savings are increased by 153,237 Euros, as compared to the savings of the cost-insensitive random forest algorithm, which amount to just 24,629 Euros.
Additionally, by testing the different example-dependent cost-sensitive classification methods, we observed that when the costs are included during the preprocessing stage, by using the cost-proportionate sampling methods, the savings are 60,005 Euros. On the other hand, when the costs are included after the training, with the Bayes minimum risk algorithm, the savings are 77,606 Euros. Finally, by using the cost-sensitive decision tree algorithm, which includes the costs during the training phase, the savings increase quite significantly, to 177,867 Euros, hence confirming the importance of using an algorithm that takes into account the different example-dependent costs during the training phase.
Declarations
Acknowledgments
Funding for this research was provided by the Fonds National de la Recherche, Luxembourg.
References
 Cohen, I, & Goldszmidt, M (2004). Properties and Benefits of Calibrated Classifiers. In Knowledge Discovery in Databases: PKDD 2004, Pisa, Italy, (pp. 125–136).
 Correa Bahnsen, A (2015). CostSensitiveClassification Library in Python. http://dx.doi.org/10.5281/zenodo.17789.
 Correa Bahnsen, A, Stojanovic, A, Aouada, D, Ottersten, B (2013). Cost Sensitive Credit Card Fraud Detection Using Bayes Minimum Risk. In 2013 12th International Conference on Machine Learning and Applications. IEEE, Miami, USA, (pp. 333–338).
 Correa Bahnsen, A, Aouada, D, Ottersten, B (2014a). Example-Dependent Cost-Sensitive Logistic Regression for Credit Scoring. In 2014 13th International Conference on Machine Learning and Applications. IEEE, Detroit, USA, (pp. 263–269).
 Correa Bahnsen, A, Stojanovic, A, Aouada, D, Ottersten, B (2014b). Improving Credit Card Fraud Detection with Calibrated Probabilities. In Proceedings of the Fourteenth SIAM International Conference on Data Mining, Philadelphia, PA, (pp. 677–685).
 Correa Bahnsen, A, Aouada, D, Ottersten, B (2015). Example-Dependent Cost-Sensitive Decision Trees. Expert Systems with Applications, in press. http://doi.org/10.1016/j.eswa.2015.04.042.
 Drummond, C, & Holte, R (2003). C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, ICML, Washington, DC, USA.
 Elkan, C (2001). The Foundations of Cost-Sensitive Learning. In Seventeenth International Joint Conference on Artificial Intelligence, (pp. 973–978).
 Farris, PW, Bendle, NT, Pfeifer, PE, Reibstein, DJ (2010). Marketing Metrics: The Definitive Guide to Measuring Marketing Performance, 2nd edn, (p. 432). New Jersey, USA: Pearson FT Press.
 Glady, N, Baesens, B, Croux, C (2009). Modeling churn using customer lifetime value. European Journal of Operational Research, 197(1), 402–411.
 Hastie, T, Tibshirani, R, Friedman, J (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction.
 Haupt, RL, & Haupt, SE (2004). Practical Genetic Algorithms, 2nd edn, (p. 261). New Jersey: John Wiley & Sons, Inc.
 Hernandez-Orallo, J, Flach, P, Ferri, C (2012). A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss. Journal of Machine Learning Research, 13(July), 2813–2869.
 KhakAbi, S, Gholamian, MR, Namvar, M (2010). Data Mining Applications in Customer Churn Management. In 2010 International Conference on Intelligent Systems, Modelling and Simulation, Liverpool, UK, (pp. 220–225).
 Marsland, S (2009). Machine Learning: An Algorithmic Perspective. New Jersey, USA: CRC Press.
 Milne, GR, & Boza, ME (1999). Trust and Concern in Consumers’ Perception of Marketing Information Management Practices. Journal of Interactive Marketing, 13(1), 5–24.
 Neslin, SA, Gupta, S, Kamakura, W, Lu, J, Mason, CH (2006). Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models. Journal of Marketing Research, 43(2), 204–211.
 Ngai, EWT, Xiu, L, Chau, DCK (2009). Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2), 2592–2602.
 Pfeifer, PE, Haskins, ME, Conroy, RM (2004). Customer lifetime value, customer profitability, and the treatment of acquisition spending. Journal of Managerial Issues, 17(1), 11–25.
 Provost, F, Fawcett, T, Kohavi, R (1998). The case against accuracy estimation for comparing induction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, (pp. 445–453).
 van Raaij, EM, Vernooij, MJA, van Triest, S (2003). The implementation of customer profitability analysis: A case study. Industrial Marketing Management, 32, 573–583.
 Verbeke, W, Dejaeger, K, Martens, D, Hur, J, Baesens, B (2012). New insights into churn prediction in the telecommunication sector: A profit driven data mining approach. European Journal of Operational Research, 218(1), 211–229.
 Verbraken, T (2012). Toward profit-driven churn modeling with predictive marketing analytics. In Cloud Computing and Analytics: Innovations in E-business Services. Workshop on E-Business (WEB2012), Orlando, US.
 Verbraken, T, Verbeke, W, Baesens, B (2013). A novel profit maximizing metric for measuring classification performance of customer churn prediction models. IEEE Transactions on Knowledge and Data Engineering, 25(5), 961–973.
 Wang, T (2013). Efficient Techniques for Cost-Sensitive Learning with Multiple Cost Considerations. PhD thesis, University of Technology, Sydney.
 Zadrozny, B, Langford, J, Abe, N (2003). Cost-sensitive learning by cost-proportionate example weighting. In Third IEEE International Conference on Data Mining, (pp. 435–442).
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.