Customer Churn – Logistic Regression with R


In the customer management lifecycle, customer churn refers to a customer's decision to end the business relationship; it is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a 60% loyalty rate, its churn rate is 40%. As per the 80/20 customer profitability rule, 20% of customers generate 80% of revenue, so it is very important to predict which users are likely to churn and which factors affect their decisions. In this blog post, we show how a logistic regression model built in R can be used to identify customer churn in a telecom dataset.

Learning/Prediction Steps


Data Description

The telecom dataset contains details for 7,000+ unique customers, with each customer represented by a single row. The structure of the dataset is shown below.

[Figure: structure of the dataset]

Input Variables: These variables are also called predictors or independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Annual charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variable: This variable is also called the response or dependent variable. Since the output variable (the churn value) takes a binary form ("0" or "1"), this is a classification problem in supervised machine learning.

[Figure: first rows of the dataset]
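The dataset can be loaded and inspected along these lines — a minimal, self-contained sketch that uses a small toy data frame in place of the real file (the column names follow the public telco churn dataset and are assumptions here):

```r
# Toy stand-in for the telecom dataset: one row per customer,
# a mix of demographic, billing, and relationship variables.
churn <- data.frame(
  customerID     = c("0001", "0002", "0003", "0004"),
  gender         = c("Female", "Male", "Male", "Female"),
  SeniorCitizen  = c(0, 1, 0, 0),
  tenure         = c(1, 34, 2, 45),
  MonthlyCharges = c(29.85, 56.95, 53.85, 42.30),
  Contract       = c("Month-to-month", "One year", "Month-to-month", "One year"),
  Churn          = c("No", "No", "Yes", "No")
)

str(churn)           # structure of the data frame, mixed variable types
table(churn$Churn)   # distribution of the binary response
```

With the real file, `read.csv()` on the downloaded CSV followed by the same `str()` and `table()` calls gives the overview shown in the figures.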

Data Preprocessing

  • Data cleansing and preparation are done in this step. Transforming a continuous variable into a meaningful factor variable can improve model performance and help surface insights from the data. For example, in this dataset the continuous tenure variable is converted into a factor variable with ranges in months, which makes it possible to see which tenure ranges of customers are driving the churn decision.
  • As part of data cleansing, missing values are identified using a missing map plot. The telecom dataset has only a small number of records with missing values, and these are dropped from the analysis.

[Figure: missing-value counts and missing map plot]

  • Custom logic is implemented to derive a new categorical feature (tenure interval) from the continuous tenure variable. Since they do not affect the predicted value, the customer ID and the original tenure column are dropped from further processing.


  • A few categorical variables contain duplicate reference values that refer to the same level. For example, the "MultipleLines" feature has the possible values "Yes", "No", and "No phone service". Since "No" and "No phone service" have the same meaning, these records are consolidated to a single reference value.
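The preprocessing steps above can be sketched in base R as follows. This is a self-contained illustration on a toy data frame; the interval breaks are an assumption (the post only names the "0-6 Month" bucket explicitly):

```r
# Toy stand-in for the raw telecom data, including one missing value
# and the duplicate "No phone service" reference level.
churn <- data.frame(
  customerID    = c("0001", "0002", "0003", "0004", "0005"),
  tenure        = c(2, 10, 30, 60, NA),
  MultipleLines = c("No", "Yes", "No phone service", "Yes", "No"),
  Churn         = c("Yes", "No", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# 1. Drop the small number of records with missing values
churn <- na.omit(churn)

# 2. Derive a categorical tenure interval from the continuous tenure
churn$tenure_interval <- cut(churn$tenure,
                             breaks = c(0, 6, 12, 24, 48, Inf),
                             labels = c("0-6 Month", "6-12 Month",
                                        "12-24 Month", "24-48 Month",
                                        "> 48 Month"))

# 3. Collapse duplicate reference levels that mean the same thing
churn$MultipleLines[churn$MultipleLines == "No phone service"] <- "No"

# 4. Drop columns that carry no predictive value
churn$customerID <- NULL
churn$tenure     <- NULL

str(churn)
```

The same four steps, applied to the full dataset, produce the modeling frame used in the next section.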


Partitioning the Data & Logistic Regression

  • In predictive modeling, the data need to be partitioned into train and test sets: 70% of the data is used for training and 30% for testing.
  • In this dataset, 4K+ customer records are used for training and 2K+ records for testing.
  • Classification algorithms such as logistic regression, decision trees, and random forests, available in R, Python, or Spark ML, can be used to predict churn.
  • Multiple models can be run on the telecom dataset to compare performance and error rate and choose the best model. In this blog post, we use a logistic regression model in R, fitted with the glm() function from the built-in stats package. Future posts will cover other models and combinations of models.
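The split and the model fit can be sketched in base R as follows. Since the real telecom file is not bundled here, a simulated data frame stands in for it; the predictor names mirror the derived features from the preprocessing step:

```r
# Simulated stand-in for the preprocessed telecom data.
set.seed(42)
n <- 1000
df <- data.frame(
  tenure_interval = factor(sample(c("0-6 Month", "6-12 Month", "> 12 Month"),
                                  n, replace = TRUE)),
  Contract        = factor(sample(c("Month-to-month", "One year", "Two year"),
                                  n, replace = TRUE)),
  MonthlyCharges  = runif(n, 20, 120)
)
# Simulate churn with higher risk for short tenure and monthly contracts
p <- plogis(-1 + 1.2 * (df$tenure_interval == "0-6 Month") +
                 0.8 * (df$Contract == "Month-to-month"))
df$Churn <- rbinom(n, 1, p)

# 70/30 train/test partition
idx   <- sample(seq_len(n), size = round(0.7 * n))
train <- df[idx, ]
test  <- df[-idx, ]

# Logistic regression via the built-in glm() (stats package)
model <- glm(Churn ~ tenure_interval + Contract + MonthlyCharges,
             data = train, family = binomial(link = "logit"))
summary(model)
```

Packages such as caret offer stratified partitioning (`createDataPartition`), but a plain random sample is enough to show the idea.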


Model Summary

From the model summary, the response (churn) variable is affected by the tenure interval, contract period, paperless billing, senior citizen, and multiple lines variables. The significance of each variable is indicated by the codes printed next to the coefficients (*** – highly significant, * – moderately significant, and . – the next level of significance). Refitting the model with only these significant predictors can improve model performance and accuracy.

[Figure: model summary output]
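The significance codes in the summary correspond to the p-values in the coefficient table, which can also be read programmatically. A small self-contained sketch (the single predictor `x` here is a stand-in for the model's terms):

```r
# Fit a small logistic model, then pull the coefficient table out of
# summary() and flag the terms that summary() would mark with "*" or better.
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(1.5 * x))
m <- glm(y ~ x, family = binomial)

coefs <- summary(m)$coefficients
# Column "Pr(>|z|)" holds the p-values behind the significance codes
signif_terms <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
print(signif_terms)
```

The same idiom applied to the churn model recovers the tenure-interval, contract, and billing terms called out above.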

Prediction Accuracy

  • Models built on the train dataset are evaluated on the test dataset. Accuracy and error rate are used to understand how the models behave on unseen data, and the best model is selected using these measures.
  • Confusion matrix / misclassification table: a table used to describe the performance of a classification model on test data. It cross-tabulates the actual values against the predicted values, giving the counts of correctly and wrongly classified customers.


  • Several measures are derived from the confusion matrix, including accuracy, error rate, precision, recall, and F1 score.


  • For this logistic regression model, the accuracy on the test set works out to 80%, with an error rate of 20%. The accuracy may be improved with other classification models, such as decision trees and random forests with parameter tuning.
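Scoring the test set and deriving these measures can be sketched as follows. The actual labels and predicted probabilities here are simulated stand-ins, since the real test set is not reproduced in this post:

```r
# Simulated actual labels and predicted probabilities for 300 test records.
set.seed(7)
actual <- rbinom(300, 1, 0.3)
prob   <- ifelse(actual == 1, rbeta(300, 4, 2), rbeta(300, 2, 4))

predicted <- ifelse(prob > 0.5, 1, 0)   # 0.5 decision threshold

# Confusion matrix: actual vs. predicted counts
cm <- table(Actual = actual, Predicted = predicted)
print(cm)

TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]

accuracy  <- (TP + TN) / sum(cm)
errorrate <- 1 - accuracy
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, F1 = f1), 2)
```

With the real model, `prob` would come from `predict(model, newdata = test, type = "response")`; everything downstream of the threshold is identical.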

To read the original post and download the code, click here


Comment by Purushothaman Gopalakrishnan on November 30, 2020 at 10:00pm

Hi, I am not able to download the code or see the model diagrams. Can anyone help me, please?

Comment by Renaud Montes on February 21, 2019 at 4:38pm

Dead link :(

Comment by Alex Krolak on March 1, 2018 at 12:56pm


What Raghavan said is true in the context of a logistic regression model. Making a continuous variable into more similar "bins" helps the logistic regression algorithm pick out the riskier vs less risky bins. For example, if customers with low tenure and high tenure are high risk, but middle tenure are low risk, there's no way to model that relationship without cutting the variable into the 3 bins. This way the logistic regression can say each group has its own risk associated with it.

However, decision tree models do NOT typically benefit from discretizing the data's continuous features. In fact, this method typically makes the model worse - which is sometimes the price we pay for interpretability when using these types of models.

Comment by Raghavan Madabusi on August 10, 2017 at 3:47am


Transforming the continuous tenure variable into a meaningful factor variable helps the business identify which tenure intervals customers are churning from. In general, the tenure variable affects model performance (the churn prediction) whether it is used as a continuous variable or as a factor variable.

Feature engineering with domain knowledge will always improve model accuracy and provide insights into the data. With this data, the "0-6 Month" tenure category has more importance than the other tenure interval values.
I have used only the tenure interval as a feature for the prediction; we could run multiple models with different feature combinations and compare their accuracy:
1) Model 1 - only the categorical tenure_interval
2) Model 2 - only the continuous tenure
3) Model 3 - both the categorical tenure_interval and the continuous tenure
Comment by Raghavan Madabusi on August 10, 2017 at 3:46am

@Nommel, Please click the link at the bottom of the post that says "read original post". The CM is located there

Comment by Nommel Djedjero on May 15, 2017 at 9:55am

Hey guys, I have been looking over and over and I can't find the confusion matrix.

Comment by Raghavan Madabusi on April 18, 2017 at 3:08pm


Can you double check your calculation? I'm getting the following:

Accuracy: 80%

Precision: 66%

Recall: 52%

F1: 0.58 (rounded to 2 decimal places)

Comment by Alessandro Trinca Arnould on April 17, 2017 at 9:42pm

Very interesting
