
Customer Churn – Logistic Regression with R

Overview

In the customer management lifecycle, customer churn refers to a decision made by the customer to end the business relationship; it is also referred to as loss of clients or customers. Customer loyalty and customer churn always add up to 100%: if a firm has a 60% loyalty rate, its churn rate is 40%. Per the 80/20 customer-profitability rule, 20% of customers generate 80% of revenue, so it is very important to predict which users are likely to churn and which factors affect those decisions. In this blog post, we show how a logistic regression model built with R can be used to identify customer churn in a telecom dataset.

Learning/Prediction Steps

[Figure: learning/prediction workflow diagram]

Data Description

The telecom dataset contains details for 7,000+ unique customers, with each customer represented by a single row. The structure of the dataset is shown below.

[Figure: structure of the dataset]

Input Variables: These variables are also called predictors or independent variables.

  • Customer Demographics (Gender and Senior citizenship)
  • Billing Information (Monthly and Annual charges, Payment method)
  • Product Services (Multiple line, Online security, Streaming TV, Streaming Movies, and so on)
  • Customer relationship variables (Tenure and Contract period)

Output Variables: These variables are also called response or dependent variables. Since the output variable (the churn value) takes the binary form “0” or “1”, this is a classification problem in supervised machine learning.

[Figure: first rows of the dataset]
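Encoding the label can be sketched in a couple of lines. This is not the post's exact code, just a minimal illustration assuming a data frame with a character Churn column of "Yes"/"No" values:

```r
# Minimal sketch: encode the Churn label as binary 0/1
# (column names are assumptions, not the post's actual code).
churn <- data.frame(
  customerID = c("0001", "0002", "0003"),
  Churn      = c("Yes", "No", "Yes"),
  stringsAsFactors = FALSE
)
churn$Churn <- ifelse(churn$Churn == "Yes", 1, 0)
churn$Churn  # 1 0 1
```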

Data Preprocessing

  • Data cleansing and preparation are done in this step. Transforming a continuous variable into a meaningful factor variable can improve model performance and help surface insights in the data. For example, in this dataset the continuous tenure variable is converted into a factor variable of month ranges, which makes it easier to understand which kinds of customers, by tenure, are making the churn decision.
  • As part of data cleansing, missing values are identified using a missingness map plot. The telecom dataset has only a minimal number of records with missing values, and these are dropped from the analysis.

[Figure: NA counts per column] [Figure: missing-value map]
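The missing-value step can be sketched with base R alone (the missingness map in the post is typically drawn with a package such as Amelia's missmap, but counting and dropping NAs needs no extra packages). The toy data frame below is illustrative, not the real dataset:

```r
# Sketch: count missing values per column, then keep only complete rows.
df <- data.frame(
  tenure         = c(1, 34, NA, 45),
  MonthlyCharges = c(29.85, 56.95, 53.85, NA)
)
sapply(df, function(x) sum(is.na(x)))  # NA count per column
df <- df[complete.cases(df), ]         # drop rows with any NA
nrow(df)                               # 2 fully observed rows remain
```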

  • Custom logic is implemented to derive a categorical variable from the continuous tenure variable. Since they do not affect the prediction, the customer ID and the raw tenure values are dropped from further processing.

[Figure: custom logic for deriving the tenure interval]

  • A new categorical feature is created as described above.

[Figure: head of the data with the new tenure-interval feature]
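The tenure-binning logic can be sketched with cut(); the break points and labels below are illustrative assumptions, since the post's exact intervals are only shown in the screenshot:

```r
# Sketch: bin continuous tenure (months) into a factor of intervals.
tenure <- c(2, 8, 15, 30, 50, 70)
tenure_interval <- cut(
  tenure,
  breaks = c(0, 6, 12, 24, 48, 60, Inf),
  labels = c("0-6 Month", "6-12 Month", "12-24 Month",
             "24-48 Month", "48-60 Month", "> 60 Month")
)
table(tenure_interval)  # one customer per bucket in this toy example
```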

  • A few categorical variables have duplicate reference values that refer to the same level. For example, the “MultipleLine” feature takes the values “Yes”, “No”, and “No Phone Service”. Since “No” and “No Phone Service” have the same meaning, these records are replaced with a single reference value.

[Figure: cleaned categorical variables]
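Collapsing the duplicate level is a one-line replacement; a minimal sketch on a toy vector:

```r
# Sketch: collapse "No Phone Service" into "No" so the factor has a
# single reference value per meaning.
MultipleLines <- c("Yes", "No", "No Phone Service", "No")
MultipleLines[MultipleLines == "No Phone Service"] <- "No"
unique(MultipleLines)  # "Yes" "No"
```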

Partitioning the Data & Logistic Regression

  • In predictive modeling, the data needs to be partitioned into train and test sets: 70% of the data is used for training and 30% for testing.
  • In this dataset, 4K+ customer records are used for training and 2K+ records for testing.
  • Classification algorithms such as Logistic Regression, Decision Tree, and Random Forest, available in R, Python, or Spark ML, can be used to predict churn.
  • Multiple models can be run on the telecom dataset and compared on performance and error rate to choose the best one. In this blog post, we use a logistic regression model fitted with R's glm() function. Future posts will focus on other models and combinations of models.

[Figure: train/test partitioning code]
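The split and the glm() fit can be sketched end to end. The data here is synthetic and the column names are assumptions standing in for the telecom schema; the mechanics (70/30 split, binomial family, 0.5 threshold) are the point:

```r
# Sketch: 70/30 train/test split and a logistic regression fit.
set.seed(123)
n <- 200
df <- data.frame(
  tenure         = sample(1:72, n, replace = TRUE),
  MonthlyCharges = runif(n, 20, 120)
)
# Synthetic label: short-tenure customers churn more often.
df$Churn <- rbinom(n, 1, plogis(1 - 0.05 * df$tenure))

idx   <- sample(seq_len(n), size = floor(0.7 * n))
train <- df[idx, ]   # 70% for training
test  <- df[-idx, ]  # 30% for testing

model <- glm(Churn ~ tenure + MonthlyCharges,
             data = train, family = binomial(link = "logit"))
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)  # classify at the 0.5 threshold
```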

Model Summary

From the model summary, the churn response variable is affected by the tenure interval, contract period, paperless billing, senior citizen, and multiple-line variables. Variable importance can be read from the significance codes printed beside the coefficients (*** – highest significance, * – medium significance, and . – the next level down). Rerunning the model with only these significant predictors will impact the model's performance and accuracy.

[Figure: model summary output]
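The significance codes are just thresholds on the p-values in the summary's Pr(>|z|) column. A small sketch on synthetic data (not the telecom model) shows where those values live:

```r
# Sketch: extracting coefficient p-values from a glm summary.
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(2 * x))  # x strongly drives y
fit <- glm(y ~ x, family = binomial)
coefs <- summary(fit)$coefficients
coefs[, "Pr(>|z|)"]  # p-value per coefficient
# Codes: '***' p < 0.001, '**' p < 0.01, '*' p < 0.05, '.' p < 0.1
```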

Prediction Accuracy

  • Models built on the train dataset are evaluated on the test dataset. Accuracy and error rate describe how each model behaves on the test data, and these measures drive the selection of the best model.
  • Confusion Matrix / Misclassification Table: a table that describes the performance of a classification model on test data. It cross-tabulates actual values against predicted values, giving counts of correctly and incorrectly classified customers.

[Figure: confusion matrix basics]

  • The various measures derived from the confusion matrix are:

[Figure: measures derived from the confusion matrix]
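These measures are plain arithmetic on the four cells of the matrix. The counts below are illustrative (chosen so the results land near the ~80% accuracy reported here), not the post's actual matrix:

```r
# Sketch: accuracy, precision, recall, and F1 from a 2x2 confusion matrix.
TP <- 260; FP <- 134; FN <- 240; TN <- 1366   # illustrative counts
accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # correct / total
precision <- TP / (TP + FP)                   # predicted churners that churned
recall    <- TP / (TP + FN)                   # actual churners caught
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 2)
```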

  • With logistic regression, the accuracy for this model evaluates to 80% and the error rate to 20%. The accuracy may be improved with other classification models, such as decision trees and random forests, with parameter tuning.



To read the original post and download the code, click here


Comment by Raghavan Madabusi on August 10, 2017 at 3:47am

@Davide,

Transforming the continuous tenure variable into a meaningful factor variable helps the business identify which tenure-interval ranges customers are churning in. In general, the tenure variable affects model performance (the churn prediction) whether we use it as a continuous variable or as a factor variable.

Feature engineering guided by domain knowledge generally improves model accuracy and provides insights into the data. With this data, the "0-6 month" tenure category carries more importance than the other tenure-interval values.
I used only the tenure interval as a feature for the prediction; we could run multiple models with different feature combinations and compare their accuracy:
1) Model 1 - only the categorical tenure_interval
2) Model 2 - only the continuous tenure
3) Model 3 - both the categorical tenure_interval and the continuous tenure
Comment by Raghavan Madabusi on August 10, 2017 at 3:46am

@Nommel, please click the link at the bottom of the post that says "read original post". The confusion matrix is located there.

Comment by Davide Burlon on August 9, 2017 at 10:46pm
"Transforming continuous variable into meaningful factor variable will improve the model performance"
Why is that? Don't we sacrifice some information by factorizing a continuous variable?
Comment by Nommel Djedjero on May 15, 2017 at 9:55am

Hey guys, I have been looking over and over and I can't find the confusion matrix.

Comment by Raghavan Madabusi on April 18, 2017 at 3:08pm

@Liad,

Can you double check your calculation? I'm getting the following:

Accuracy: 80%

Precision: 66%

Recall: 52%

F1: 0.58 (to 2 decimal places)

Comment by Liad Magen on April 18, 2017 at 2:28am

Judging by your confusion matrix, it seems that your dataset is unbalanced: there are many 0 (false) examples compared to 1 (true). That means the accuracy measurement doesn't give you a true indicator of your success rate.

Your precision is 51%, which is like flipping a coin, and your recall is 17%, which makes your F1-score only 25.5%.

A better approach would be to balance your dataset (take equal amounts of churn and non-churn users), or to use anomaly detection methods instead.

Comment by Alessandro Trinca Arnould on April 17, 2017 at 9:42pm

Very interesting

© 2017 Data Science Central