
Customer Churn Analysis Using Statistical Data And Python Code



Most companies run a churn analysis at some point to understand their customer loss rate. With that research in hand, a company can reduce customer attrition by assessing its product and how customers use it. In this post, we look at an interesting dataset from Kaggle. You can find the dataset at Ecommerce Customer Churn Analysis and Prediction | Kaggle

Understanding the dataset

To understand the data better, we first copy the column descriptions from the link above. The dataset has twenty columns.

CustomerID Unique customer ID
Churn Churn Flag
Tenure Tenure of the customer in the organization
PreferredLoginDevice Preferred login device of the customer
CityTier City tier
WarehouseToHome Distance between the warehouse and the customer’s home 
PreferredPaymentMode The preferred payment method used by the customer
Gender Gender of the customer
HourSpendOnApp Number of hours spent on mobile application or website
NumberOfDeviceRegistered The total number of devices registered under the customer’s name
PreferedOrderCat Preferred order category of the customer in last month
SatisfactionScore Satisfaction score given by the customer for the service
MaritalStatus Marital status of the customer
NumberOfAddress Address specifications of the customer
Complain Whether a complaint was raised in the last month
OrderAmountHikeFromlastYear Percentage increase in orders from last year
CouponUsed The total number of coupons that were used last month
OrderCount The total number of orders that were placed in last month
DaySinceLastOrder Days since the customer’s last order
CashbackAmount Average cashback in last month

Based on the dataset, there are 5360 observations. The Churn column shows whether a customer has churned or not. So the target variable is Churn, and all other columns are feature variables. With the column descriptions and the number of observations established, let’s study the dataset.

Column Types

This dataset is slightly complex as it features both categorical and continuous data. So, we need to treat the columns differently based on their types. For the models to work, the data needs to be numeric only. Below are the columns along with their data types. 

CustomerID int64
Churn int64
Tenure float64
PreferredLoginDevice object
CityTier int64
WarehouseToHome float64
PreferredPaymentMode object
Gender object
HourSpendOnApp float64
NumberOfDeviceRegistered int64
PreferedOrderCat object
SatisfactionScore int64
MaritalStatus object
NumberOfAddress int64
Complain int64
OrderAmountHikeFromlastYear float64
CouponUsed float64
OrderCount float64
DaySinceLastOrder float64
CashbackAmount int64
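
Since the two kinds of columns are treated differently, it helps to separate them by type first. Below is a minimal sketch, assuming pandas and a small made-up DataFrame standing in for the Kaggle file (only the column names come from the dataset; the rows are illustrative).

```python
import pandas as pd

# Toy stand-in for the Kaggle file; the values are made up for illustration.
df = pd.DataFrame({
    "CustomerID": [50001, 50002, 50003],
    "Churn": [1, 0, 0],
    "Tenure": [4.0, None, 12.0],
    "Gender": ["Female", "Male", "Male"],
    "MaritalStatus": ["Single", "Married", "Divorced"],
})

# Numeric columns (int64 / float64) can be fed to a model after cleaning.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
# Everything else is text and has to be encoded or dropped later.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(df.dtypes)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

On the full dataset, `df.dtypes` would produce a listing like the one above.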

Feature Engineering

Feature engineering refers to the manipulation of features for statistical purposes. Here, a feature is any column other than the target variable. Every machine learning model uses features for prediction, but not every feature needs to be used. For instance, a few columns have no statistically meaningful effect on the target variable. Removing such unnecessary features helps improve model performance and saves time and memory when training the model.

In this case, the easiest first step is to remove the CustomerID column: it only identifies an individual customer and says nothing about churn behavior.

The next thing to do is to handle the missing values and remove the non-impacting columns.

Handling Missing Values

This particular dataset has missing values. Fortunately, all of them are in the numeric/continuous columns. If that were not the case, the sensible thing to do would be to drop those observations. But since the columns are continuous, one can fill the missing values with the mean of the column. Here is the list of columns with their number of missing values.

Column Name Column Type No. of Missing Values
Churn int64 0
Tenure float64 264
PreferredLoginDevice object 0
CityTier int64 0
WarehouseToHome float64 251
PreferredPaymentMode object 0
Gender object 0
HourSpendOnApp float64 255
NumberOfDeviceRegistered int64 0
PreferedOrderCat object 0
SatisfactionScore int64 0
MaritalStatus object 0
NumberOfAddress int64 0
Complain int64 0
OrderAmountHikeFromlastYear float64 265
CouponUsed float64 256
OrderCount float64 258
DaySinceLastOrder float64 307
CashbackAmount int64 0
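
The two steps described so far, dropping CustomerID and filling gaps with the column mean, might look roughly like this in pandas. This is a sketch on made-up rows, not the article's original code.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names come from the dataset.
df = pd.DataFrame({
    "CustomerID": [50001, 50002, 50003, 50004],
    "Tenure": [4.0, np.nan, 12.0, 8.0],
    "OrderCount": [1.0, 2.0, np.nan, 6.0],
    "Gender": ["Female", "Male", "Male", "Female"],
})

# The identifier carries no predictive signal, so drop it first.
df = df.drop(columns=["CustomerID"])

# Count missing values per column (this is how a table like the one above is produced).
missing_counts = df.isna().sum()
print(missing_counts)

# Fill each numeric column's gaps with that column's own mean.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(df)
```

Mean imputation is the simplest option. The median is a common alternative when a column contains extreme values.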

Data In A Visual Format 

To visualize the data a bit more, one can combine all the categorical variables. Look below for a better understanding: 

Next, identify the outliers. Outliers are observations that deviate markedly from the majority of the observations, and they can appear in a dataset for various reasons. To detect them, it is best to draw a box plot for each continuous variable. To remove them, one can use quartiles, typically through the interquartile range (IQR).
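
One common way to apply the quartile idea is the 1.5 × IQR rule, which is also how box plot whiskers are usually drawn. The article does not state which rule was used, so the factor of 1.5 here is an assumption, and the values are made up.

```python
import pandas as pd

# Made-up distances; 120.0 is the deliberately extreme value.
df = pd.DataFrame(
    {"WarehouseToHome": [6.0, 8.0, 9.0, 11.0, 12.0, 14.0, 15.0, 120.0]}
)

q1 = df["WarehouseToHome"].quantile(0.25)
q3 = df["WarehouseToHome"].quantile(0.75)
iqr = q3 - q1

# Keep rows within 1.5 * IQR of the quartiles (assumed rule, see the note above).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
filtered = df[df["WarehouseToHome"].between(lower, upper)]

print(f"bounds: [{lower}, {upper}]")
print(len(df), "rows before,", len(filtered), "rows after")
```

The same filter would be repeated for each continuous column.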

Standard Scaling

All the data in the continuous columns needs to be scaled. The columns are measured in different units and ranges, which makes them hard to compare directly. Use the StandardScaler to rescale each column to zero mean and unit variance.
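
A minimal sketch of the scaling step, assuming scikit-learn's StandardScaler and two made-up columns on very different scales:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values; the two columns differ in scale by two orders of magnitude.
df = pd.DataFrame({
    "Tenure": [1.0, 5.0, 9.0, 13.0],
    "CashbackAmount": [120.0, 150.0, 180.0, 210.0],
})

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation, then rescales.
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled.round(3))
```

In a full pipeline, the scaler is usually fitted on the training rows only and then applied to the test rows, so that no information from the test set leaks into training.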

Correlation in Continuous Variables

Based on the analysis and heatmaps, one can conclude that CouponUsed and OrderCount are noticeably correlated, which makes sense, as a user with more coupons can place more orders. However, the correlation is just 0.66, so both columns can be kept.
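
The correlation matrix behind such a heatmap can be computed directly with pandas. The sketch below uses made-up rows and a hypothetical cut-off of 0.8 for flagging pairs that are too strongly correlated. Neither the rows nor the cut-off come from the article.

```python
import pandas as pd

# Made-up rows; on the real data the matrix would cover all continuous columns.
df = pd.DataFrame({
    "CouponUsed": [0.0, 1.0, 1.0, 2.0, 4.0, 5.0],
    "OrderCount": [1.0, 1.0, 2.0, 2.0, 5.0, 4.0],
    "Tenure": [10.0, 3.0, 8.0, 1.0, 6.0, 4.0],
})

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))

# List each pair once if its absolute correlation exceeds the cut-off.
threshold = 0.8  # hypothetical cut-off, not taken from the article
cols = list(corr.columns)
strong_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```

A pair that exceeds the cut-off is a candidate for dropping one of its two columns, since both carry largely the same information.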

Correlation between Categorical Values

To check how the categorical columns relate to the Churn variable, one can apply the Chi-Squared test of independence. The values reported below are read as p-values: the lower the value, the stronger the evidence that the column is related to the target variable, while a value close to 1 suggests independence.

Below are the Chi-Squared test results:

Column Chi-Squared Result
PreferredLoginDevice 0.982755
PreferredPaymentMode 0.999967
Gender 0.063260
PreferedOrderCat 0.996104
MaritalStatus 0.956507

The test results show that, except for Gender, the categorical columns appear independent of Churn, which is also evident from the visualization above. So these columns are removed from the features, and only Gender is kept in the dataset.
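
A test of this kind can be run with `scipy.stats.chi2_contingency` on a contingency table built with `pandas.crosstab`. The sketch below uses made-up rows, constructed so that Gender is associated with Churn and PreferredLoginDevice is not. It illustrates the mechanics only and does not reproduce the figures above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up rows (40 in total), built so that only Gender is related to Churn.
df = pd.DataFrame({
    "Churn": [1, 1, 1, 1, 0, 0, 0, 0] * 5,
    "Gender": ["Male", "Male", "Male", "Female",
               "Female", "Female", "Female", "Male"] * 5,
    "PreferredLoginDevice": ["Phone", "Computer"] * 20,
})

p_values = {}
for col in ["Gender", "PreferredLoginDevice"]:
    # Counts of each category per Churn value.
    table = pd.crosstab(df[col], df["Churn"])
    chi2, p, dof, expected = chi2_contingency(table)
    p_values[col] = p

# A small p-value suggests the column and Churn are not independent.
for col, p in p_values.items():
    print(f"{col}: p = {p:.4f}")
```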

Labels Encoding

Label encoding is an essential step for machine learning algorithms. Machine learning models are mathematical models, so they work with numbers. However, the dataset may contain non-numeric data. For instance, in this dataset, Gender is a categorical variable, and one can encode it with label encoding. There are various methods for encoding categorical data as numeric data.

However, the easiest is to use the Label Encoder, as only two values are present in this column. The values of Gender, Male and Female, are converted to 0 and 1.
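
A short sketch of this step, assuming scikit-learn's LabelEncoder and a few made-up rows:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows with the two values present in the Gender column.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

encoder = LabelEncoder()
# Classes are sorted alphabetically, so Female becomes 0 and Male becomes 1.
df["Gender"] = encoder.fit_transform(df["Gender"])

print(list(encoder.classes_))
print(df["Gender"].tolist())
```

For columns with more than two categories, one-hot encoding is often preferred, because label encoding would impose an arbitrary order on the categories.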

Test Train Split

Now that the dataset is ready, it’s time to split it into training and testing datasets. Train data is the data one uses to train the model: the model learns from it and tunes itself accordingly. Test data is the data used to check whether the model’s performance is acceptable. Once the model is trained, one can use the test data to check how accurately it performs.

A common choice is a 70/30 split. In other words, seventy percent of the data goes to training, while thirty percent is used for testing.
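
A 70/30 split can be sketched with scikit-learn's `train_test_split`. The rows are made up, and the `random_state` value is an arbitrary choice that only makes the split repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up feature rows and churn labels.
X = pd.DataFrame({"Tenure": range(10), "Complain": [0, 1] * 5})
y = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Churn")

# test_size=0.3 holds back 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), "training rows,", len(X_test), "test rows")
```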

Random Forest Classifier

Now that we have the train and test data, it is time to prepare the model, for which one can use the Random Forest Classifier. What is this classifier?

What one is facing is a Supervised Classification problem – which means there are labels against every observation. There are various supervised classification algorithms such as Naive Bayes, Decision Trees, Random Forests, Logistic Regression Classification, Support Vector Machines, etc.

Here we go with the Random Forest Classifier, which is built from decision trees. A decision tree reaches a conclusion by recursively splitting the observations.

An example of a Decision Tree for reaching a decision

Decision trees tend to overfit sometimes. However, when a large number of decision trees are combined, they tend to provide results with reasonable accuracy by averaging the results of the different trees. These trees are created by using multiple samples out of the dataset. 

In this case, the Random Forest Classifier is used to train the model. Using the library’s built-in implementation, the number of estimators (trees) is set to 100. When the model is run, it reaches an accuracy of 95%.
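
The training step might look roughly like the following, assuming scikit-learn. The data here is synthetic and generated on the spot, so the printed accuracy says nothing about the 95% figure above. Only `n_estimators=100` mirrors the setting described.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label mostly depends on the first feature,
# so there is a pattern for the forest to learn.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# n_estimators=100 builds a forest of 100 decision trees.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Accuracy is the share of test rows whose label is predicted correctly.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {accuracy:.2f}")
```

For churn data, where churned customers are usually the minority, it is worth checking precision and recall alongside accuracy.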

Can we make it better?

Three vital pieces of data are missing from our e-commerce dataset.

  1. Age of the Customer: This can add further value to the dataset as age can partition the dataset into different customer classes.
  2. Duration of Engagement: One needs to know how long customers browse the e-commerce site. 
  3. Reason for Churn: Why does a person leave the website?


In the above post, we analyzed data from an e-commerce website. We saw how we could engineer features for our purposes. We filled the missing values using the most straightforward method. However, how missing values should be filled deserves more careful consideration.

The dataset is primarily missing age and duration, which are critical pieces of information. E-commerce is a relatively new business model, and older people tend to use it less. This makes it all the more important to know the age of customers who leave the website. If we had the age group, we could create clusters of customers using the K-Means algorithm.

Similarly, duration is also an important factor. Knowing how long people browse an e-commerce website can shed light on the quality of service. Duration and Reason for Churn are connected to each other. The reason for churn can help an e-commerce website focus on where to improve. Having the information about the duration of the engagement can reveal if people are leaving the website quickly or not. Many businesses are opening up based on this model. Hence duration can help identify where people spend the most time and if they are loyal to the brand. 


All the code used for this analysis is available at GitHub and can be found here –

For similar analysis or for in-depth research, head to

How can MindTrades help?

This case study is only a starting point for such in-depth analysis with insight and solutions. MindTrades Consulting Services, a leading marketing agency, specialises in such case studies for the global IT sector, including leading data integration brands such as Diyotta. From Cloud Migration, Big Data, Digital Transformation, Agile Delivery, and Cyber Security to Analytics, MindTrades provides published breakthrough ideas and prompt content delivery. For more information, check


1. Who is this article most useful for?

Procuring such data and assembling it in one place can be a big task for a data scientist. But with a superior data integration tool, it can be done with much more ease. Data integration tools on the market let you create several columns of data with a drag-and-drop interface.

2. What is churn analysis?

According to, Churn analysis is the evaluation of a company’s customer loss rate in order to reduce it. Also referred to as customer attrition rate, churn can be minimized by assessing your product and how people use it.

3. How can such an analysis, made with the help of Python, help a business?

This case study is a hypothesis-driven, interactive approach to product development in a company. This way, you can convert and retain users. This is a more data-driven approach with insights that are actionable.
