Introduction
Every company at some point performs a churn analysis to understand its customer loss rate. With that research in hand, the company can then work to reduce its customer attrition rate by assessing its product and how customers use it. In this post, we look at an interesting dataset from Kaggle. You can find the dataset here: Ecommerce Customer Churn Analysis and Prediction | Kaggle
Understanding the dataset
To understand the data better, we first copy the column descriptions from the link above. The dataset has twenty columns.
| Column | Description |
| --- | --- |
| CustomerID | Unique customer ID |
| Churn | Churn flag |
| Tenure | Tenure of the customer in the organization |
| PreferredLoginDevice | Preferred login device of the customer |
| CityTier | City tier |
| WarehouseToHome | Distance between the warehouse and the customer's home |
| PreferredPaymentMode | The preferred payment method used by the customer |
| Gender | Gender of the customer |
| HourSpendOnApp | Number of hours spent on the mobile application or website |
| NumberOfDeviceRegistered | The total number of devices registered under the customer's name |
| PreferedOrderCat | Preferred order category of the customer in the last month |
| SatisfactionScore | The customer's satisfaction score for the service |
| MaritalStatus | Marital status of the customer |
| NumberOfAddress | The total number of addresses registered by the customer |
| Complain | Whether a complaint was raised in the last month |
| OrderAmountHikeFromlastYear | Percentage increase in order amount from last year |
| CouponUsed | The total number of coupons used in the last month |
| OrderCount | The total number of orders placed in the last month |
| DaySinceLastOrder | Days since the customer's last order |
| CashbackAmount | Average cashback in the last month |
The dataset contains 5,630 observations. The Churn column shows whether a customer has churned or not. Accordingly, the target variable is Churn, and all other columns become feature variables. With the column descriptions and the number of observations established, let's study the dataset.
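To follow along, here is a minimal sketch of loading the data with pandas; the file name `ecommerce_churn.csv` is a hypothetical placeholder, so point it at wherever your copy of the Kaggle download lives:

```python
import pandas as pd

# Load the dataset (hypothetical file name; adjust the path to your copy).
df = pd.read_csv("ecommerce_churn.csv")

# Basic sanity checks: number of observations and the column list.
print(df.shape)
print(df.columns.tolist())
```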
Column Types
This dataset is slightly complex, as it features both categorical and continuous data, so we need to treat the columns differently based on their types. For the models to work, the data needs to be numeric only. Below are the columns along with their data types.
| Column | Data Type |
| --- | --- |
| CustomerID | int64 |
| Churn | int64 |
| Tenure | float64 |
| PreferredLoginDevice | object |
| CityTier | int64 |
| WarehouseToHome | float64 |
| PreferredPaymentMode | object |
| Gender | object |
| HourSpendOnApp | float64 |
| NumberOfDeviceRegistered | int64 |
| PreferedOrderCat | object |
| SatisfactionScore | int64 |
| MaritalStatus | object |
| NumberOfAddress | int64 |
| Complain | int64 |
| OrderAmountHikeFromlastYear | float64 |
| CouponUsed | float64 |
| OrderCount | float64 |
| DaySinceLastOrder | float64 |
| CashbackAmount | int64 |
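A quick sketch for inspecting the types and splitting the columns into categorical and continuous groups (the variable names are illustrative):

```python
# Inspect the data type of every column.
print(df.dtypes)

# Separate the object (categorical) columns from the numeric ones.
categorical_cols = df.select_dtypes(include="object").columns.tolist()
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()

print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)
```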
Feature Engineering
Feature engineering refers to the manipulation of features for statistical purposes. Here, a feature is any column in the observations other than the target variable. Every machine learning model uses features for its predictions; however, not all of the features need to be used. For instance, a few columns may have no statistical impact on the target variable. Removing unnecessary features helps improve model performance and saves time and memory when training the model.
In this case, the easiest step is to remove the CustomerID column, since it only identifies individual customers and carries no predictive information.
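In pandas, dropping the column is a one-liner (a minimal sketch, assuming the DataFrame `df` from the loading step):

```python
# CustomerID only identifies a row; it carries no predictive signal.
df = df.drop(columns=["CustomerID"])
```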
The next things to do are to handle the missing values and remove the non-impacting columns.
Handling Missing Values
This particular dataset has missing values. Fortunately, all of them fall in the numeric/continuous columns. If that were not the case, the sensible thing to do would be to drop the affected observations. But since the columns are continuous, one can fill the missing values with the mean of the column. Here is the list of columns along with the number of missing observations in each.
| Column Name | Column Type | No. of Missing Values |
| --- | --- | --- |
| Churn | int64 | 0 |
| Tenure | float64 | 264 |
| PreferredLoginDevice | object | 0 |
| CityTier | int64 | 0 |
| WarehouseToHome | float64 | 251 |
| PreferredPaymentMode | object | 0 |
| Gender | object | 0 |
| HourSpendOnApp | float64 | 255 |
| NumberOfDeviceRegistered | int64 | 0 |
| PreferedOrderCat | object | 0 |
| SatisfactionScore | int64 | 0 |
| MaritalStatus | object | 0 |
| NumberOfAddress | int64 | 0 |
| Complain | int64 | 0 |
| OrderAmountHikeFromlastYear | float64 | 265 |
| CouponUsed | float64 | 256 |
| OrderCount | float64 | 258 |
| DaySinceLastOrder | float64 | 307 |
| CashbackAmount | int64 | 0 |
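A minimal sketch of the mean imputation; only the float64 columns have gaps, so those are the ones filled:

```python
# Fill each numeric column's missing values with that column's mean.
for col in df.select_dtypes(include="float64").columns:
    df[col] = df[col].fillna(df[col].mean())

# Verify that no missing values remain.
print(df.isnull().sum())
```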
Data In A Visual Format
To visualize the data a bit more, one can plot the distributions of all the categorical variables and look at them side by side.
Next, identify the outliers. Outliers are observations that deviate markedly from the majority of the data, and they can enter a dataset for various reasons. For detecting them, it is best to use box plots for each of the continuous variables; for removing them, one can use the quartile concept, as sketched below.
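A sketch of both steps for a single column, assuming seaborn is available; the same pattern applies to each continuous variable:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plot to eyeball the outliers in one continuous column.
sns.boxplot(x=df["Tenure"])
plt.show()

# IQR rule: drop rows more than 1.5 IQRs outside the quartiles.
q1 = df["Tenure"].quantile(0.25)
q3 = df["Tenure"].quantile(0.75)
iqr = q3 - q1
df = df[(df["Tenure"] >= q1 - 1.5 * iqr) & (df["Tenure"] <= q3 + 1.5 * iqr)]
```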
Standard Scaling
All the data in the continuous variable columns needs to be scaled. Variables measured on very different scales cannot be meaningfully compared or correlated, so one can use the StandardScaler to bring each column to zero mean and unit variance.
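A minimal sketch with scikit-learn's StandardScaler, applied only to the continuous columns:

```python
from sklearn.preprocessing import StandardScaler

# Scale every continuous column to zero mean and unit variance.
continuous_cols = df.select_dtypes(include="float64").columns
scaler = StandardScaler()
df[continuous_cols] = scaler.fit_transform(df[continuous_cols])
```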
Correlation in Continuous Variables
Based on the analysis and the heatmap, one can conclude that CouponUsed and OrderCount are, interestingly, strongly correlated, which makes sense: a user with more coupons can place more orders. However, the correlation is only 0.66, so both columns can be kept.
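A sketch of how such a heatmap can be produced with seaborn:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation between the continuous variables,
# rendered as an annotated heatmap.
corr = df.select_dtypes(include="float64").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```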
Correlation between Categorical Values
To check how the categorical variables relate to the Churn variable, one can apply the Chi-Squared test of independence. A low p-value suggests the variable is associated with the target, while a high p-value suggests the two are independent.
Below are the Chi-Squared test results (p-values):
| Column | Chi-Squared p-value |
| --- | --- |
| PreferredLoginDevice | 0.982755 |
| PreferredPaymentMode | 0.999967 |
| Gender | 0.063260 |
| PreferedOrderCat | 0.996104 |
| MaritalStatus | 0.956507 |
The test results show that, except for Gender, all the categorical columns appear to be independent of the target, which is also evident from the earlier visualization, so they are removed from the feature set. Gender, with by far the lowest p-value, is kept in the dataset.
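One standard way to run this test is SciPy's `chi2_contingency` on a contingency table of each categorical column against Churn (a sketch; the exact procedure behind the numbers above may differ):

```python
import pandas as pd
from scipy.stats import chi2_contingency

categorical_cols = ["PreferredLoginDevice", "PreferredPaymentMode",
                    "Gender", "PreferedOrderCat", "MaritalStatus"]

# Chi-Squared test of independence between each categorical column
# and the Churn target; a high p-value suggests independence.
for col in categorical_cols:
    contingency = pd.crosstab(df[col], df["Churn"])
    _, p_value, _, _ = chi2_contingency(contingency)
    print(f"{col}: p = {p_value:.6f}")
```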
Labels Encoding
Label encoding is an essential aspect of machine learning workflows. Machine learning models work on numbers, as they are mathematical models, yet a dataset may contain non-numeric data. For instance, in this dataset Gender is the remaining categorical variable, and one can encode it with label encoding. There are various methods for converting categorical data to numeric data.
However, the easiest here is the LabelEncoder, as only binary values are present in this column: the Gender values, Male and Female, are converted to 0s and 1s.
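A minimal sketch with scikit-learn's LabelEncoder:

```python
from sklearn.preprocessing import LabelEncoder

# Map the two Gender values to 0 and 1.
encoder = LabelEncoder()
df["Gender"] = encoder.fit_transform(df["Gender"])
```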
Test Train Split
Now that the dataset is ready, it's time to split it into training and testing sets. The training data is what one uses to train the model: the model learns from this data and tunes its parameters accordingly. The test data is used to check whether the model's performance is acceptable. Once the model is trained, the test data shows how accurately it performs on observations it has not seen.
A common split is 70/30: seventy percent of the data goes to training, while thirty percent is held out for testing.
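A sketch of the split with scikit-learn; the `random_state` value is arbitrary and just keeps the split reproducible:

```python
from sklearn.model_selection import train_test_split

# Features are every column except the target.
X = df.drop(columns=["Churn"])
y = df["Churn"]

# Hold out 30% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```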
Random Forest Classifier
Now that we have the train and test data, it is time to build the model, for which one can use the Random Forest Classifier. What is this classifier?
What we are facing is a supervised classification problem, meaning there is a label for every observation. There are various supervised classification algorithms, such as Naive Bayes, Decision Trees, Random Forests, Logistic Regression, and Support Vector Machines.
A strong choice here is the Random Forest Classifier, which is built from decision trees. A decision tree is a standard method of reaching a conclusion by splitting the observations recursively.
(Figure: an example of a decision tree for reaching a decision.)
Decision trees tend to overfit on their own. However, when a large number of decision trees are combined, they tend to produce reasonably accurate results by averaging the outputs of the different trees. The individual trees are built from multiple samples drawn from the dataset.
In this case, the Random Forest Classifier is used to train the model. Using the built-in scikit-learn implementation, the number of estimators is set to 100. When the model is run against the test data, it is about 95% accurate.
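A sketch of the training and evaluation step with scikit-learn, continuing from the split above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# 100 trees, matching the estimator count described above.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```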
Can we make it better?
Three vital pieces of data are missing from our e-commerce dataset:
- Age of the Customer: This can add further value to the dataset as age can partition the dataset into different customer classes.
- Duration of Engagement: One needs to know how long customers browse the e-commerce site.
- Reason for Churn: Why does a person leave the website?
Conclusion
In the post above, we analyzed data from an e-commerce website and saw how to engineer features for our purposes. We filled the missing values using the most straightforward method; in practice, it is worth thinking carefully about how missing values should be filled.
The dataset is primarily missing age and duration, which are critical pieces of information. If we look at the market, e-commerce is a relatively new business model, and older people are underrepresented on it. That makes it all the more important to know a customer's age when they leave the website. If we had age groups, we could probably create clusters of customers using the K-Means algorithm.
Similarly, duration is an important factor. Knowing how long people browse an e-commerce website can shed light on the quality of service. Duration and reason for churn are connected: the reason for churn can tell an e-commerce website where to focus its improvements, while information about the duration of engagement can reveal whether people are leaving the website quickly. Many businesses are opening up based on this model, so duration can also help identify where people spend the most time and whether they are loyal to the brand.
Code
All the code used for this analysis is available at GitHub and can be found here – https://github.com/Mindtrades-Consulting/Customer-Churn-Analysis-Using-Statistical-Data-And-Python-Code
For similar analysis or for in-depth research, head to mindtrades.com
How can MindTrades help?
This case study is only a starting point for such in-depth analysis with insights and solutions. MindTrades Consulting Services, a leading marketing agency, specialises in such case studies for the global IT sector, including leading data integration brands such as Diyotta. From Cloud Migration, Big Data, Digital Transformation, Agile Delivery, and Cyber Security to Analytics, MindTrades publishes breakthrough ideas and delivers content promptly. For more information, check https://www.mindtrades.com
FAQs:
1. Who is this article most useful for?
Procuring such data and assembling it in one place can be a big task for a data scientist. But with a superior data integration tool, it can be done with much greater ease: several columns of data can be created through the drag-and-drop interfaces of data integration tools on the market.
2. What is churn analysis?
According to Profitwell.com, Churn analysis is the evaluation of a company’s customer loss rate in order to reduce it. Also referred to as customer attrition rate, churn can be minimized by assessing your product and how people use it.
3. How can such an analysis, made with the help of Python, help a business?
This case study demonstrates a hypothesis-driven, interactive approach to product development in a company. This way, you can convert and retain users. It is a more data-driven approach, with insights that are actionable.