Many businesses will agree that it takes far more time, money, and resources to win new customers than to keep existing ones. Hence, they are very interested in knowing how many existing customers are leaving. This is known as churn.
Churn tells business owners how many customers have stopped using their products and services. It can also be expressed as the rate at which revenue is lost when customers or employees leave the company. The churn rate gives companies an indication of business performance: if it is higher than the growth rate, the business is shrinking rather than growing.
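The comparison between churn rate and growth rate can be made concrete with a little arithmetic. The sketch below is only an illustration: the function names and the customer counts are hypothetical, and it assumes both rates are measured against the number of customers at the start of the period.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the fraction of starting customers lost during a period."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost / customers_at_start


def growth_rate(customers_at_start: int, customers_gained: int) -> float:
    """Return the fraction of new customers gained relative to the starting base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_gained / customers_at_start


# Hypothetical month: 1,000 customers at the start, 50 left, 30 joined.
churn = churn_rate(1000, 50)
growth = growth_rate(1000, 30)
print(f"Churn rate: {churn:.1%}, growth rate: {growth:.1%}")
print("Shrinking" if churn > growth else "Growing or flat")
```

With these made-up numbers the churn rate (5.0%) exceeds the growth rate (3.0%), so the customer base is shrinking.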
Many reasons are offered to explain customer churn. These include poor customer satisfaction, financial issues, customers not feeling appreciated, and customers' need for a change. Understandably, companies have no absolute control over churn, but they can work to minimize it by addressing the causes they do have control over.
As a data scientist, your role is to assist these companies by building a churn model, tailored to the company's goals and expectations, that predicts customer churn. Because data that meets a company's specific needs is often unavailable, designing an effective churn model can be challenging.
However, we will make do with sample data for a fictional telecommunications company. Membership-based businesses that provide subscription services are the ones most affected by customer churn. The data, sourced from the IBM Developer Platform, is available on my GitHub page.
The dataset has 7043 rows and 21 columns, comprising 17 categorical features, 3 numerical features, and the prediction target. Check my GitHub page for more information about the dataset.
Data Preprocessing
This step makes the data suitable for machine learning. We will start by getting an overview of the dataset.
import pandas as pd

df = pd.read_csv('churn.csv')

# get the shape of the dataset
df.shape
# (7043, 21)

# print the columns
df.columns
'''
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')
'''

# check for missing values
df.isna().sum()
'''
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
'''

# check for duplicates
df.customerID.nunique()
# 7043
Next, we drop the customerID column, which is only there for identification purposes.
df.drop(['customerID'], axis=1, inplace=True)
Here, axis=1 tells pandas to drop a column rather than a row, and inplace=True applies the change directly to the dataset instead of returning a modified copy.
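To see what the two parameters do, here is a small sketch on a toy DataFrame. The rows are hypothetical values made up for illustration; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values that mimics a few columns of the dataset.
toy = pd.DataFrame({
    "customerID": ["0001-A", "0002-B"],
    "tenure": [1, 34],
    "Churn": ["No", "Yes"],
})

# Without inplace=True, drop() returns a new DataFrame and leaves `toy` untouched.
trimmed = toy.drop(["customerID"], axis=1)
print(list(toy.columns))      # ['customerID', 'tenure', 'Churn']
print(list(trimmed.columns))  # ['tenure', 'Churn']

# With inplace=True, `toy` itself is modified and drop() returns None.
result = toy.drop(["customerID"], axis=1, inplace=True)
print(result)                 # None
print(list(toy.columns))      # ['tenure', 'Churn']
```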
If you take a look at the dataset using the head() method, you will notice that many features, including the target, hold the values Yes and No. We will transform them to 1 and 0 using LabelEncoder from the scikit-learn library. We will also do the same with columns that have more than two categories.
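from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# select the columns stored as strings (object dtype)
obj = (df.dtypes == 'object')

# replace each string column with integer codes
for col in list(obj[obj].index):
    df[col] = label_encoder.fit_transform(df[col])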
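It may help to see which integer each category receives. The sketch below runs LabelEncoder on two short, hand-written lists rather than on the real columns; the values are in the style of the Churn and Contract features and are assumptions for illustration.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Binary column: classes are sorted alphabetically, so "No" -> 0 and "Yes" -> 1.
churn_codes = encoder.fit_transform(["No", "Yes", "No", "Yes"])
print(encoder.classes_.tolist(), churn_codes.tolist())

# Multi-category column with values in the style of the Contract feature.
contract_codes = encoder.fit_transform(["One year", "Month-to-month", "Two year"])
print(encoder.classes_.tolist(), contract_codes.tolist())
```

Because the codes follow alphabetical order, a multi-category column ends up with an arbitrary numeric ordering. Tree-based models usually cope with that, but for other models one-hot encoding may be worth trying.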
Model Building
It's now time to train models on our data using machine learning algorithms. As we don't know which model will perform well on our dataset, we will first test several different models.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

# separate the features from the target
X = df.drop(['Churn'], axis=1)
Y = df.Churn

# hold out 20% of the rows for testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)

models = [LogisticRegression(), RandomForestClassifier(), AdaBoostClassifier(),
          SVC(), DecisionTreeClassifier(), KNeighborsClassifier(),
          GaussianNB(), ExtraTreesClassifier(), LinearDiscriminantAnalysis(),
          GradientBoostingClassifier(), XGBClassifier()]

# fit the scaler on the training data, then apply the same scaling to the test data
scaler = StandardScaler()
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

for model in models:
    model.fit(rescaledX_train, Y_train.values)
    preds = model.predict(rescaledX_test)
    results = accuracy_score(Y_test, preds)
    print(f'{results}')

# Scores reported for the original run, which predicted on the unscaled
# test set; a rerun of the code above will print different values.
'''
0.2753726046841732
0.7388218594748048
0.7388218594748048
0.7388218594748048
0.2753726046841732
0.26330731014904185
0.47906316536550747
0.27324343506032645
0.7388218594748048
0.30376153300212916
0.6593328601845281
0.7402413058907026
'''
The results show that XGBoost performed better than the other models on this dataset. Therefore, we will use XGBoost as our machine learning algorithm to predict customer churn.
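When a scaler is part of the comparison, the test features need the same transformation as the training features, otherwise the scores are hard to trust. One way to keep the two in step is a scikit-learn pipeline. This is a minimal sketch, assuming synthetic data from make_classification in place of the churn dataset and logistic regression as an arbitrary example model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the churn data (not the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# The pipeline fits the scaler on the training data only and reuses the same
# scaling when predicting, so train and test features stay on the same scale.
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

preds = pipeline.predict(X_test)
score = accuracy_score(y_test, preds)
print(f"Accuracy: {score:.3f}")
```

Any of the classifiers in the list above could take the place of LogisticRegression in the pipeline.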
Tuning XGBoost
The XGBoost algorithm achieved a 74% accuracy score. Can it do better? Let's try tuning the model using learning curves. To understand what we mean by a learning curve, please read this article.
# define the model
# (eval_metric is set on the estimator; recent XGBoost releases
# no longer accept it as an argument to fit())
model = XGBClassifier(eval_metric='logloss')

# define the datasets to evaluate each iteration
evalset = [(X_train, Y_train), (X_test, Y_test)]

# fit the model
model.fit(X_train, Y_train, eval_set=evalset)

# evaluate performance
preds = model.predict(X_test)
score = accuracy_score(Y_test, preds)
print(f'Accuracy: {round(score*100, 1)}%')
# Accuracy: 77.9%
Wow, the model has improved to a 77.9% accuracy score. Can it do even better? Let's increase the number of boosting iterations from 100 (the default) to 200 and reduce the eta hyperparameter, which is the learning rate, from its default of 0.3 to 0.05 so that the model learns more slowly.
# more boosting rounds, smaller learning rate
model = XGBClassifier(n_estimators=200, eta=0.05, eval_metric='logloss')

# fit the model
model.fit(X_train, Y_train, eval_set=evalset)

# evaluate performance
preds = model.predict(X_test)
score = accuracy_score(Y_test, preds)
print(f'Accuracy: {round(score*100, 1)}%')
# Accuracy: 78.6%
We will stop here. Of course, we could go on tuning the model to achieve a higher score, but an accuracy score of 78.6% is not bad.
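A learning curve here means the loss on the training and test sets after each boosting iteration. With XGBoost, the values recorded through eval_set can be read back with model.evals_result() after fitting. The sketch below shows the same idea using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and synthetic data in place of the churn dataset, so the numbers it prints say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (not the real dataset).
X, y = make_classification(n_samples=600, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Stand-in booster: 200 trees with a small learning rate, mirroring the
# n_estimators=200 / eta=0.05 settings used above.
booster = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, random_state=7
)
booster.fit(X_train, y_train)

# staged_predict_proba yields the predicted probabilities after each tree,
# which gives one log loss value per boosting iteration.
train_curve = [log_loss(y_train, p) for p in booster.staged_predict_proba(X_train)]
test_curve = [log_loss(y_test, p) for p in booster.staged_predict_proba(X_test)]

best_iteration = min(range(len(test_curve)), key=test_curve.__getitem__)
print(f"Lowest test log loss {test_curve[best_iteration]:.3f} "
      f"at iteration {best_iteration + 1} of {len(test_curve)}")
```

If the test curve bottoms out well before the last iteration while the training curve keeps falling, the extra iterations are mostly overfitting. Plotting both lists against the iteration number makes this easy to spot.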
Create a new folder and save the following to a file named model.py.
# Import libraries
import pickle

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

df = pd.read_csv('churn.csv')

# Drop customerID
df.drop(['customerID'], axis=1, inplace=True)

# Convert to int datatype
label_encoder = LabelEncoder()
obj = (df.dtypes == 'object')
for col in list(obj[obj].index):
    df[col] = label_encoder.fit_transform(df[col])

X = df.drop(['Churn'], axis=1)
Y = df.Churn

# splitting the dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)

# eval_metric is set on the estimator; recent XGBoost releases
# no longer accept it as an argument to fit()
model = XGBClassifier(n_estimators=200, eta=0.05, eval_metric='logloss')

# define the datasets to evaluate each iteration
evalset = [(X_train, Y_train), (X_test, Y_test)]

# fit the model
model.fit(X_train, Y_train, eval_set=evalset)

# saving the trained model
with open('lg_model.pkl', 'wb') as f:
    pickle.dump(model, f)
Notice that we save the trained model as a pickle object to be used later. We want the model to run on a local Streamlit server, so we will create a Streamlit application for it. Create two more files called app.py and predict.py in your current folder. Check my GitHub page to see the full content of the files.
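The full app.py and predict.py are not reproduced here, but the save-and-load round trip they rely on can be sketched in a few lines. This is an assumption-laden illustration: it uses a tiny logistic regression trained on made-up numbers instead of the churn model, and a temporary file named demo_model.pkl instead of lg_model.pkl.

```python
import os
import pickle
import tempfile

from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set, only to have a fitted model to save.
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo = [0, 0, 1, 1]
demo_model = LogisticRegression().fit(X_demo, y_demo)

# Save the fitted model to disk (a temporary folder here).
model_path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(demo_model, f)

# Later, for example inside a prediction script, load it back and predict.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict([[0.5], [2.5]]))
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.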
Please remember to run model.py manually to generate the pickle file, as I won't be pushing it to GitHub. After running model.py, the accuracy was 80.4%, which suggests the model learned the data well.
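An accuracy figure is easier to judge next to a simple baseline: the score obtained by always predicting the most common class. The sketch below computes that baseline for a hypothetical list of labels. The 73/27 split is made up for illustration and is not taken from the dataset.

```python
from collections import Counter


def majority_baseline_accuracy(labels: list) -> float:
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical test labels: 0 = stayed, 1 = churned (not taken from the dataset).
y_true = [0] * 73 + [1] * 27
baseline = majority_baseline_accuracy(y_true)
print(f"Majority-class baseline: {baseline:.1%}")  # 73.0%
```

Running the same function on Y_test would show how much of the model's accuracy goes beyond simply guessing that nobody churns.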
Conclusion
In this tutorial, we created a customer churn prediction app to help businesses deal with one of the challenges they face. We used the XGBoost algorithm to train on the data and generate the model. There are many things we didn't do, including data visualization, feature engineering, and dealing with imbalanced classification.
You may wish to try them out and see if they can improve the model's performance. Unfortunately, I wasn't able to deploy the app because I couldn't push the heavy pickle file to GitHub. Try pushing yours and then deploying it on Streamlit Cloud. Alright, enjoy your day.