How I Created a Customer Churn Prediction App to Help Businesses

Many businesses will agree that it takes a lot more time, money, and resources to get new customers than to keep existing ones. Hence, they are very much interested in knowing how many existing customers are leaving their business. This is known as churn.

Churn tells business owners how many customers are no longer using their products and services. It can also refer to the rate at which revenue is lost as a result of customers or employees leaving the company. The churn rate gives companies an idea of business performance. If the churn rate is higher than the growth rate, the business is shrinking.
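
For a quick sense of the arithmetic, churn rate is usually computed as the number of customers lost over a period divided by the number of customers at the start of that period. Here is a minimal sketch with hypothetical numbers:

# hypothetical example: 1,000 customers at the start of the quarter,
# 50 of whom cancel before it ends
customers_at_start = 1000
customers_lost = 50

churn_rate = customers_lost / customers_at_start
print(f'Quarterly churn rate: {churn_rate:.1%}')  # Quarterly churn rate: 5.0%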

There are many reasons offered to explain customer churn, including poor customer satisfaction, financial issues, customers not feeling appreciated, and customers’ desire for a change. Understandably, companies have no absolute control over churn, but they can work to minimize the causes they do have control over.

As a data scientist, your role is to assist these companies by building a churn model tailored to the company’s goals and expectations. When little data is available to meet a company’s specific needs, designing an effective churn model becomes challenging.

In this tutorial, we will make do with sample data for a fictional telecommunications company. Subscription-based businesses, after all, are the ones most affected by customer churn. The data, sourced from the IBM Developer Platform, is available on my GitHub page.

The dataset has 7043 rows and 21 columns, comprising 17 categorical features, 3 numerical features, and the target feature. Check my GitHub page for more information about the dataset.

Data Preprocessing

This step makes the data suitable for machine learning. We will start by getting an overview of the dataset.

import pandas as pd
df = pd.read_csv('churn.csv')

# get the shape of the dataset
df.shape
(7043, 21)

# print the columns
df.columns
Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

# check for missing values

df.isna().sum()
'''
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
'''

# check for duplicates: every customerID should be unique
df.customerID.nunique()
7043

Next, we drop the customerID column, which is only there for identification purposes.

df.drop(['customerID'], axis=1, inplace=True)

The axis=1 argument tells pandas to drop a column rather than a row, and inplace=True applies the change directly to the DataFrame instead of returning a copy.
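
Equivalently, newer versions of pandas accept a columns keyword, which reads a little more explicitly (shown here as an alternative to the inplace call above, had we not already dropped the column):

# equivalent alternative to df.drop(['customerID'], axis=1, inplace=True)
df = df.drop(columns=['customerID'])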

If you take a look at the dataset using the head() method, you will notice that many features, including the target feature, have rows with values of Yes and No. We will transform them to 0 and 1 using LabelEncoder from the scikit-learn library, and do the same for columns with more than two categories.

from sklearn.preprocessing import LabelEncoder

# encode every object (string) column as integers
label_encoder = LabelEncoder()
obj = (df.dtypes == 'object')
for col in list(obj[obj].index):
    df[col] = label_encoder.fit_transform(df[col])
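
To sanity-check the encoding, we can inspect the target column. LabelEncoder assigns labels alphabetically, so No becomes 0 and Yes becomes 1:

# the target should now contain only 0 (No) and 1 (Yes)
print(df['Churn'].unique())
# e.g. [0 1]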

Model Building

It’s now time to train some machine learning models on our data. Since we don’t know in advance which model will perform best on this dataset, we will first test several of them.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

X = df.drop(['Churn'], axis=1)
Y = df.Churn

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)

models = [LogisticRegression(), RandomForestClassifier(), AdaBoostClassifier(),
          SVC(), DecisionTreeClassifier(), KNeighborsClassifier(), GaussianNB(),
          ExtraTreesClassifier(), LinearDiscriminantAnalysis(),
          GradientBoostingClassifier(), XGBClassifier()]

# fit the scaler on the training set and apply the same scaling to the test set
scaler = StandardScaler()
rescaledX = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

for model in models:
    model.fit(rescaledX, Y_train.values)
    preds = model.predict(rescaledX_test)
    results = accuracy_score(Y_test, preds)
    print(f'{results}')

'''
0.2753726046841732
0.7388218594748048
0.7388218594748048
0.7388218594748048
0.2753726046841732
0.26330731014904185
0.47906316536550747
0.27324343506032645
0.7388218594748048
0.30376153300212916
0.6593328601845281
0.7402413058907026
'''
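
The scores above print without labels, so it is easy to lose track of which model produced which number. A small variation of the loop (a sketch, not what I originally ran) prints each model’s class name next to its score:

for model in models:
    model.fit(rescaledX, Y_train.values)
    preds = model.predict(rescaledX_test)
    # type(model).__name__ is the class name, e.g. 'LogisticRegression'
    print(f'{type(model).__name__}: {accuracy_score(Y_test, preds):.4f}')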

The results show that XGBoost performed better than the other models on this dataset. Therefore, we will use XGBoost as our machine learning algorithm to predict customer churn.

Tuning XGBoost

The XGBoost algorithm achieved a 74% accuracy score. Can it do better? Let’s try tuning the model using learning curves. To understand what we mean by a learning curve, please read this article.


# define the model
model = XGBClassifier()

# define the datasets to evaluate each iteration
evalset = [(X_train, Y_train), (X_test, Y_test)]

# fit the model
model.fit(X_train, Y_train, eval_metric='logloss', eval_set=evalset)

# evaluate performance
preds = model.predict(X_test)
score = accuracy_score(Y_test, preds)

print(f'Accuracy: {round(score*100, 1)}%')
# Accuracy: 77.9%
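
Because we passed eval_set to fit(), XGBoost recorded the logloss on both datasets at every boosting round. Here is a minimal sketch for plotting those learning curves with matplotlib, assuming the fitted model above:

import matplotlib.pyplot as plt

# retrieve the logloss recorded at each boosting round
history = model.evals_result()
rounds = range(len(history['validation_0']['logloss']))

plt.plot(rounds, history['validation_0']['logloss'], label='train')
plt.plot(rounds, history['validation_1']['logloss'], label='test')
plt.xlabel('boosting round')
plt.ylabel('logloss')
plt.legend()
plt.show()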

The model has improved to a 77.9% accuracy score. Can it do better still? Let’s increase the number of boosting iterations from the default of 100 to 200 and reduce the eta hyperparameter from its default of 0.3 to 0.05 to slow down the learning rate.

model = XGBClassifier(n_estimators=200, eta=0.05)

# fit the model
model.fit(X_train, Y_train, eval_metric='logloss', eval_set=evalset)

# evaluate performance
preds = model.predict(X_test)
score = accuracy_score(Y_test, preds)

print(f'Accuracy: {round(score*100, 1)}%')
# Accuracy: 78.6%

This is as far as we will take the tuning here. Of course, we could keep going to chase a higher score, but an accuracy of 78.6% is not bad.
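
If you do want to keep going, a grid search over a few hyperparameters is a natural next step. Below is a minimal sketch using scikit-learn’s GridSearchCV; the parameter grid is an illustrative assumption, not a tested recommendation:

from sklearn.model_selection import GridSearchCV

# illustrative grid -- widen or narrow it based on your compute budget
param_grid = {
    'n_estimators': [100, 200, 300],
    'eta': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
}

grid = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X_train, Y_train)

print(grid.best_params_)
print(f'Best CV accuracy: {grid.best_score_:.3f}')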

Create a new folder and save the following to a file named model.py.

#Import libraries
import pandas as pd
from xgboost import XGBClassifier
import pickle
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('churn.csv')

# Drop customerID
df.drop(['customerID'], axis=1, inplace=True)

# Convert to int datatype

label_encoder = LabelEncoder()
obj = (df.dtypes == 'object')
for col in list(obj[obj].index):
    df[col] = label_encoder.fit_transform(df[col])


X = df.drop(['Churn'], axis=1)
Y = df.Churn

# splitting the dataset
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=7)


model = XGBClassifier(n_estimators=200, eta=0.05)


# define the datasets to evaluate each iteration
evalset = [(X_train, Y_train), (X_test, Y_test)]

# fit the model
model.fit(X_train, Y_train, eval_metric='logloss', eval_set=evalset)

# saving the trained model
pickle.dump(model, open('lg_model.pkl', 'wb'))

Notice that we save the trained model as a pickle object to be loaded later. We want the model to run on a local Streamlit server, so we will create a Streamlit application for it. Create two more files called app.py and predict.py in your current folder; check my GitHub page for their full contents.
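
To give a rough idea of what the Streamlit side looks like, here is a hedged sketch (not the actual contents of my files): it loads the pickled model and collects only two inputs, whereas the real app collects all 19 features, label-encoded exactly as in model.py.

import pickle

import pandas as pd
import streamlit as st

# load the model that model.py pickled earlier
model = pickle.load(open('lg_model.pkl', 'rb'))

# the 19 feature columns, in the order the model was trained on
FEATURES = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
            'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
            'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
            'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
            'MonthlyCharges', 'TotalCharges']

st.title('Customer Churn Prediction')

# illustrative inputs only: the real app asks for every feature
tenure = st.number_input('Tenure (months)', min_value=0, value=12)
monthly_charges = st.number_input('Monthly charges', min_value=0.0, value=70.0)

if st.button('Predict'):
    # start from an all-zero row and overwrite the collected inputs
    row = pd.DataFrame([[0] * len(FEATURES)], columns=FEATURES)
    row['tenure'] = tenure
    row['MonthlyCharges'] = monthly_charges
    prediction = model.predict(row)[0]
    st.write('This customer is likely to churn.' if prediction == 1
             else 'This customer is likely to stay.')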

Please remember to run model.py manually to generate the pickle file, as I won’t be pushing it to GitHub. After running model.py, the accuracy was 80.4%, showing that the model learned the data well.

Conclusion

In this tutorial, we created a customer churn prediction app to help businesses deal with some of the challenges facing them. We used XGBoost to train the model on the data. There is plenty we didn’t cover: data visualization, feature engineering, and handling class imbalance, among other things.

You may wish to try these out and see whether they improve the model’s performance. Unfortunately, I wasn’t able to deploy the app because I couldn’t push the heavy pickle file to GitHub. Try pushing yours and then deploying it on Streamlit Cloud. Alright, enjoy your day.