How I Built and Deployed a Python Loan Eligibility Prediction App on Streamlit

In this tutorial, I will walk you through a machine-learning project on Loan Eligibility Prediction with Python. Specifically, I will show you how to create and deploy machine learning web applications using Streamlit.

Streamlit makes it easy for data scientists with little or no knowledge of web development to develop and deploy machine learning apps quickly. Its compatibility with data science libraries makes it an excellent choice for data scientists looking to deploy their applications.

👉 You can try the live demo app here:

Prerequisites

Although I will try my best to explain the concepts and the steps I took in this project, I assume you already have a basic knowledge of Python and its application in machine learning.

For Streamlit, I will only explain the concepts that have a bearing on this project. If you want to know more, you can check the documentation.

Loan Eligibility Prediction

Banks and other financial institutions give out loans to people. But before they approve a loan, they have to make sure the applicant is eligible to receive it. There are many factors to consider before deciding whether or not an applicant is eligible, including, but not limited to, credit history and the applicant's income.

To automate the loan approval process, banks and other financial institutions require the applicant to fill in a form in which some personal information will be gathered. These include gender, education, credit history, and so on. An applicantā€™s loan request will either be approved or rejected based on such information.

In this project, we are going to build a Streamlit dashboard where our users will fill in their details and check if they are eligible for a loan or not. This is a classification problem. Hence, we will use machine learning with Python and a dataset containing information on customers' past loan applications to solve the problem. So, let's get started.

The Dataset

Let's load our dataset using the Pandas library.

import pandas as pd
data = pd.read_csv('LoanApprovalPrediction.csv')
data.shape
# (598, 13)

Our dataset contains 598 rows and 13 columns. Using the .info() method, we can get more information about the dataset.

data.info()

We can see all the columns that make up the dataset. If you view the first five rows using data.head(), you will notice that some columns are categorical but their datatypes are shown as object. More on this soon.
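For a quick look at exactly which columns came in as object, you can filter the dtypes directly (this is just an inspection aid, not part of the pipeline):

# list the columns Pandas read in as object (strings/categories)
print(data.dtypes[data.dtypes == 'object'])

Now, let's check if there are missing values.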

data.isna().sum()

Output:

Loan_ID               0
Gender                0
Married               0
Dependents           12
Education             0
Self_Employed         0
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           21
Loan_Amount_Term     14
Credit_History       49
Property_Area         0
Loan_Status           0
dtype: int64

Wow! Our dataset contains lots of missing values. We have a lot of data cleaning to do. Finally, let's check if our Loan_ID column contains duplicates.

data.Loan_ID.nunique()
# 598

Loan_ID has exactly as many unique values as the dataset has rows, so there are no duplicates.
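If you want an explicit check, Pandas can confirm this directly (a small aside, not required for the pipeline):

# True would mean at least one repeated Loan_ID
print(data.Loan_ID.duplicated().any())  # False

Since Loan_ID is just an identifier, we can safely drop it, as it will not be used for training.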

# Dropping Loan_ID column
data.drop(['Loan_ID'], axis=1, inplace=True)

By setting the inplace parameter to True, we apply the change directly to our DataFrame. The axis=1 argument tells Pandas we are dropping a column, not a row.
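If you prefer to avoid inplace, the equivalent assignment form does the same thing:

# equivalent, without mutating in place
data = data.drop(columns=['Loan_ID'])

It's now time to clean and prepare our dataset for training.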

Data Cleaning and Preparation

Seeing that our dataset contains many missing values, we have several options to choose from. It is either we drop the missing rows or we fill them up with a given value. To determine which action to take, letā€™s first check the total number of missing values.

data.isna().sum().sum()
# 96

The dataset contains 96 missing values, roughly 16% of the number of rows, a not-so-insignificant amount. I choose to fill them instead of dropping the affected rows. Let's fill them with the mean value of their respective columns.

Oh! We can't fill a number into a categorical column. So, we will first convert the categorical columns to the int datatype.

For this, we can choose to use Pandas' map function or use LabelEncoder from the Scikit-learn library.

If we use Pandas' map function, we will have to repeat the same process for every categorical column. If you are like me and don't like constant repetition (DRY), you will choose the second option.

This, though, does not diminish the usefulness of Pandas' map function. So, to show how it works and to add to your knowledge, let me apply it to our dataset.

data.Gender = data.Gender.map({'Female': 0, 'Male': 1})  # same codes LabelEncoder assigns (alphabetical)

With that, the Gender column gets converted to the int datatype. You would have to do the same for every categorical column. But since we want to convert all our categorical columns to integer codes in one go, we will take the easy route with LabelEncoder.

from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Boolean Series: True for columns with the object datatype
obj = (data.dtypes == 'object')
# encode each object column as integer codes
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])

We want to select only the columns with the object datatype. We start by creating a Boolean Series, obj, which is True for object-dtype columns. Then, obj[obj].index applies what is called a Boolean mask: it filters out only the object-dtype columns, and the loop converts each of them to integer codes.

You can confirm it using the .info() method and you will see that all our categorical columns have been converted to int datatype. Having done that, we can now fill in the missing values.

for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())

We fill the missing rows with the mean value of their respective columns. Again, you can confirm there are none left by typing data.isnull().sum() or data.isna().sum().
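One caveat worth knowing: taking the mean of a label-encoded categorical column can produce fractional codes (for example, 0.8 in a 0/1 column). An alternative, sketched below under the assumption that you apply it before label encoding, is to fill categorical columns with their most frequent value and only the numeric columns with the mean:

# alternative (run before encoding): mode for categoricals, mean for numerics
for col in data.select_dtypes(include='object').columns:
    data[col] = data[col].fillna(data[col].mode()[0])
for col in data.select_dtypes(exclude='object').columns:
    data[col] = data[col].fillna(data[col].mean())

Either way, the dataset ends up with no missing values.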

Model Training

It's now time to train selected models on our data. We will first divide our dataset into two: the features (x) and the target (y).

x = data.drop(['Loan_Status'], axis=1)
y = data.Loan_Status

We then split each of these into training and testing sets using train_test_split from Scikit-learn.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

We reserved 30% of our dataset for testing the model. By setting random_state to a fixed number, we ensure we get the same split whenever the code is run.
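If you want the split to preserve the ratio of approved to rejected loans, train_test_split also accepts a stratify argument; this variation is optional (the accuracy figures below were produced with the plain split above):

# optional: keep the class balance identical in train and test
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=7, stratify=y)
print(y_train.value_counts(normalize=True))

It's now time to select a model.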

We donā€™t know what algorithm or model will do well on our dataset. For this reason, we will test our data with different models and select the model with the highest accuracy score.

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = []
models.append(('LR', LogisticRegression(max_iter=1000)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVC', SVC()))
models.append(('RC', RidgeClassifier()))
models.append(('RF', RandomForestClassifier()))


def modeling(model):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return accuracy_score(y_test, y_pred) * 100
     
for name, model in models:
    print(f'{name} = {modeling(model)}')
     
Output:

LR = 80.83333333333333
LDA = 82.5
KNN = 63.74999999999999
CART = 68.33333333333333
NB = 81.66666666666667
SVC = 69.16666666666667
RC = 82.91666666666667
RF = 81.66666666666667

The results show that the Ridge Classifier performs best among the models, followed by Linear Discriminant Analysis with only a slight difference. Both could benefit from further tuning.
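Because a single train/test split can be noisy, you could double-check this ranking with k-fold cross-validation. Here is a minimal sketch using the models list from above:

from sklearn.model_selection import cross_val_score

# average accuracy over 5 folds gives a more stable comparison
for name, model in models:
    scores = cross_val_score(model, x, y, cv=5, scoring='accuracy')
    print(f'{name}: mean={scores.mean():.3f}, std={scores.std():.3f}')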

However, we will use the Ridge Classifier algorithm.

Here is the full training code. Save the script as model.py:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
import pickle


# load the data
data = pd.read_csv('LoanApprovalPrediction.csv')
# Drop Loan_ID column
data.drop(['Loan_ID'], axis=1, inplace=True)
# convert to int datatype
label_encoder = LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])

# fill in missing rows
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())
# divide model into features and target variable
x = data.drop(['Loan_Status'], axis=1)
y = data.Loan_Status

# divide into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)
# define the model
model = RidgeClassifier()
# fit the model on the training data
model.fit(x_train, y_train)
# save the trained model
with open('train_model.pkl', mode='wb') as pkl:
    pickle.dump(model, pkl)

By saving our model in a pickle file, we can easily load it later to make predictions, saving ourselves the wait for the model to be retrained every time the app runs.
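If you want to confirm the pickled model round-trips correctly, a quick sanity check (run at the bottom of model.py, where x_test and y_test are still in scope) is:

# reload the pickled model and score it on the held-out test set
with open('train_model.pkl', 'rb') as pkl:
    loaded = pickle.load(pkl)
print(accuracy_score(y_test, loaded.predict(x_test)) * 100)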

Preparing Streamlit Dashboard

Now that we are done training our model, let's prepare the Streamlit interface. We will start by defining our main function. Since we also want it to run when the Streamlit app opens, we will call it under the __name__ variable at the bottom of the script (shown at the end of this section). Save this script as app.py:

import streamlit as st

def main():
    bg = """<div style='background-color:black; padding:13px'>
              <h1 style='color:white'>Streamlit Loan Eligibility Prediction App</h1>
       </div>"""
    st.markdown(bg, unsafe_allow_html=True)

    left, right = st.columns((2,2))
    gender = left.selectbox('Gender', ('Male', 'Female'))
    married = right.selectbox('Married', ('Yes', 'No'))
    dependent = left.selectbox('Dependents', ('None', 'One', 'Two', 'Three'))
    education = right.selectbox('Education', ('Graduate', 'Not Graduate'))
    self_employed = left.selectbox('Self-Employed', ('Yes', 'No'))
    applicant_income = right.number_input('Applicant Income')
    coApplicantIncome = left.number_input('Coapplicant Income')
    loanAmount = right.number_input('Loan Amount')
    loan_amount_term = left.number_input('Loan Tenor (in months)')
    creditHistory = right.number_input('Credit History', 0.0, 1.0)
    propertyArea = st.selectbox('Property Area', ('Semiurban', 'Urban', 'Rural'))
    button = st.button('Predict')


    # if button is clicked
    if button:
        # make prediction
        result = predict(gender, married, dependent, education, self_employed, applicant_income,
                        coApplicantIncome, loanAmount, loan_amount_term, creditHistory, propertyArea)
        st.success(f'You are {result} for the loan')

We imported the Streamlit library. Then, we styled the header with HTML tags. Streamlit does not render raw HTML by default, so we set the unsafe_allow_html parameter to True; without it, the black banner would not appear.

We then displayed several number inputs and select boxes to gather the data from our users that will, in turn, be used to make predictions.

Notice that the widgets mirror the exact columns found in the dataset, including their categorical values. Since we have already transformed the categorical columns to int datatypes, you may have to reload the raw dataset and use the .value_counts() method on each column to see the original categories.
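For example, running this in a notebook or Python shell recovers the original labels to use in the select boxes:

import pandas as pd

# reload the raw CSV to see the original category labels
raw = pd.read_csv('LoanApprovalPrediction.csv')
print(raw['Property_Area'].value_counts())
print(raw['Education'].value_counts())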

Let's now define our predict() function.

import pickle

# load the trained model
with open('train_model.pkl', 'rb') as pkl:
    train_model = pickle.load(pkl)


def predict(gender, married, dependent, education, self_employed, applicant_income,
           coApplicantIncome, loanAmount, loan_amount_term, creditHistory, propertyArea):
    # processing user input
    # LabelEncoder assigns codes alphabetically, so we mirror that ordering here
    gen = 1 if gender == 'Male' else 0          # Female=0, Male=1
    mar = 1 if married == 'Yes' else 0          # No=0, Yes=1
    dep = float(0 if dependent == 'None' else 1 if dependent == 'One' else 2 if dependent == 'Two' else 3)
    edu = 0 if education == 'Graduate' else 1   # Graduate=0, Not Graduate=1
    sem = 1 if self_employed == 'Yes' else 0    # No=0, Yes=1
    pro = 1 if propertyArea == 'Semiurban' else 2 if propertyArea == 'Urban' else 0  # Rural=0, Semiurban=1, Urban=2
    Lam = loanAmount / 1000        # LoanAmount is recorded in thousands in the dataset
    cap = coApplicantIncome / 1000
    # making predictions
    prediction = train_model.predict([[gen, mar, dep, edu, sem, applicant_income, cap,
                                       Lam, loan_amount_term, creditHistory, pro]])
    verdict = 'Not Eligible' if prediction[0] == 0 else 'Eligible'
    return verdict

The predict() function takes one parameter for each feature of our dataset. We used ternary expressions to map each user input to the same integer codes LabelEncoder produced during training (LabelEncoder assigns codes in alphabetical order of the category labels). Notice that we converted the dep variable to a float. We did all this to ensure the inputs correspond to the datatypes and encodings in our dataset.

Also, we made sure that the order of the values passed to train_model.predict() corresponds to the order of the feature columns in the training data, and that the function parameters match the order used in main(). Anything contrary will either lead to an error or to poor predictions.
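If you want extra protection against ordering mistakes, one option (a sketch, not part of the original app) is to build the input as a single-row DataFrame whose column names match the training features; because we trained on a DataFrame, recent versions of scikit-learn will warn when the names do not line up:

import pandas as pd

# build the model input with explicit column names matching the training data
features = pd.DataFrame([{
    'Gender': gen, 'Married': mar, 'Dependents': dep, 'Education': edu,
    'Self_Employed': sem, 'ApplicantIncome': applicant_income,
    'CoapplicantIncome': cap, 'LoanAmount': Lam,
    'Loan_Amount_Term': loan_amount_term, 'Credit_History': creditHistory,
    'Property_Area': pro,
}])
prediction = train_model.predict(features)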

Why did we divide loanAmount and coApplicantIncome by 1,000? Well, I will leave that to you to answer. Just to give you a little hint, type data.LoanAmount.describe() and see if you can figure it out yourself.
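One last piece before deploying: recall the __name__ variable mentioned earlier. Add this entry point at the very bottom of app.py, after predict() is defined, so main() runs when Streamlit executes the script:

# run the app
if __name__ == '__main__':
    main()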

Conclusion

With this, we come to the end of this tutorial.

You have learned how to apply machine learning to a classification problem such as loan prediction.

You also learned how to create an interactive dashboard using Streamlit. Now, to deploy it on Streamlit Cloud so that others can use it, sign up on Streamlit and GitHub if you haven't done so.

Check my GitHub page for the full code. Create a repository and deploy it to Streamlit Cloud. You can view my live demo app here. In a future article, I will show you how to use machine learning to solve a regression problem. Alright, have a nice day.