In this tutorial, I will walk you through a machine-learning project on Loan Eligibility Prediction with Python. Specifically, I will show you how to create and deploy machine learning web applications using Streamlit.
Streamlit makes it easy for data scientists with little or no knowledge of web development to develop and deploy machine learning apps quickly. Its compatibility with data science libraries makes it an excellent choice for data scientists looking to deploy their applications.
You can try the live demo app here:
Prerequisites

Although I will try my best to explain some concepts and the steps I took in this project, I assume you already have basic knowledge of Python and its application in machine learning.
For Streamlit, I will only explain the concepts that have a bearing on this project. If you want to know more, you can check the documentation.
Loan Eligibility Prediction

Banks and other financial institutions give out loans to people. But before they approve a loan, they have to make sure the applicant is eligible to receive it. There are many factors to consider before deciding whether or not an applicant is eligible. Such factors include, but are not limited to, credit history and the applicant's income.
To automate the loan approval process, banks and other financial institutions require the applicant to fill in a form in which some personal information will be gathered. These include gender, education, credit history, and so on. An applicant's loan request will either be approved or rejected based on such information.
In this project, we are going to build a Streamlit dashboard where our users will fill in their details and check if they are eligible for a loan or not. This is a classification problem. Hence, we will use machine learning with Python and a dataset containing information on customers' past transactions to solve the problem. So, let's get started.
The Dataset
Let's load our dataset using the Pandas library.
import pandas as pd

data = pd.read_csv('LoanApprovalPrediction.csv')
data.shape  # (598, 13)
Our dataset contains 598 rows and 13 columns. Using the .info() method, we can get more information about the dataset.
data.info()
We can see all the columns that make up the dataset. If you view the first five rows using data.head(), you will notice that some columns are categorical but their datatypes are shown as object. More on this soon.
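To see exactly which columns those are, here is a minimal check (a sketch, assuming data is loaded as above):

# list the columns Pandas currently stores as object (i.e., strings)
object_cols = data.dtypes[data.dtypes == 'object'].index.tolist()
print(object_cols)

Now, let's check if there are missing values.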
data.isna().sum()
Output:
Loan_ID 0
Gender 0
Married 0
Dependents 12
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 21
Loan_Amount_Term 14
Credit_History 49
Property_Area 0
Loan_Status 0
dtype: int64
Wow! Our dataset contains lots of missing values. We have a lot of data cleaning to do. Finally, let's check if our Loan_ID column contains duplicates.
data.Loan_ID.nunique() # 598
Loan_ID has exactly as many unique values as there are rows, so it contains no duplicates. We can safely drop it, as it will not be used for training.
# Dropping the Loan_ID column
data.drop(['Loan_ID'], axis=1, inplace=True)
By setting the inplace parameter to True, we apply the change directly to our DataFrame rather than creating a modified copy. The axis=1 parameter tells Pandas to operate on columns rather than rows.
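If you prefer not to mutate the DataFrame in place, the same drop can be written as a reassignment; a minimal sketch:

# equivalent to the inplace version: drop() returns a new DataFrame,
# which we assign back to the same variable
data = data.drop(['Loan_ID'], axis=1)

It's now time to clean and prepare our dataset for training.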
Data Cleaning and Preparation

Seeing that our dataset contains many missing values, we have several options to choose from: we can either drop the rows with missing values or fill them with a given value. To determine which action to take, let's first check the total number of missing values.
data.isna().sum().sum() # 96
The dataset contains 96 missing values, roughly 16% of the number of rows, a not-so-insignificant amount. I choose to fill them instead of dropping them. Let's fill them with the mean value of their respective columns.
Oh! We can't fill a number into a categorical column. So, we will first convert the categorical columns to the int datatype.
For this, we can choose to use Pandas' map function or LabelEncoder from the Scikit-learn library. If we use Pandas' map function, we will repeat the same process for every categorical column. If you are like me and don't like constant repetition (DRY: Don't Repeat Yourself), you will choose the second option.
This, though, does not rule out the usefulness of Pandas' map function. Therefore, to show how it works and to add to your knowledge, let me show you how to apply it to our dataset.
data.Gender = data.Gender.map({'Male': 0, 'Female':1})
With that, the Gender column gets converted to the int datatype. You would have to repeat this for every categorical column involved. But since we are converting all our categorical columns to numeric labels anyway, we will take the easier route using LabelEncoder.
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])
We want to select only the columns with the object datatype. We start by creating a Boolean Series, obj, which is True for object-datatype columns. We then apply what is called a Boolean mask, obj[obj], which filters out only the columns with the object datatype so that each one is transformed to numeric labels in the loop.
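To make the Boolean mask concrete, here is a minimal standalone sketch of the same idea on a toy DataFrame:

import pandas as pd

# a toy frame with one string column and one numeric column
df = pd.DataFrame({'a': ['x', 'y'], 'b': [1, 2]})
mask = (df.dtypes == 'object')    # Boolean Series: a -> True, b -> False
print(mask[mask].index.tolist())  # masking keeps only the True entries: ['a']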
You can confirm it using the .info() method: all our categorical columns have now been converted to the int datatype. Having done that, we can fill in the missing values.
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())
We fill the missing rows with the mean value of their respective columns. Again, you can confirm it by typing data.isnull().sum() or data.isna().sum().
Model Training

It's now time to train our data using selected models. We will first divide our dataset into two: the features (x) and the target (y) variables.
x = data.drop(['Loan_Status'], axis=1)
y = data.Loan_Status
For each variable, we divide further into training and testing sets using train_test_split from Scikit-learn.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)
We reserved 30% of our dataset for testing the model. By setting random_state to a fixed number, we ensure we get the same split whenever the code is run.
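As a quick sanity check on the split, you can print the resulting shapes; a minimal sketch:

# with 598 rows, test_size=0.3, and 11 feature columns, this should
# print roughly (418, 11) (180, 11)
print(x_train.shape, x_test.shape)

It's now time to select a model.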
We don't know what algorithm or model will do well on our dataset. For this reason, we will test our data with different models and select the model with the highest accuracy score.
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

models = []
models.append(('LR', LogisticRegression(max_iter=1000)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVC', SVC()))
models.append(('RC', RidgeClassifier()))
models.append(('RF', RandomForestClassifier()))

def modeling(model):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return accuracy_score(y_test, y_pred) * 100

for name, model in models:
    print(f'{name} = {modeling(model)}')

Output:
LR = 80.83333333333333
LDA = 82.5
KNN = 63.74999999999999
CART = 68.33333333333333
NB = 81.66666666666667
SVC = 69.16666666666667
RC = 82.91666666666667
RF = 81.66666666666667
The result shows that the Ridge Classifier performs best, followed by Linear Discriminant Analysis with only a slight difference. Both could benefit from further study. We will proceed with the Ridge Classifier algorithm.
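Since a single train/test split can be noisy, one way to firm up such a comparison is k-fold cross-validation; a minimal sketch, assuming x and y are defined as above:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import RidgeClassifier

# averaging accuracy over 5 folds gives a more stable estimate
# than a single 70/30 split
scores = cross_val_score(RidgeClassifier(), x, y, cv=5, scoring='accuracy')
print(scores.mean() * 100)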
Here is the full code. Save the script as model.py:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score
import pickle

# load the data
data = pd.read_csv('LoanApprovalPrediction.csv')

# drop the Loan_ID column
data.drop(['Loan_ID'], axis=1, inplace=True)

# convert categorical columns to the int datatype
label_encoder = LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])

# fill in missing rows
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())

# divide the dataset into features and target variable
x = data.drop(['Loan_Status'], axis=1)
y = data.Loan_Status

# divide into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

# define the model
model = RidgeClassifier()

# fit the model on the training data
model.fit(x_train, y_train)

# save the trained model
with open('train_model.pkl', mode='wb') as pkl:
    pickle.dump(model, pkl)
By saving our model in a pickle file, we can easily load it later to make predictions, saving ourselves the time of waiting for the model to be retrained each time the app runs.
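To sanity-check the saved artifact, you can reload it and predict on a few held-out rows; a minimal sketch, assuming model.py has been run and x_test is still in scope:

import pickle

# reload the pickled model and score a few test rows
with open('train_model.pkl', 'rb') as pkl:
    loaded_model = pickle.load(pkl)
print(loaded_model.predict(x_test[:5]))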
Preparing Streamlit Dashboard

Now that we are done training our model, let's prepare the Streamlit interface. We will start by defining our main function. Since we also want it to run when we open the Streamlit app, we will call it under the __name__ == '__main__' guard at the bottom of the script. Save this script with the name app.py:
import streamlit as st

def main():
    bg = """<div style='background-color:black; padding:13px'>
    <h1 style='color:white'>Streamlit Loan Eligibility Prediction App</h1>
    </div>"""
    st.markdown(bg, unsafe_allow_html=True)
    left, right = st.columns((2, 2))
    gender = left.selectbox('Gender', ('Male', 'Female'))
    married = right.selectbox('Married', ('Yes', 'No'))
    dependent = left.selectbox('Dependents', ('None', 'One', 'Two', 'Three'))
    education = right.selectbox('Education', ('Graduate', 'Not Graduate'))
    self_employed = left.selectbox('Self-Employed', ('Yes', 'No'))
    applicant_income = right.number_input('Applicant Income')
    coApplicantIncome = left.number_input('Coapplicant Income')
    loanAmount = right.number_input('Loan Amount')
    loan_amount_term = left.number_input('Loan Tenor (in months)')
    creditHistory = right.number_input('Credit History', 0.0, 1.0)
    propertyArea = st.selectbox('Property Area', ('Semiurban', 'Urban', 'Rural'))
    button = st.button('Predict')
    # if the button is clicked
    if button:
        # make prediction
        result = predict(gender, married, dependent, education, self_employed,
                         applicant_income, coApplicantIncome, loanAmount,
                         loan_amount_term, creditHistory, propertyArea)
        st.success(f'You are {result} for the loan')
We imported the Streamlit library. Then, we added color using HTML tags. Since Streamlit escapes raw HTML by default, we set the unsafe_allow_html parameter to True so the markup is rendered; without it, the black background will not appear.
We then displayed several number inputs and select boxes to get data from our users, which will, in turn, be used to make predictions.
Notice that we used the exact fields found in the dataset, including their categories. Since we have already transformed the categorical columns to the int datatype, you may have to reload the dataset and use the .value_counts() method on each column to see the original categories.
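For instance, to recover the original categories of Property_Area, a minimal sketch on the raw CSV:

import pandas as pd

# reload the untouched CSV to see the original category labels
raw = pd.read_csv('LoanApprovalPrediction.csv')
print(raw.Property_Area.value_counts())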
Let's now define our predict() function.
import pickle

# load the trained model
with open('train_model.pkl', 'rb') as pkl:
    train_model = pickle.load(pkl)

def predict(gender, married, dependent, education, self_employed,
            applicant_income, coApplicantIncome, loanAmount,
            loan_amount_term, creditHistory, propertyArea):
    # processing user input
    gen = 0 if gender == 'Male' else 1
    mar = 0 if married == 'Yes' else 1
    dep = float(0 if dependent == 'None' else 1 if dependent == 'One'
                else 2 if dependent == 'Two' else 3)
    edu = 0 if education == 'Graduate' else 1
    sem = 0 if self_employed == 'Yes' else 1
    pro = 0 if propertyArea == 'Semiurban' else 1 if propertyArea == 'Urban' else 2
    Lam = loanAmount / 1000
    cap = coApplicantIncome / 1000
    # making predictions
    prediction = train_model.predict([[gen, mar, dep, edu, sem, applicant_income,
                                       cap, Lam, loan_amount_term,
                                       creditHistory, pro]])
    verdict = 'Not Eligible' if prediction == 0 else 'Eligible'
    return verdict

# run the app
if __name__ == '__main__':
    main()
The predict() function takes all the features of our dataset as parameters. Then, we used ternary operators to change the user input into numbers; notice that we converted the dep variable to a float. We did all this to ensure the values correspond to the datatypes in our dataset, and the numeric codes must match the ones LabelEncoder produced during training.
Also, we made sure that the order in which we placed our parameters, both in the function signature and in the list passed to the model, corresponds with the order in the main() function and in the training data. Anything contrary will either lead to an error or a poor prediction.
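One way to guard against ordering mistakes, shown here as a sketch rather than the author's method, is to build a one-row DataFrame whose columns are named and ordered exactly like the training features (this would live inside predict(), where gen, mar, and the other encoded values are defined):

import pandas as pd

# column names taken from the dataset listing above
feature_names = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
                 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
                 'Loan_Amount_Term', 'Credit_History', 'Property_Area']
row = pd.DataFrame([[gen, mar, dep, edu, sem, applicant_income, cap, Lam,
                     loan_amount_term, creditHistory, pro]], columns=feature_names)
prediction = train_model.predict(row)

Recent versions of Scikit-learn will also warn you if these names do not match the ones seen during training.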
Why did we divide loanAmount and coApplicantIncome by 1,000? Well, I will leave that to you to answer. Just to give you a little hint, type data.LoanAmount.describe() and see if you can figure it out yourself.
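If you want to chase that hint, a small sketch comparing the scales of the relevant columns in the raw CSV:

import pandas as pd

raw = pd.read_csv('LoanApprovalPrediction.csv')
# compare the typical magnitudes of the two columns
print(raw.LoanAmount.describe())
print(raw.CoapplicantIncome.describe())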
Conclusion

And that brings us to the end of this tutorial.
You have learned how to apply machine learning to a classification problem such as loan prediction.
You also learned how to create an interactive dashboard using Streamlit. Now, to deploy it on Streamlit Cloud so that others can use it, sign up on Streamlit and GitHub if you haven't done so.
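Streamlit Cloud installs your app's dependencies from a requirements.txt file at the root of the repository; a minimal sketch (version pins are up to you, and pinning is good practice):

streamlit
pandas
scikit-learn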
Check my GitHub page for the full code. Create a repository and deploy it to Streamlit Cloud. You can view my live demo app here. In a future article, I will show you how to use machine learning to solve a regression problem. Alright, have a nice day.