In this tutorial, I will take you through a machine learning project on House Price prediction with Python. We have previously learned how to solve a classification problem.
🌍 Recommended: How I Built and Deployed a Python Loan Eligibility Prediction App on Streamlit
Today, I will show you how to solve a regression problem and deploy it on Streamlit Cloud.
You can find an app prototype to try out here:
What Is Streamlit?
💡 Info: Streamlit is a popular choice for data scientists looking to deploy their apps quickly because it is easy to set up and is compatible with data science libraries. We are going to set up the dashboard so that when our users fill in some details, it will predict the price of a house.
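Just to give you a feel for how little code a Streamlit app needs, here is a minimal sketch that is not part of the project; the file name hello_app.py and the widget text are made up purely for illustration.

# hello_app.py - a tiny standalone Streamlit sketch (illustrative only)
import streamlit as st

st.title('Hello, Streamlit!')
name = st.text_input('What is your name?', value='World')
st.write(f'Nice to meet you, {name}.')

# run it from the terminal with:
#   streamlit run hello_app.py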
But you may wonder:
Why Is House Price Prediction Important?
Well, house prices are an important reflection of the economy. The price of a property is important in real estate transactions as it provides information to stakeholders, including real estate agents, investors, and developers, to enable them to make informed decisions.
Governments also use such information to formulate appropriate regulatory policies. Overall, it helps all parties involved to determine the selling price of a house. With such information, they will then decide when to buy or sell a house.
We will use machine learning with Python to try to predict the price of a house. Background knowledge of Python and its use in machine learning is a necessary prerequisite for this tutorial.
🌍 Recommended: Python Crash Course (Blog + Cheat Sheets)
To keep things simple, we will not be dealing with data visualization.
The Datasets
We will be using the 1990 California Housing dataset to make this prediction. You can get the dataset on Kaggle, or you can check my GitHub page. Let's load it using the Pandas library and find the number of rows and columns.
import pandas as pd

data = pd.read_csv('housing.csv')
print(data.shape)
# (20640, 10)
We can see the dataset has 20640 rows and 10 columns.
Let's get more information about the columns using the .info() method.
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
- The longitude indicates how far west a house is, while the latitude shows how far north it is.
- The housing_median_age indicates the median age of a building. A lower number tells us that the house is newly constructed.
- The total_rooms and total_bedrooms indicate the total number of rooms and bedrooms within a block.
- The population tells us the number of people within a block, while households tells us the total number of households (groups of people living in one home unit) within a block.
- The median_income is measured in tens of thousands of US Dollars. It shows the median income of the households living within a block.
- The median_house_value is also measured in US Dollars. It is the median house value for households living within a block.
- The ocean_proximity tells us how close to the sea a house is located.
Every column has 20640 non-null values except total_bedrooms, which indicates the presence of missing values. All columns are of float datatype except ocean_proximity, which is categorical even though it is shown as object. Let us first confirm this.
data.ocean_proximity.value_counts()
Output:
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
It is categorical. So, we have to convert ocean_proximity to an int datatype using LabelEncoder from the Scikit-learn library.
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])
Let’s check to confirm.
data.ocean_proximity.value_counts()
Output:
0 9136
1 6551
4 2658
3 2290
2 5
Name: ocean_proximity, dtype: int64
Take note of the way LabelEncoder ordered the values. We will apply this ordering when creating our Streamlit dashboard. Next, we fill in the missing values with the mean of their respective columns.
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())

print(data.isna().sum())
Output:
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
Having confirmed that there are no missing values, we can now proceed to the next step.
Standardizing the Data
If you glance at our data using the .head() method, you will observe that the columns are on very different scales. This will affect the model's ability to make accurate predictions. Hence, we will standardize our data using StandardScaler from Scikit-learn. Also, to prevent data leakage, we will make use of pipelines, so the scaler is fitted only on the training folds.
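The next section does this inside Pipelines, but to make the data-leakage point concrete, here is a small sketch of the pattern: the scaler learns its statistics from the training split only and then transforms both splits. The toy array below is made up purely for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# a tiny made-up feature matrix
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
X_tr, X_te = train_test_split(X, test_size=0.5, random_state=0)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training split only
print(scaler.transform(X_tr))         # roughly zero mean and unit variance
print(scaler.transform(X_te))         # test data scaled with the training statistics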
The Models
We have no idea which algorithm or model will perform well in this regression problem.
A test will be carried out on different algorithms using their default tuning parameters. Since this is a regression problem, we will use 10-fold cross-validation as our test harness and evaluate the models with the R² (R Squared) metric.
💡 Info: The R Squared metric is an indication of goodness of fit. It usually lies between 0 and 1, and the closer to 1 the better; a value of 1 means a perfect fit, while a negative value means the model performs worse than simply predicting the mean.
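If you want to get a feel for the metric, you can compute it directly with Scikit-learn's r2_score on a few made-up numbers:

from sklearn.metrics import r2_score

y_true = [100, 200, 300, 400]
print(r2_score(y_true, [100, 200, 300, 400]))  # 1.0  -> perfect fit
print(r2_score(y_true, [110, 190, 320, 380]))  # 0.98 -> very good fit
print(r2_score(y_true, [250, 250, 250, 250]))  # 0.0  -> no better than predicting the mean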
K-fold cross-validation works by splitting the dataset into several parts (10 folds in our case). The algorithm is trained repeatedly, each time with a different fold held back for testing. We chose this approach over a single train_test_split because it gives us a more accurate and reliable estimate, as the model is trained and evaluated repeatedly on different data.
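If you want to see what those folds actually look like, here is a tiny sketch with made-up indices; each fold takes its turn as the test set while the rest are used for training.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # ten toy samples
kfold = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    print(f'Fold {fold}: train={train_idx}, test={test_idx}')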
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# features and target as NumPy arrays
x = data.drop(['median_house_value'], axis=1).values
y = data.median_house_value.values

pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()), ('LR', LinearRegression())])))
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()), ('LASSO', Lasso())])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()), ('EN', ElasticNet())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsRegressor())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()), ('CART', DecisionTreeRegressor())])))
pipelines.append(('ScaledSVR', Pipeline([('Scaler', StandardScaler()), ('SVR', SVR())])))

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

def modeling(models):
    for name, model in models:
        kfold = KFold(n_splits=10)
        results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='r2')
        print(f'{name} = {results.mean()}')
Notice how we used a Pipeline to couple the StandardScaler with each model, so standardization happens inside each cross-validation fold. We then created a function that uses 10-fold cross-validation to repeatedly train and evaluate our models and prints the mean R² score for each.
modeling(pipelines)

Output:
ScaledLR = 0.6321641933826154
ScaledLASSO = 0.6321647820595134
ScaledEN = 0.4953062096224026
ScaledKNN = 0.7106787517028879
ScaledCART = 0.6207570733565403
ScaledSVR = -0.05047991785208246
The results show that KNN performed best on the scaled data. Let's see if we can improve the result by tuning KNN's parameters.
Tuning the Parameters
The default number of neighbors for KNeighborsRegressor is 5, and with it KNN already achieved a good result. We will conduct a grid search to identify which value of k yields an even greater score.
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler().fit(x_train)
rescaledx = scaler.transform(x_train)

k = list(range(1, 31))
param_grid = dict(n_neighbors=k)
model = KNeighborsRegressor()
kfold = KFold(n_splits=10)

grid = GridSearchCV(model, param_grid=param_grid, cv=kfold, scoring='r2')
grid_result = grid.fit(rescaledx, y_train)
print(f'Best: {grid_result.best_score_} using {grid_result.best_params_}')
# Best: 0.7242988300529242 using {'n_neighbors': 14}
The best value for k is 14, with a mean score of 0.7243, a slight improvement over the previous score.
Can we do better than this score? Yes, of course. I'm aiming for an R² of 0.80 and above. To get there, we will try ensemble methods.
Ensemble Methods
Let's see what we can achieve using four different ensemble machine learning algorithms. Everything other than the models remains the same.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor

# ensembles
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()), ('AB', AdaBoostRegressor())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()), ('GBM', GradientBoostingRegressor())])))
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()), ('RF', RandomForestRegressor())])))
ensembles.append(('ScaledET', Pipeline([('Scaler', StandardScaler()), ('ET', ExtraTreesRegressor())])))

kfold = KFold(n_splits=10)
for name, model in ensembles:
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='r2')
    print(f'{name} = {cv_results.mean()}')
Output:
ScaledAB = 0.3835320642243155
ScaledGBM = 0.772428054038791
ScaledRF = 0.81023174859107
ScaledET = 0.7978581384771901
The Random Forest Regressor achieved the highest score, and it is above the 0.80 we were aiming for. Therefore, we select the Random Forest Regressor algorithm to train the model and predict the price of a house. But can it do even better? Sure, given that we trained it using only the default parameters.
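If you want to chase a higher score, tuning the Random Forest is the natural next step. The sketch below follows the same grid-search pattern we used for KNN; the parameter values in the grid are only assumptions to illustrate the idea, not tested results, and the search can take a while on the full dataset.

from sklearn.model_selection import GridSearchCV

# an illustrative grid - the values are assumptions, not tuned results
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
}
scaler = StandardScaler().fit(x_train)
rescaledx = scaler.transform(x_train)
kfold = KFold(n_splits=10)
grid = GridSearchCV(RandomForestRegressor(), param_grid=param_grid, cv=kfold, scoring='r2')
grid_result = grid.fit(rescaledx, y_train)
print(f'Best: {grid_result.best_score_} using {grid_result.best_params_}')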
Here is the full code. Save it as model.py.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score
import pickle

data = pd.read_csv('housing.csv')

# select only 1000 rows
data = data[:1000]

# converting the categorical column to an int datatype
label_encoder = LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])

# filling in missing values
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())

# turning the data into NumPy arrays
x = data.drop(['median_house_value'], axis=1)
y = data.median_house_value
x = x.values
y = y.values

# dividing the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

# standardizing the data
stds = StandardScaler()
scaler = stds.fit(x_train)
rescaledx = scaler.transform(x_train)

# selecting and fitting the model for training
model = RandomForestRegressor()
model.fit(rescaledx, y_train)

# saving the trained model
pickle.dump(model, open('rf_model.pkl', 'wb'))

# saving the fitted StandardScaler
pickle.dump(stds, open('scaler.pkl', 'wb'))
We selected only 1000 rows to reduce the size of the pickled model.
Notice that we saved the fitted StandardScaler object to be used while creating the Streamlit dashboard. Since we scaled the training data, we also need to scale the input details from our users in the same way.
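Before moving on to the dashboard, you can sanity-check the two pickled files by loading them back and scoring the model on the held-out test set. This is just a quick check I would run in the same session (or script) as model.py, since it reuses x_test and y_test from there.

import pickle

# load the trained model and the fitted scaler back from disk
with open('rf_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
with open('scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

# x_test and y_test come from model.py
rescaled_x_test = loaded_scaler.transform(x_test)
print(loaded_model.score(rescaled_x_test, y_test))  # R squared on unseen data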
Streamlit Dashboard
It's now time to design our Streamlit app. Once again, we will try to keep things simple and avoid complex designs. Save the following code as app.py.
import streamlit as st
import pandas as pd
import pickle

def main():
    style = """<div style='background-color:pink; padding:12px'>
              <h1 style='color:black'>House Price Prediction App</h1>
          </div>"""
    st.markdown(style, unsafe_allow_html=True)

    left, right = st.columns((2, 2))
    longitude = left.number_input('Enter the Longitude in negative number', step=1.0, format='%.2f', value=-21.34)
    latitude = right.number_input('Enter the Latitude in positive number', step=1.0, format='%.2f', value=35.84)
    housing_median_age = left.number_input('Enter the median age of the building', step=1.0, format='%.1f', value=25.0)
    total_rooms = right.number_input('How many rooms are there in the house?', step=1.0, format='%.1f', value=56.0)
    total_bedrooms = left.number_input('How many bedrooms are there in the house?', step=1.0, format='%.1f', value=15.0)
    population = right.number_input('Population of people within a block', step=1.0, format='%.1f', value=250.0)
    households = left.number_input('Population of a household', step=1.0, format='%.1f', value=43.0)
    median_income = right.number_input('Median income of a household in Dollars', step=1.0, format='%.1f', value=3000.0)
    ocean_proximity = st.selectbox('How close to the sea is the house?',
                                   ('<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'))
    button = st.button('Predict')

    # if the button is pressed
    if button:
        # make prediction
        result = predict(longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
                         population, households, median_income, ocean_proximity)
        st.success(f'The value of the house is ${result}')
We imported Streamlit and the other libraries we need. Then we defined our main() function. We want it to be executed as soon as we open the app, so we will call the function using the __name__ variable at the very end of our script.
The unsafe_allow_html argument makes it possible for the HTML tags to be rendered by Streamlit.
With st.columns, we were able to display our input widgets side by side. We formatted each input to match the datatype of the corresponding column in our dataset. If the button is pressed, a callback, the predict() function, is executed.
🌍 Recommended: Streamlit Button — A Helpful Guide
Let's now define the predict() function.
# load the trained model
with open('rf_model.pkl', 'rb') as rf:
    model = pickle.load(rf)

# load the fitted StandardScaler
with open('scaler.pkl', 'rb') as stds:
    scaler = pickle.load(stds)

def predict(longitude, latitude, housing_median_age, total_rooms,
            total_bedrooms, population, households, median_income, ocean_pro):
    # processing user input: encode ocean_proximity the same way LabelEncoder did
    ocean = (0 if ocean_pro == '<1H OCEAN' else
             1 if ocean_pro == 'INLAND' else
             2 if ocean_pro == 'ISLAND' else
             3 if ocean_pro == 'NEAR BAY' else 4)
    med_income = median_income / 5
    lists = [longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
             population, households, med_income, ocean]
    df = pd.DataFrame(lists).transpose()

    # scaling the data with the saved StandardScaler
    scaled_df = scaler.transform(df)

    # making predictions using the trained model
    prediction = model.predict(scaled_df)
    result = int(prediction[0])
    return result

if __name__ == '__main__':
    main()
We started by loading the trained model and the StandardScaler we saved earlier.
In the predict() function, we use a ternary operator (a chained conditional expression) to turn the user's ocean_proximity input into a number.
Notice that we made sure it corresponds with the numbers assigned by LabelEncoder. If you are ever in doubt, use the .value_counts() method on the categorical column to confirm.
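A quick way to confirm the mapping is to print the encoder's classes right after fitting it, before the original strings are overwritten. This is a small sketch that assumes label_encoder is still the fitted encoder from the preprocessing step.

# mapping of the original categories to the encoded integers
mapping = dict(zip(label_encoder.classes_,
                   label_encoder.transform(label_encoder.classes_)))
print(mapping)
# {'<1H OCEAN': 0, 'INLAND': 1, 'ISLAND': 2, 'NEAR BAY': 3, 'NEAR OCEAN': 4}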
We divided the median_income by 5 since the corresponding column in our dataset is said to be in tens of thousands of Dollars. However, this may not be necessary given that the StandardScaler scales the data anyway. We did it just to be on the safe side.
By passing the list of inputs to pd.DataFrame() and transposing it, we turn the given inputs into a single-row DataFrame. We also made sure the order of the parameters in the predict() function corresponds to the order of the columns in the training data.
If the function seems to predict the same amount despite changes to the input details, you may want to check the correlation between the target variable and the features by typing data.corr().
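For example, sorting the correlations with the target makes it easy to see which columns matter most:

# correlation of every column with the target, strongest first
print(data.corr()['median_house_value'].sort_values(ascending=False))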
If we were to apply Recursive Feature Elimination (RFE) to select the features best able to predict the target variable, it would select just 4: longitude, latitude, median_income, and ocean_proximity. Let me show you what I mean.
from sklearn.feature_selection import RFE

model = RandomForestRegressor()
rfe = RFE(model)
fit = rfe.fit(x, y)

print(fit.n_features_)
# 4
print(fit.support_)
# array([ True,  True, False, False, False, False, False,  True,  True])
print(fit.ranking_)
# array([1, 1, 2, 6, 3, 5, 4, 1, 1])
Only 4 features carry most of the predictive power for the target variable. If you keep getting the same predicted amount, that may be the reason.
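If you wanted to act on that, here is a small sketch of keeping only the RFE-selected columns and retraining on them; whether it actually improves the score is something you would have to test yourself.

# reduce the feature matrix to the columns RFE selected
x_selected = fit.transform(x)
print(x_selected.shape)  # (number of rows, 4)

# train and evaluate on the reduced feature set
x_tr, x_te, y_tr, y_te = train_test_split(x_selected, y, test_size=0.3, random_state=7)
model = RandomForestRegressor()
model.fit(x_tr, y_tr)
print(model.score(x_te, y_te))  # R squared with only 4 features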
The purpose of this tutorial is purely educational, to demonstrate how to use Python to solve machine learning problems. I tried to keep things simple by not going through data visualization and feature engineering. Since the data is old, it should not be relied on when making important decisions.
We finally came to the end of the tutorial. Be sure to check my GitHub page to see the full project code.
To deploy on Streamlit Cloud, I assume you have already created a repository and added the required files. Then, create an account on Streamlit Cloud and input your repository URL. Streamlit will do the rest.
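One thing Streamlit Cloud needs is a requirements.txt file in the repository so it can install the app's dependencies. A minimal sketch (without version pins) could look like this:

streamlit
pandas
scikit-learn

It is a good idea to keep the scikit-learn version there close to the one you used to create the pickle files, since pickled models are not always compatible across library versions.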
I have already deployed mine on Streamlit Cloud. Alright, enjoy your day.
🌍 Recommended Project: How I Built and Deployed a Python Loan Eligibility Prediction App on Streamlit