In this tutorial, I will take you through a machine learning project on House Price prediction with Python. We have previously learned how to solve a classification problem.
🌍 Recommended: How I Built and Deployed a Python Loan Eligibility Prediction App on Streamlit
Today, I will show you how to solve a regression problem and deploy it on Streamlit Cloud.
You can find an app prototype to try out here:
What Is Streamlit?
💡 Info: Streamlit is a popular choice for data scientists looking to deploy their apps quickly because it is easy to set up and is compatible with data science libraries. We are going to set up the dashboard so that when our users fill in some details, it will predict the price of a house.
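Just to give you a feel for how little code a Streamlit app needs, here is a minimal sketch that is not part of the project; the file name hello_app.py and the widget text are made up purely for illustration.

# hello_app.py - a tiny standalone Streamlit sketch (illustrative only)
import streamlit as st

st.title('Hello, Streamlit!')
name = st.text_input('What is your name?', value='World')
st.write(f'Nice to meet you, {name}.')

# run it from the terminal with:
#   streamlit run hello_app.py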
But you may wonder:
Why Is House Price Prediction Important?
Well, house prices are an important reflection of the economy. The price of a property is important in real estate transactions as it provides information to stakeholders, including real estate agents, investors, and developers, to enable them to make informed decisions.
Governments also use such information to formulate appropriate regulatory policies. Overall, it helps all parties involved to determine the selling price of a house. With such information, they will then decide when to buy or sell a house.
We will use machine learning with Python to try to predict the price of a house. Background knowledge of Python and its use in machine learning is a necessary prerequisite for this tutorial.
🌍 Recommended: Python Crash Course (Blog + Cheat Sheets)
To keep things simple, we will not be dealing with data visualization.
The Datasets
We will be using the 1990 California Housing dataset to make this prediction. You can get the dataset on Kaggle, or you can check my GitHub page. Let's load it using the Pandas library and find the number of rows and columns.
import pandas as pd

data = pd.read_csv('housing.csv')
print(data.shape)
# (20640, 10)
We can see the dataset has 20640 rows and 10 columns.
Let's get more information about the columns using the .info() method.
data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 20640 non-null float64
1 latitude 20640 non-null float64
2 housing_median_age 20640 non-null float64
3 total_rooms 20640 non-null float64
4 total_bedrooms 20433 non-null float64
5 population 20640 non-null float64
6 households 20640 non-null float64
7 median_income 20640 non-null float64
8 median_house_value 20640 non-null float64
9 ocean_proximity 20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
- The longitude indicates how far west a house is, while the latitude shows how far north it is.
- The housing_median_age indicates the median age of a building. A lower number tells us that the house is newly constructed.
- The total_rooms and total_bedrooms indicate the total number of rooms and bedrooms within a block.
- The population tells us the number of people within a block, while households tells us the total number of households (groups of people living in one home unit) within a block.
- The median_income is measured in tens of thousands of US Dollars. It shows the median income of the households living within a block.
- The median_house_value is also measured in US Dollars. It is the median house value for households living within a block.
- The ocean_proximity tells us how close to the sea a house is located.
Every column has 20640 non-null values except total_bedrooms, which indicates the presence of missing values. All columns are of float datatype except ocean_proximity, which is categorical even though it is shown as object. Let us first confirm this.
data.ocean_proximity.value_counts()
Output:
<1H OCEAN 9136
INLAND 6551
NEAR OCEAN 2658
NEAR BAY 2290
ISLAND 5
Name: ocean_proximity, dtype: int64
It is categorical. So, we have to convert ocean_proximity to an int datatype using LabelEncoder from the Scikit-learn library.
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])
Let’s check to confirm.
data.ocean_proximity.value_counts()
Output:
0 9136
1 6551
4 2658
3 2290
2 5
Name: ocean_proximity, dtype: int64
Take note of the way LabelEncoder ordered the values. We will apply this ordering when creating our Streamlit dashboard. Next, we fill in the missing values with the mean of their respective columns.
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())

print(data.isna().sum())
Output:
longitude 0
latitude 0
housing_median_age 0
total_rooms 0
total_bedrooms 0
population 0
households 0
median_income 0
median_house_value 0
ocean_proximity 0
dtype: int64
Having confirmed that there are no missing values, we can now proceed to the next step.
Standardizing the Data
If you glance at our data using the .head() method, you will observe that the columns are on very different scales. This will affect the model's ability to make accurate predictions. Hence, we will standardize our data using StandardScaler from Scikit-learn. Also, to prevent data leakage, we will make use of pipelines, so the scaler is fitted only on the training folds.
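The next section does this inside Pipelines, but to make the data-leakage point concrete, here is a small sketch of the pattern: the scaler learns its statistics from the training split only and then transforms both splits. The toy array below is made up purely for illustration.

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# a tiny made-up feature matrix
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0], [4.0, 800.0]])
X_tr, X_te = train_test_split(X, test_size=0.5, random_state=0)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training split only
print(scaler.transform(X_tr))         # roughly zero mean and unit variance
print(scaler.transform(X_te))         # test data scaled with the training statistics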
The Models
We have no idea which algorithm or model will perform well in this regression problem.
A test will be carried out on different algorithms using their default tuning parameters. Since this is a regression problem, we will use 10-fold cross-validation as our test harness and evaluate the models with the R² (R Squared) metric.
💡 Info: The R Squared metric is an indication of goodness of fit. It usually lies between 0 and 1, and the closer to 1 the better; a value of 1 means a perfect fit, while a negative value means the model performs worse than simply predicting the mean.
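If you want to get a feel for the metric, you can compute it directly with Scikit-learn's r2_score on a few made-up numbers:

from sklearn.metrics import r2_score

y_true = [100, 200, 300, 400]
print(r2_score(y_true, [100, 200, 300, 400]))  # 1.0  -> perfect fit
print(r2_score(y_true, [110, 190, 320, 380]))  # 0.98 -> very good fit
print(r2_score(y_true, [250, 250, 250, 250]))  # 0.0  -> no better than predicting the mean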
K-fold cross-validation works by splitting the dataset into several parts (10 folds in our case). The algorithm is trained repeatedly, each time with a different fold held back for testing. We chose this approach over a single train_test_split because it gives us a more accurate and reliable estimate, as the model is trained and evaluated repeatedly on different data.
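If you want to see what those folds actually look like, here is a tiny sketch with made-up indices; each fold takes its turn as the test set while the rest are used for training.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # ten toy samples
kfold = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kfold.split(X), start=1):
    print(f'Fold {fold}: train={train_idx}, test={test_idx}')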
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, ElasticNet
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# features and target as NumPy arrays
x = data.drop(['median_house_value'], axis=1).values
y = data.median_house_value.values

pipelines = []
pipelines.append(('ScaledLR', Pipeline([('Scaler', StandardScaler()), ('LR', LinearRegression())])))
pipelines.append(('ScaledLASSO', Pipeline([('Scaler', StandardScaler()), ('LASSO', Lasso())])))
pipelines.append(('ScaledEN', Pipeline([('Scaler', StandardScaler()), ('EN', ElasticNet())])))
pipelines.append(('ScaledKNN', Pipeline([('Scaler', StandardScaler()), ('KNN', KNeighborsRegressor())])))
pipelines.append(('ScaledCART', Pipeline([('Scaler', StandardScaler()), ('CART', DecisionTreeRegressor())])))
pipelines.append(('ScaledSVR', Pipeline([('Scaler', StandardScaler()), ('SVR', SVR())])))

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

def modeling(models):
    for name, model in models:
        kfold = KFold(n_splits=10)
        results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='r2')
        print(f'{name} = {results.mean()}')
Notice how we used a Pipeline to couple the StandardScaler with each model, so standardization happens inside each cross-validation fold. We then created a function that uses 10-fold cross-validation to repeatedly train and evaluate our models and prints the mean R² score for each.
modeling(pipelines)

Output:
ScaledLR = 0.6321641933826154
ScaledLASSO = 0.6321647820595134
ScaledEN = 0.4953062096224026
ScaledKNN = 0.7106787517028879
ScaledCART = 0.6207570733565403
ScaledSVR = -0.05047991785208246
The results show that KNN performed best on the scaled data. Let's see if we can improve the result by tuning KNN's parameters.
Tuning the Parameters
The default number of neighbors for KNeighborsRegressor is 5, and with it KNN already achieved a good result. We will conduct a grid search to identify which value of k yields an even greater score.
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler().fit(x_train)
rescaledx = scaler.transform(x_train)

k = list(range(1, 31))
param_grid = dict(n_neighbors=k)
model = KNeighborsRegressor()
kfold = KFold(n_splits=10)

grid = GridSearchCV(model, param_grid=param_grid, cv=kfold, scoring='r2')
grid_result = grid.fit(rescaledx, y_train)
print(f'Best: {grid_result.best_score_} using {grid_result.best_params_}')
# Best: 0.7242988300529242 using {'n_neighbors': 14}
The best value for k is 14, with a mean score of 0.7243, a slight improvement over the previous score.
Can we do better than this score? Yes, of course. I'm aiming for an R² of 0.80 and above. To get there, we will try ensemble methods.
Ensemble Methods
Let's see what we can achieve using four different ensemble machine learning algorithms. Everything other than the models remains the same.
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor

# ensembles
ensembles = []
ensembles.append(('ScaledAB', Pipeline([('Scaler', StandardScaler()), ('AB', AdaBoostRegressor())])))
ensembles.append(('ScaledGBM', Pipeline([('Scaler', StandardScaler()), ('GBM', GradientBoostingRegressor())])))
ensembles.append(('ScaledRF', Pipeline([('Scaler', StandardScaler()), ('RF', RandomForestRegressor())])))
ensembles.append(('ScaledET', Pipeline([('Scaler', StandardScaler()), ('ET', ExtraTreesRegressor())])))

kfold = KFold(n_splits=10)
for name, model in ensembles:
    cv_results = cross_val_score(model, x_train, y_train, cv=kfold, scoring='r2')
    print(f'{name} = {cv_results.mean()}')
Output:
ScaledAB = 0.3835320642243155
ScaledGBM = 0.772428054038791
ScaledRF = 0.81023174859107
ScaledET = 0.7978581384771901
The Random Forest Regressor achieved the highest score, and it is above the 0.80 we were aiming for. Therefore, we select the Random Forest Regressor algorithm to train the model and predict the price of a house. But can it do even better? Sure, given that we trained it using only the default parameters.
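If you want to chase a higher score, tuning the Random Forest is the natural next step. The sketch below follows the same grid-search pattern we used for KNN; the parameter values in the grid are only assumptions to illustrate the idea, not tested results, and the search can take a while on the full dataset.

from sklearn.model_selection import GridSearchCV

# an illustrative grid - the values are assumptions, not tuned results
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [None, 10, 20],
}
scaler = StandardScaler().fit(x_train)
rescaledx = scaler.transform(x_train)
kfold = KFold(n_splits=10)
grid = GridSearchCV(RandomForestRegressor(), param_grid=param_grid, cv=kfold, scoring='r2')
grid_result = grid.fit(rescaledx, y_train)
print(f'Best: {grid_result.best_score_} using {grid_result.best_params_}')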
Here is the full code. Save it as model.py.
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, KFold, cross_val_score
import pickle

data = pd.read_csv('housing.csv')

# select only 1000 rows
data = data[:1000]

# converting the categorical column to an int datatype
label_encoder = LabelEncoder()
obj = (data.dtypes == 'object')
for col in list(obj[obj].index):
    data[col] = label_encoder.fit_transform(data[col])

# filling in missing values
for col in data.columns:
    data[col] = data[col].fillna(data[col].mean())

# turning the data into NumPy arrays
x = data.drop(['median_house_value'], axis=1)
y = data.median_house_value
x = x.values
y = y.values

# dividing the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=7)

# standardizing the data
stds = StandardScaler()
scaler = stds.fit(x_train)
rescaledx = scaler.transform(x_train)

# selecting and fitting the model for training
model = RandomForestRegressor()
model.fit(rescaledx, y_train)

# saving the trained model
pickle.dump(model, open('rf_model.pkl', 'wb'))

# saving the fitted StandardScaler
pickle.dump(stds, open('scaler.pkl', 'wb'))
We selected only 1000 rows to reduce the size of the pickled model.
Notice that we saved the fitted StandardScaler object to be used while creating the Streamlit dashboard. Since we scaled the training data, we also need to scale the input details from our users in the same way.
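Before moving on to the dashboard, you can sanity-check the two pickled files by loading them back and scoring the model on the held-out test set. This is just a quick check I would run in the same session (or script) as model.py, since it reuses x_test and y_test from there.

import pickle

# load the trained model and the fitted scaler back from disk
with open('rf_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
with open('scaler.pkl', 'rb') as f:
    loaded_scaler = pickle.load(f)

# x_test and y_test come from model.py
rescaled_x_test = loaded_scaler.transform(x_test)
print(loaded_model.score(rescaled_x_test, y_test))  # R squared on unseen data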
Streamlit Dashboard
It's now time to design our Streamlit app. Once again, we will try to keep things simple and avoid complex designs. Save the following code as app.py.
import streamlit as st
import pandas as pd
import pickle

def main():
    style = """<div style='background-color:pink; padding:12px'>
              <h1 style='color:black'>House Price Prediction App</h1>
          </div>"""
    st.markdown(style, unsafe_allow_html=True)

    left, right = st.columns((2, 2))
    longitude = left.number_input('Enter the Longitude in negative number', step=1.0, format='%.2f', value=-21.34)
    latitude = right.number_input('Enter the Latitude in positive number', step=1.0, format='%.2f', value=35.84)
    housing_median_age = left.number_input('Enter the median age of the building', step=1.0, format='%.1f', value=25.0)
    total_rooms = right.number_input('How many rooms are there in the house?', step=1.0, format='%.1f', value=56.0)
    total_bedrooms = left.number_input('How many bedrooms are there in the house?', step=1.0, format='%.1f', value=15.0)
    population = right.number_input('Population of people within a block', step=1.0, format='%.1f', value=250.0)
    households = left.number_input('Population of a household', step=1.0, format='%.1f', value=43.0)
    median_income = right.number_input('Median income of a household in Dollars', step=1.0, format='%.1f', value=3000.0)
    ocean_proximity = st.selectbox('How close to the sea is the house?',
                                   ('<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'))
    button = st.button('Predict')

    # if the button is pressed
    if button:
        # make prediction
        result = predict(longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
                         population, households, median_income, ocean_proximity)
        st.success(f'The value of the house is ${result}')
We imported Streamlit and the other libraries we need. Then we defined our main() function. We want it to be executed as soon as we open the app, so we will call the function using the __name__ variable at the very end of our script.
The unsafe_allow_html argument makes it possible for the HTML tags to be rendered by Streamlit.
With st.columns, we were able to display our input widgets side by side. We formatted each input to match the datatype of the corresponding column in our dataset. If the button is pressed, a callback, the predict() function, is executed.
🌍 Recommended: Streamlit Button — A Helpful Guide
Let's now define the predict() function.
# load the trained model
with open('rf_model.pkl', 'rb') as rf:
    model = pickle.load(rf)

# load the fitted StandardScaler
with open('scaler.pkl', 'rb') as stds:
    scaler = pickle.load(stds)

def predict(longitude, latitude, housing_median_age, total_rooms,
            total_bedrooms, population, households, median_income, ocean_pro):
    # processing user input: encode ocean_proximity the same way LabelEncoder did
    ocean = (0 if ocean_pro == '<1H OCEAN' else
             1 if ocean_pro == 'INLAND' else
             2 if ocean_pro == 'ISLAND' else
             3 if ocean_pro == 'NEAR BAY' else 4)
    med_income = median_income / 5
    lists = [longitude, latitude, housing_median_age, total_rooms, total_bedrooms,
             population, households, med_income, ocean]
    df = pd.DataFrame(lists).transpose()

    # scaling the data with the saved StandardScaler
    scaled_df = scaler.transform(df)

    # making predictions using the trained model
    prediction = model.predict(scaled_df)
    result = int(prediction[0])
    return result

if __name__ == '__main__':
    main()
We started by loading the trained model and the StandardScaler we saved earlier.
In the predict() function, we use a ternary operator (a chained conditional expression) to turn the user's ocean_proximity input into a number.
Notice that we made sure it corresponds with the numbers assigned by LabelEncoder. If you are ever in doubt, use the .value_counts() method on the categorical column to confirm.
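A quick way to confirm the mapping is to print the encoder's classes right after fitting it, before the original strings are overwritten. This is a small sketch that assumes label_encoder is still the fitted encoder from the preprocessing step.

# mapping of the original categories to the encoded integers
mapping = dict(zip(label_encoder.classes_,
                   label_encoder.transform(label_encoder.classes_)))
print(mapping)
# {'<1H OCEAN': 0, 'INLAND': 1, 'ISLAND': 2, 'NEAR BAY': 3, 'NEAR OCEAN': 4}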
We divided the median_income by 5 since the corresponding column in our dataset is said to be in tens of thousands of Dollars. However, this may not be necessary given that the StandardScaler scales the data anyway. We did it just to be on the safe side.
By passing the list of inputs to pd.DataFrame() and transposing it, we turn the given inputs into a single-row DataFrame. We also made sure the order of the parameters in the predict() function corresponds to the order of the columns in the training data.
If the function seems to predict the same amount despite changes to the input details, you may want to check the correlation between the target variable and the features by typing data.corr().
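For example, sorting the correlations with the target makes it easy to see which columns matter most:

# correlation of every column with the target, strongest first
print(data.corr()['median_house_value'].sort_values(ascending=False))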
If we were to apply Recursive Feature Elimination (RFE) to select the features best able to predict the target variable, it would select just 4: longitude, latitude, median_income, and ocean_proximity. Let me show you what I mean.
from sklearn.feature_selection import RFE

model = RandomForestRegressor()
rfe = RFE(model)
fit = rfe.fit(x, y)

print(fit.n_features_)
# 4
print(fit.support_)
# array([ True,  True, False, False, False, False, False,  True,  True])
print(fit.ranking_)
# array([1, 1, 2, 6, 3, 5, 4, 1, 1])
Only 4 features carry most of the predictive power for the target variable. If you keep getting the same predicted amount, that may be the reason.
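If you wanted to act on that, here is a small sketch of keeping only the RFE-selected columns and retraining on them; whether it actually improves the score is something you would have to test yourself.

# reduce the feature matrix to the columns RFE selected
x_selected = fit.transform(x)
print(x_selected.shape)  # (number of rows, 4)

# train and evaluate on the reduced feature set
x_tr, x_te, y_tr, y_te = train_test_split(x_selected, y, test_size=0.3, random_state=7)
model = RandomForestRegressor()
model.fit(x_tr, y_tr)
print(model.score(x_te, y_te))  # R squared with only 4 features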
The purpose of this tutorial is purely educational, to demonstrate how to use Python to solve machine learning problems. I tried to keep things simple by not going through data visualization and feature engineering. Since the data is old, it should not be relied on when making important decisions.
We finally came to the end of the tutorial. Be sure to check my GitHub page to see the full project code.
To deploy on Streamlit Cloud, I assume you have already created a repository and added the required files. Then, create an account on Streamlit Cloud and input your repository URL. Streamlit will do the rest.
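One thing Streamlit Cloud needs is a requirements.txt file in the repository so it can install the app's dependencies. A minimal sketch (without version pins) could look like this:

streamlit
pandas
scikit-learn

It is a good idea to keep the scikit-learn version there close to the one you used to create the pickle files, since pickled models are not always compatible across library versions.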
I have already deployed mine on Streamlit Cloud. Alright, enjoy your day.
🌍 Recommended Project: How I Built and Deployed a Python Loan Eligibility Prediction App on Streamlit