In this tutorial, I will show you the steps I took to design a forecasting app and host it on Streamlit Cloud. If you have been reading my tutorials, you will have noticed that Streamlit is my favorite web framework as far as data science is concerned.
This is because it's easy to use: it has a shallow learning curve and does not require complex design work to set up. Streamlit is a go-to for data scientists with little knowledge of web development. Of course, I have some knowledge of web development myself, and I will soon publish a tutorial on some projects I built using Django and other Python web frameworks.
Introduction to Forecasting
To forecast means to predict future values of data. Such a forecast gives an organization an idea of where its business is heading. Forecasting is done in virtually all areas of human endeavor. Here are some examples of real-world forecasting applications:
- Sales forecasts
- Energy consumption forecasts
- Birth rate predictions
- Temperature forecasts
Perhaps the most common forecasts companies make are sales and price forecasts.
As humans, we yearn to know what will happen in the future. Unless you are going to consult a soothsayer, which of course I detest, you have to rely on machine learning for a near-accurate prediction that enables you to make informed decisions.
Time series analysis is one of the most widely used data science analyses. It involves analyzing data points ordered in time to check for underlying patterns that aid in forecasting.
Thanks to Python's libraries, the effort involved in forecasting has been drastically reduced. So, in this tutorial, you will be using Pandas as well as Statsmodels, Scikit-learn, and other libraries to speed up the process.
Forecasting Models
There are many models we can use to train on our data for forecasting. We have Prophet from Facebook, Darts, ARIMA, Holt-Winters, Exponential Smoothing, and many others. But in this tutorial, we will use the ARIMA model.
The ARIMA Model
ARIMA, short for Autoregressive Integrated Moving Average, is a statistical tool that relies on past values to predict future values. It is characterized by 3 terms:
- p: the order of the autoregressive (AR) term, that is, the number of lagged values of y used as predictors.
- q: the order of the moving average (MA) term, which uses past forecast errors.
- d: the minimum number of differencing operations needed to make the series stationary.
Differencing is a method used to make a time series stationary. It goes without saying that a series has to be stationary before you can feed it into the ARIMA model.
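As a quick illustration before we get to the app (a minimal sketch with made-up numbers), here is what differencing looks like with pandas:

```python
import pandas as pd

# A hypothetical trending series -- clearly non-stationary
s = pd.Series([10, 12, 15, 19, 24])

print(s.diff().dropna().tolist())         # [2.0, 3.0, 4.0, 5.0] -- still trending
print(s.diff().diff().dropna().tolist())  # [1.0, 1.0, 1.0] -- constant, so d=2 here
```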
Non-stationary data is time series data whose statistical properties, e.g., mean and variance, change over time. There are several methods to check whether data is stationary. The most common and preferred one is the Augmented Dickey-Fuller test.
💡 What Is the Augmented Dickey-Fuller Test? The augmented Dickey-Fuller (ADF) test is a statistical hypothesis test used to determine whether a time series is stationary or not. It is a variation of the Dickey-Fuller test that includes additional terms to account for autocorrelation and trend in the data. The test works by testing the null hypothesis that a unit root is present in the time series data, indicating non-stationarity. (1) If the p-value of the test is below a pre-specified significance level, typically 0.05, the null hypothesis is rejected and the time series is considered stationary. (2) If the p-value is above the significance level, the null hypothesis cannot be rejected, and the time series is considered non-stationary. The ADF test is commonly used in econometrics and financial analysis to test for the presence of trends in time series data.
We will learn how to use this tool from the `statsmodels` library.
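Before wiring it into the app, here is a minimal standalone sketch of what `adfuller` returns; the random-walk series is generated purely for illustration:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

# A random walk is a textbook example of a non-stationary series
np.random.seed(0)
series = np.cumsum(np.random.normal(size=200))

result = adfuller(series)
print(f'ADF statistic: {result[0]}')
print(f'p-value: {result[1]}')         # expected to be above 0.05 here
print(f'Critical values: {result[4]}')
```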
Creating Streamlit Dashboard
We start by creating what I call our `main()` function, which will run when we open the app.
```python
import streamlit as st

def main():
    st.sidebar.header('App Forecaster')
    st.sidebar.text('MAKE SURE THE DATA IS TIME SERIES')
    st.sidebar.text('WITH ONLY TWO COLUMNS INCLUDING DATE')
    # Let the user choose how to supply the data
    option = st.sidebar.selectbox('How do you want to get the data?', ['url', 'file'])
    if option == 'url':
        url = st.sidebar.text_input('Enter a url')
        if url:
            dataframe(url)
    else:
        file = st.sidebar.file_uploader('Choose a file', type=['csv', 'txt'])
        if file:
            dataframe(file)
```
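If you would like to try the app while following along, save the script (I will assume a file name of `app.py` here) and launch it from a terminal with `streamlit run app.py`. Note that for the app to do anything, the full script needs to call `main()` at the bottom, either as a plain `main()` call or under the usual `if __name__ == '__main__':` guard.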
The function allows users to get the data either online or from a file.
Whichever way they choose, the data will be sent to the `dataframe()` function.
This is a forecasting app designed to forecast any time series data, provided the data is a time series object with only two columns: a date column and a value column. Python will complain if anything else is given.
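For illustration, a file the app would accept might look like this (the column names and values here are hypothetical):

```
Date,Sales
2023-01-01,200
2023-01-02,215
2023-01-03,190
```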
Once the data is sent to the `dataframe()` function, three radio buttons will appear.
```python
from pandas import read_csv

def dataframe(df):
    st.header('App Forecaster')
    # Treat the first row as the header and the first column as a datetime index
    data = read_csv(df, header=0, parse_dates=True, index_col=0)
    to_do = st.radio('SELECT WHAT YOU WOULD LIKE TO DO WITH THE DATA',
                     ['Visualize', 'Check for stationarity', 'Forecast'])
    if to_do == 'Visualize':
        data_visualization(data)
    elif to_do == 'Check for stationarity':
        stationary_test(data)
    else:
        forecast_data(data)
```
The first is data visualization, which comes with a Draw button. Once clicked, a line chart will be drawn.
```python
def data_visualization(data):
    button = st.button('Draw')
    if button:
        st.line_chart(data)
```
No need to use Matplotlib, as Streamlit has us covered with just one line of code.
The next option is to check if the data is stationary.
```python
def stationary_test(data):
    res = testing(data)
    st.text(f'Augmented Dickey-Fuller Statistical Test: {res[0]}\np-value: {res[1]}')
    st.text('Critical values at different levels:')
    for k, v in res[4].items():
        st.text(f'{k}: {v}')
    if res[1] > 0.05:
        st.text('Your data is non-stationary and is being transformed\n'
                'to a stationary time series.')
        if st.button('Check results'):
            data_transform(data)
    else:
        st.text('Your data is stationary and is ready for training.')
```
I accomplished this using the Augmented Dickey-Fuller test, which I wrapped in a helper function. The function is called inside `stationary_test()`, and its result is saved in the `res` variable.
```python
from statsmodels.tsa.stattools import adfuller

def testing(df):
    # Returns a tuple: (test statistic, p-value, lags used,
    # number of observations, critical values, ...)
    return adfuller(df)
```
That's it. Very simple.
The result shows that hypothesis testing was done: the null and alternative hypotheses were evaluated. If the p-value is less than the 0.05 significance level (i.e., at 95% confidence), we reject the null hypothesis, which indicates that the data is stationary.
If it is greater, the data will be sent to the `data_transform()` function, where the differencing method is used to convert it to stationary data.
```python
import numpy as np

def data_transform(df):
    df_log = np.log(df.iloc[:, 0])    # log-transform to stabilize the variance
    df_diff = df_log.diff().dropna()  # first-order differencing
    res = testing(df_diff)            # re-test the differenced series
    if res[1] < 0.05:
        st.line_chart(df_diff)
        st.write('1st order differencing')
    else:
        df_diff_2 = df_diff.diff().dropna()  # second-order differencing
        st.line_chart(df_diff_2)
        st.write('2nd order differencing')
        stationary_test(df_diff_2)
```
Sometimes, first-order differencing does not do the job. We then have to repeat the differencing, making sure we drop the null values it creates. Once the data has been converted, the result is drawn on a chart and the test results are displayed again using the `stationary_test()` function.
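One caveat worth flagging (my note, not a behavior of the original app): `np.log` requires strictly positive values, so a series containing zeros or negatives would need a different variance-stabilizing transformation, or differencing alone.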
Remember, this is just to show you how to check for stationarity. We are not using the converted data to train the model, because the ARIMA model will handle the differencing for us (that is what the d term is for).
The Forecasting Stage
Back to the radio buttons in the `dataframe()` function: when our users select the Forecast option, the `forecast_data()` function will be executed.
```python
def forecast_data(df):
    st.text('...searching for the optimum parameter')
    optimum_para(df)
    st.text('Enter the parameter with the lowest RMSE')
    p = st.number_input('The p term')
    q = st.number_input('The q term')
    d = st.number_input('The d term')
    period = st.number_input('Enter the next period(s) you want to forecast', value=7)
    button = st.button('Forecast')
    if button:
        model_forecast(df, p, q, d, period)
```
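One small detail about this function: `st.number_input` returns a float by default, so it is safer to cast p, q, and d to integers before handing them to ARIMA. You will see that cast in `model_forecast()` below; it is my addition rather than something Streamlit does for us.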
We first call yet another function to display the results of grid-searching for the optimum ARIMA parameters. Below is the `optimum_para()` function.
```python
from statsmodels.tsa.arima.model import ARIMA
from numpy import sqrt
from sklearn.metrics import mean_squared_error

def optimum_para(df):
    p_values = [0, 1, 2]
    d_values = range(0, 3)
    q_values = range(0, 3)
    # Hold out the last 30% of the series for testing
    size = int(len(df) * .7)
    train, test = df[:size], df[size:]
    for p in p_values:
        for d in d_values:
            for q in q_values:
                # statsmodels expects the order as (p, d, q)
                order = (p, d, q)
                model = ARIMA(train, order=order).fit()
                preds = model.predict(start=len(train), end=len(train) + len(test) - 1)
                error = sqrt(mean_squared_error(test, preds))
                st.text(f'ARIMA {order} RMSE: {error}')
```
Here, we first split the data, taking 70% for training and the rest for testing.
Then, we loop over the parameters within a given range. We call on ARIMA to train the model in each iteration.
Finally, we compute the error using the Root Mean Squared Error (RMSE). There are many metrics to choose from, but RMSE is one of the most widely used. The parameters with the lowest RMSE will be used.
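One thing worth knowing: some (p, d, q) combinations can fail to converge or raise errors during fitting. A defensive pattern I would suggest (not part of the app above) is to guard the body of the triple loop inside `optimum_para()` with a try/except so a bad combination is simply skipped:

```python
# Inside the triple loop of optimum_para():
try:
    model = ARIMA(train, order=order).fit()
    preds = model.predict(start=len(train), end=len(train) + len(test) - 1)
    error = sqrt(mean_squared_error(test, preds))
    st.text(f'ARIMA {order} RMSE: {error}')
except Exception:
    # Skip parameter combinations that fail to fit
    pass
```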
💡 What is the Root Mean Squared Error (RMSE)? Have you ever wondered how well a predictive model is performing? RMSE is a handy tool that tells you how far off your model's predictions are from the true values. It works by measuring the difference between the actual and predicted values for each data point, squaring them, and taking the average. The square root of this average gives the RMSE, which is a way to understand how much your model's predictions deviate from reality. The lower the RMSE, the better the model fits the data. So, if you want to know how well your model is doing, calculating the RMSE is a good place to start!
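To make that concrete, here is a minimal sketch with made-up numbers, showing that the manual formula agrees with scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

actual = np.array([3.0, 5.0, 7.0])     # hypothetical true values
predicted = np.array([2.5, 5.5, 8.0])  # hypothetical predictions

rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(actual, predicted))
print(rmse_manual, rmse_sklearn)  # both print ~0.7071
```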
The `number_input` widgets in `forecast_data()` are for our users to enter the three ARIMA parameters with the lowest RMSE. The reason for this is that I made this app to accommodate any time series forecast, whether it is sales, weather, or any other time series data. As long as the data comes in the aforementioned CSV pattern, the app will make predictions for you.
This is why I didn't save the trained model, as we don't know what time series data our user will use.
Can you see how the data moves from one function to the next and gets the job done? Let's now code our final callback function, which will be called once the Forecast button is clicked.
```python
def model_forecast(data, p, q, d, period):
    # st.number_input returns floats, but ARIMA expects integer orders
    order = (int(p), int(d), int(q))
    size = int(len(data) * .7)
    train, test = data[:size], data[size:]
    model = ARIMA(train.values, order=order)
    model_fit = model.fit()
    output = model_fit.predict(start=len(train), end=len(train) + len(test) - 1)
    error = sqrt(mean_squared_error(test, output))
    st.text(f'RMSE using {order}: {error}')
    st.text(f'Forecasting {period} future values')
    # Refit on the full dataset before forecasting beyond it
    model_2 = ARIMA(data.values, order=order).fit()
    forecast = model_2.predict(start=len(data), end=len(data) + period - 1)
    day = 1
    for i in forecast:
        st.text(f'Period {day}: {i}')
        day += 1
```
We repeat what we did before using the best parameters, just to make sure we get the same RMSE.
We then train on the entire dataset.
The `period` parameter can be days, weeks, months, or years, depending on our user's data. We include it in the `predict()` method, which will then forecast the future values for the given period.
Finally, we use `st.text` to display the results.
Conclusion
I appreciate the time you have spent with me in this tutorial.
No doubt, you have learned something that gives you an idea of what you can build with Python.
You can find the full code on my GitHub page. I have this app already hosted on Streamlit Cloud.
Check it out and share it with others. Enjoy your day.