Python Time Series Forecast - A Guided Example on Bitcoin Price Data

A time series is essentially tabular data with the special feature of having a time index.

The common forecast task is ‘knowing the past (and sometimes the present), predict the future’. This task, taken as a principle, reveals itself in several ways:

in how to interpret your problem,
in feature engineering, and
in which forecast strategy to take.

💡 The aim of the first article in this series is to present particular feature engineering associated with time series, with explicit functions to be added to your workflow. In the next article, we will discuss seasonality and strategies for multi-step forecasting.

For more information and different approaches to time series, we refer to Kaggle’s Time Series Crash Course and ML Mastery’s Blog, from which most of my inspiration comes.

You can find a Jupyter Notebook with all the code at the end of this article.

🚫 Disclaimer: This article is a programming/data analysis tutorial only and is not intended to be any kind of investment advice.

Setting Up Our Case Study: Loading the Data

We will deal with Bitcoin data. Cryptocurrency prices are wild animals and hard to predict, so a main issue here is collecting alternative datasets.

To put this into practice, we load (free) sentiment analysis data in addition to the BTC-USD price.

Yahoo! Finance API

Ran Aroussi, a senior coder, rewrote Yahoo!’s decommissioned finance API, a great service to the (finance-learning part of) humanity.

Downloading financial data then takes a few simple steps:

Choose your ticker;
Choose a start date, an end date (both in 'YYYY-MM-DD' format), and the frequency of the data;
Type:

import yfinance as yf
data = yf.download(ticker, start=start_date, end=end_date, interval=frequency)

The returned ‘data’ is now a pandas DataFrame with a DatetimeIndex and columns ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'].

We will use 'BTC-USD' data from '2020-02-14' to '2022-09-21' with daily frequency:

import pandas as pd
import yfinance as yf

data = yf.download('BTC-USD', 
                   start='2020-02-14', 
                   end='2022-09-21', 
                   interval='1d'
                  )

data

Index comes date-parsed and ready to use:

data.index

(DatetimeIndex!!! 😍)

We focus on the ‘Close’ column and ignore all the others, although ‘High’ and ‘Low’ could be of help when forecasting ‘Close’.

SentiCrypt Sentiment Analysis API

The second piece of information we gather is sentiment analysis.

👉 We will use the one from SentiCrypt API, whose work I really appreciate. With each request there, you can get a file with twice-per-minute data from sentiment analysis ‘of a stream of cryptocurrency related text data mined from the internet’ – pretty cool.

Here is an example of how to get and handle their data:

import requests


r = requests.get(f'http://api.senticrypt.com/v1/history/bitcoin-2022-09-22_08.json')

sentic = pd.DataFrame(r.json())
sentic.timestamp = pd.to_datetime(sentic.timestamp, unit='s')
sentic.set_index('timestamp',inplace=True)

sentic

As you can see, their database is quite rich and offers up to five new features to our dataset.

Since we work with daily data, though, we will use the mean of their 24h data (see line 9 in the code below).

The download and averaging of the sentiment data are done with the code below:

import csv
import time
columns = ['mean', 'sum', 'last', 'count', 'rate', 'median']
with open('sentic.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    for date in data.index:
        r = requests.get(f'http://api.senticrypt.com/v1/history/bitcoin-{date.date()}_08.json')
        sentic = pd.DataFrame(r.json())
        row = [date.date()] + sentic[columns].mean(numeric_only=True).to_list() + [sentic['last'][sentic['last'] <= 0].mean()]
        writer.writerow(row)
        print(f'Completed {date.date()}')
        time.sleep(0.4)

The output will be a series of prints 'Completed YYYY-MM-DD', so you have some sense of the download progress.

💡 Remark: If you have a fast internet connection, I kindly ask you to keep the time.sleep(0.4) inside the for loop, as it is in the code, since I don’t want to break their site. If your connection is not that fast (as is my case), you can remove that last line.

At the end of the day, you will have a new CSV file to load as a DataFrame:

sentic = pd.read_csv('sentic.csv', 
                     index_col=0, 
                     parse_dates=True, 
                     names=['mean', 'sum', 'last', 'count', 'rate', 'median', 'neg_median'] 
                     )
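To make the daily averaging concrete without hitting the API, here is a minimal sketch on synthetic data. The 30-second timestamps and the random ‘last’ scores are assumptions made up for illustration, not SentiCrypt output; the point is only that the per-day mean from the download loop can also be written with resample:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a twice-per-minute sentiment stream (two full days)
idx = pd.date_range('2022-09-20', periods=2 * 24 * 60 * 2, freq=pd.Timedelta(seconds=30))
rng = np.random.default_rng(0)
toy = pd.DataFrame({'last': rng.normal(size=len(idx))}, index=idx)

# One row per calendar day: the mean of all observations in that day
daily = toy.resample('D').mean()

# Mean of the non-positive scores only, mirroring the extra column in the loop
neg_daily = toy['last'].where(toy['last'] <= 0).resample('D').mean()

print(daily)
print(neg_daily)
```

If you already hold the high-frequency data in memory, this avoids writing an intermediate CSV.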

The last step I suggest before crunching the data is to merge the financial and the sentiment DataFrames.

For readability, we also lowercase the column names:

df = pd.merge(data[['Close', 'Volume']], sentic, left_index=True, right_index=True, how='inner')
df.columns = df.columns.str.lower()
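If you want to see what the inner merge does before running it on the real data, the sketch below uses two tiny made-up frames (the values and dates are assumptions for illustration). Only the dates present in both frames survive:

```python
import pandas as pd

# Toy price frame: four consecutive days
prices = pd.DataFrame(
    {'Close': [100.0, 101.0, 99.0, 102.0]},
    index=pd.date_range('2022-01-01', periods=4, freq='D'),
)

# Toy sentiment frame: 2022-01-03 is missing and 2022-01-05 is extra
sentiment = pd.DataFrame(
    {'mean': [0.1, -0.2, 0.3, 0.0]},
    index=pd.to_datetime(['2022-01-01', '2022-01-02', '2022-01-04', '2022-01-05']),
)

# Inner join on the index keeps only the dates found in both frames
toy_df = pd.merge(prices, sentiment, left_index=True, right_index=True, how='inner')
toy_df.columns = toy_df.columns.str.lower()

print(toy_df)
```

Comparing len(df) with len(data) after the real merge is a quick way to check how many days had no sentiment file.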

All said, you can always do your own scraping/sentiment analysis by following these great articles:

Getting Lags

The simplest possible feature one can consider is a lag. That is, past values.

The first lag of the close value at index '2022-02-01' is the previous close value, at '2022-01-31'.

In other words, you would try to predict today knowing yesterday. One can go steps further and consider the nth lag, i.e., the value n days before.

💡 In Pandas, the lag is recovered with the .shift() method. It takes an integer parameter giving the number of steps to shift.

For example, the first three lags of the ‘close’ column are given by:

close_lag_1 = df.close.shift(1)
close_lag_2 = df.close.shift(2)
close_lag_3 = df.close.shift(3)

close_lags = pd.concat([close_lag_1,close_lag_2,close_lag_3], axis=1)
close_lags

Notice that the shift operator will naturally create NaN values in the first rows. That is because there is no data prior to the first dates to fill in these cells.

However, since we have enough data ahead, we will drop these rows.

In any case, you might want to change the name of the resulting columns as well:

close_lags.columns = [f'close_lag_{i}' for i in range (1,4)]
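A minimal sketch of the dropping step, on a made-up five-day series (the numbers are assumptions for illustration). It also keeps the target aligned with the rows that survive:

```python
import pandas as pd

toy_close = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0],
    index=pd.date_range('2022-01-01', periods=5, freq='D'),
    name='close',
)

# Three lags, named as in the text
toy_lags = pd.concat([toy_close.shift(i) for i in (1, 2, 3)], axis=1)
toy_lags.columns = [f'close_lag_{i}' for i in range(1, 4)]

# Drop the leading rows that contain NaN, then align the target to the same dates
X = toy_lags.dropna()
y = toy_close.loc[X.index]

print(X)
print(y)
```

With three lags, the first three rows are lost, so only the last two days remain here.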

The simplest forecast (known as the naive forecast or persistence forecast) assumes that tomorrow’s value will be the same as today’s.

Let us see how well it succeeds with Bitcoin:

diff = close_lag_1-df.close
diff.plot()

Pretty wild, isn’t it? We can even compute its mean absolute error by importing the relevant function from sklearn:

from sklearn.metrics import mean_absolute_error

print(mean_absolute_error(df.close[1:], close_lag_1[1:]))
# 857.8280107761459

We conclude that the average error of the persistence forecast is around 858 USD, or around 25,000 USD per month (way more than what I can afford!)

In spite of that, an interesting takeaway is that Bitcoin’s closing price changes, on average, by about 858 USD per day in absolute value.
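To see why the persistence error and the average daily change are the same number, here is a small sketch computed by hand with numpy on made-up prices (an assumption for illustration, not the Bitcoin data):

```python
import numpy as np
import pandas as pd

toy_close = pd.Series([100.0, 103.0, 101.0, 106.0], name='close')

# Persistence forecast: the prediction for each day is the previous day's value
naive_pred = toy_close.shift(1)

# Mean absolute error, skipping the first row (it has no previous value)
mae = np.mean(np.abs(toy_close[1:] - naive_pred[1:]))

# The same number is the mean absolute day-to-day change
assert np.isclose(mae, toy_close.diff().abs().mean())

print(mae)
```

Any model worth keeping should beat this baseline on the same dates.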

To keep the workflow promise, let us write a function to get the lags of a pandas.Series:

def make_lags(df, n_lags=1, lead_time=1):
    """
    Compute lags of a pandas.Series from lead_time to lead_time + n_lags - 1. Alternatively, a list can be passed as n_lags.
    Returns a pd.DataFrame whose ith column is either the i+lead_time lag or the ith element of n_lags.
    """
    if isinstance(n_lags, int):
        lag_list = list(range(lead_time, n_lags + lead_time))
    else:
        lag_list = n_lags
    # Map each column name to the correspondingly shifted series
    lags = {
        f'{df.name}_lag_{i}': df.shift(i) for i in lag_list
    }

    return pd.concat(lags, axis=1)

▶️ Scroll to the beginning of this article and watch the video for the construction and detailed explanation of the function.

Of course, not all lags should be used, and many are related to one another through cascading effects.

For example, the effect captured by the third lag could already be contained in the second lag.

To figure that out, we import the special function plot_pacf from statsmodels. It takes an array of time-series values and the number of desired lags in order to return a plot of the partial autocorrelation between the lags and the present value.

An example follows below:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(df.close, lags=20)
plt.show()

From the plot, we see that the first lag is highly correlated to the present closing price. In addition, the 10th and 20th lags have significant correlation (more than 5%).

It is natural to expect the first lag correlation; however, the other two are quite surprising. In this case, one should watch out for spurious correlations and analyze more closely what is going on, case by case.

We will do that later in the article.
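To make the idea of partial autocorrelation less of a black box, here is a sketch of one common way to estimate it: the partial autocorrelation at lag k is the last coefficient of a linear regression of the present value on its first k lags. The helper pacf_by_regression and the synthetic AR(1) series are assumptions for illustration; statsmodels may use a different estimator, so numbers need not match its plot exactly.

```python
import numpy as np

def pacf_by_regression(x, k):
    """Hypothetical helper: partial autocorrelation at lag k, estimated as the
    last coefficient of an OLS regression of x[t] on x[t-1], ..., x[t-k]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Column j holds the j-th lag (j = 1..k) for t = k..n-1
    lagged = np.column_stack([x[k - j : n - j] for j in range(1, k + 1)])
    design = np.column_stack([np.ones(n - k), lagged])  # add an intercept
    coef, *_ = np.linalg.lstsq(design, x[k:], rcond=None)
    return coef[-1]

# Synthetic series where only the first lag matters: x[t] = 0.8 * x[t-1] + noise
rng = np.random.default_rng(42)
noise = rng.normal(size=2000)
x = np.zeros(2000)
for t in range(1, 2000):
    x[t] = 0.8 * x[t - 1] + noise[t]

print(pacf_by_regression(x, 1))  # close to the true coefficient 0.8
print(pacf_by_regression(x, 2))  # close to zero: lag 2 adds nothing beyond lag 1
```

This is the cascading effect in action: the second lag is correlated with the present, but only through the first lag, so its partial autocorrelation vanishes.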

Remarks and Warnings: Multivariate Series and Lookahead

Often we have more than one feature available to help with predictions.

In our example, we want to predict Bitcoin’s closing price, but we also have the ‘Volume’, ‘Open’, ‘High’ and ‘Low’ columns. All the features you have can be taken into account; however, one should be careful not to look ahead.

For example, if it is 8AM today and you want to predict Bitcoin’s price for tomorrow, then today’s closing price is not an available feature, simply because you do not have it yet!

However, we can use yesterday’s closing value. The same for ‘Volume’ or any other feature.

💡 So, be careful: time series features must be properly lagged.

In the present scenario, we could use data up to, say, 7:59AM, assuming our data collection process takes less than one minute.

In other scenarios, say you are assessing the health of patients in a hospital, the data collection could take a whole day or two.

The time that elapses between running the algorithm and the first value you need to forecast is called lead time. If the lead time is greater than one, you must consider shifts greater than one.
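As a sketch of what a lead time greater than one means in practice, the toy example below assumes a lead time of two steps, so the lags start at 2 instead of 1. The series values and the choice of three lags are assumptions for illustration:

```python
import pandas as pd

toy_close = pd.Series(
    [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
    index=pd.date_range('2022-01-01', periods=6, freq='D'),
    name='close',
)

lead_time = 2  # assumed: the newest usable observation is two steps old
n_lags = 3

# Lags start at lead_time instead of 1: here lags 2, 3 and 4
features = pd.concat(
    {f'close_lag_{i}': toy_close.shift(i) for i in range(lead_time, lead_time + n_lags)},
    axis=1,
)

print(features)
```

Note that the first lag is deliberately absent: it would not be known at prediction time.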

The opposite situation is when you have future data (I’m not talking about these guys, calm down): the number of jackets that will arrive at your store for next week’s stock, the items that will be on sale, or how many cars Tesla plans to produce tomorrow (as long as it impacts your target’s value) are often predetermined, so you can use their future values as features.

These are called leads. They are realized by a negative shift. Below follows a simple function to make leads (you are welcome to adapt the make_lags, if you need a more sophisticated version):

def make_leads(df, n_leads=1):
    """
    Compute the first n_leads leads of a pandas.Series. 
    Returns a pd.DataFrame whose ith column is the ith lead.
    """
    # A negative shift brings future values back onto the current row
    leads = {
        f'{df.name}_lead_{i}': df.shift(-i)
        for i in range(1, n_leads + 1)
    }

    return pd.concat(leads, axis=1)
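A tiny sketch of a lead on a made-up, known-in-advance feature (the on_sale series is hypothetical). The negative shift places tomorrow’s value on today’s row, and the NaN now appears at the end instead of the beginning:

```python
import pandas as pd

# Hypothetical known-in-advance feature: whether an item is on sale each day
on_sale = pd.Series(
    [0, 0, 1, 1, 0],
    index=pd.date_range('2022-01-01', periods=5, freq='D'),
    name='on_sale',
)

# shift(-1) moves tomorrow's value onto today's row
on_sale_lead_1 = on_sale.shift(-1)

print(pd.concat({'on_sale': on_sale, 'on_sale_lead_1': on_sale_lead_1}, axis=1))
```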

Finally, we might want to apply the functions we defined in more than one column at a time.

We provide a function wrapper for this purpose:

def multi_wrapper(function, df: pd.DataFrame, columns_list: list = None, drop_original=False, get_coordinate: int = None, **kwargs) -> pd.DataFrame:
    # Default to every column of df
    if columns_list is None:
        columns_list = df.columns

    X_list = []
    for col in columns_list:
        # Apply the function to each selected column
        X = function(df[col], **kwargs)
        # Some functions return a tuple; keep only the requested coordinate
        if get_coordinate is not None:
            X = X[get_coordinate]
        # Optionally drop the first column of each result
        if drop_original:
            X = X.iloc[:, 1:]
        X_list.append(X)

    return pd.concat(X_list, axis=1)

Check out the video to know why every detail is there.
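The core idea of applying a feature function to several columns can be sketched in a few lines. The frame values and the helper two_lags are assumptions for illustration; this is a simplified stand-in, not a replacement for the wrapper above:

```python
import pandas as pd

toy = pd.DataFrame(
    {'close': [10.0, 11.0, 12.0, 13.0], 'volume': [5.0, 6.0, 7.0, 8.0]},
    index=pd.date_range('2022-01-01', periods=4, freq='D'),
)

def two_lags(series):
    """Hypothetical stand-in for a per-column feature function."""
    return pd.concat({f'{series.name}_lag_{i}': series.shift(i) for i in (1, 2)}, axis=1)

# Apply the function column by column and glue the results side by side
features = pd.concat([two_lags(toy[col]) for col in toy.columns], axis=1)

print(features.columns.tolist())
```

Because each column name carries the original series name, the resulting features stay easy to trace back.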

Trends as features

There is a lot more one can do with time series, and we will follow with Trends.

Let’s stick to definition 2.

In practice, such a movement is well expressed with the rolling method in Pandas.

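Below we compute a rolling average of the Bitcoin price over a window of four observations. Since our data is daily, every point in our new column is the average of the last four days’ closing prices (a true four-week window would use window=28):

n_window = 4  # the window counts rows, i.e. days in our daily data
close_4wtrend = df.close.rolling(window=n_window).mean()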

The rolling method creates a Rolling (window) object, very similar to a GroupBy one. Iterating over it yields one pandas.Series per row, each comprising up to the last n_window observations:

for item in df.close.rolling(window=4):
    print(item)
    print(type(item))

You can operate over it in the same fashion you would do with a GroupBy object.

The methods .mean(), .median(), .min(), .max() will return the mean, median, min and max of each Series.

You can even apply all of them together by using a dictionary inside the .agg() method:

# The window spans n_window daily rows, so the labels use a '4d' prefix
close_4wtrend = df.close.rolling(window=n_window)\
                        .agg({
                              '4d_avg':'mean', 
                              '4d_median':'median', 
                              '4d_min':'min', 
                              '4d_max':'max'
                        })


display(close_4wtrend)

close_4wtrend.plot()
df.close.plot(legend=True)

Since we have too many rows in the dataset, we cannot see much of the new lines if we do not zoom in.

Next, we focus on January 2022 and highlight the ‘close’ line by increasing its thickness:

close_4wtrend.plot(xlim=('2022-01-01', '2022-02-01'))  
df.close.plot(legend=True, xlim=('2022-01-01', '2022-02-01'), linewidth=3, color='black') 

# (DatetimeIndex to the rescue! :))

Better now? Try changing the window to 12 and keeping only max, min, for example.

For modeling purposes, one should keep in mind that a rolling average over four observations is a linear function of the present value and its first three lags. Machine Learning algorithms usually do well at detecting linear relationships, so such an average adds little beyond the lags themselves. If you want to add new information to the features, min, max, median or an Exponential Moving Average might be better options.

A myriad of Window/Rolling options are described in the pandas documentation. We shall explore some of them in a later article.
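The linearity claim is easy to check on a toy series. In the sketch below (made-up prices; span=4 for the exponential average is an arbitrary choice for illustration), the rolling mean is rebuilt from the present value and its three lags, while the rolling max and the exponential moving average are computed for comparison:

```python
import numpy as np
import pandas as pd

toy_close = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 18.0], name='close')

# Rolling mean over 4 observations ...
rolling_mean = toy_close.rolling(window=4).mean()

# ... equals the plain average of the present value and its first three lags
from_lags = sum(toy_close.shift(i) for i in range(4)) / 4
assert np.allclose(rolling_mean.dropna(), from_lags.dropna())

# Rolling max cannot be written as a fixed weighted sum of the lags
rolling_max = toy_close.rolling(window=4).max()

# Exponential moving average: weights decay over all past values
ema = toy_close.ewm(span=4).mean()

print(pd.concat({'mean': rolling_mean, 'max': rolling_max, 'ema': ema}, axis=1))
```

Unlike the rolling statistics, the exponential average has no leading NaN rows, since it uses whatever history is available.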

One also might want to find simple models for long-term trends. Why and how will be discussed in the next article, together with Seasonality.

As the purpose of this article is a workflow, let us write a function to apply rolling.

def make_trends(series, n_window=None, window_type='rolling', function_list:list = ['mean'], **window_kwargs):
    window = getattr(series, window_type)
    
    function_dict = { (f'{series.name}_{window_type}_{foo}' if isinstance(foo,str) 
                             else f'{series.name}_{window_type}_{foo.__name__}'):foo   
                             for foo in function_list}
 
    if n_window is None:
        full_trend =  window(**window_kwargs).agg(function_dict)
    else:
        full_trend =  window(window=n_window, **window_kwargs).agg(function_dict)

    return full_trend

Again, we refer to the video for the function’s construction and a line-by-line explanation.

Finally, a less direct application of rolling windows is to analyze lags’ partial autocorrelation.

We accomplish that by wrapping the respective function from statsmodels, in order to return only the 10th lag, and using our handmade make_trends function:

from statsmodels.tsa.stattools import pacf

def pacf10(series):
    return pacf(series,nlags=10)[10]

df_pacf10 = make_trends(df.close, n_window=120, function_list=[pacf10])

import numpy as np

# make_trends names the column '<series name>_<window type>_<function name>'
(np.abs(df_pacf10) > .1).mean()
# close_rolling_pacf10    0.422105
# dtype: float64

ax = df_pacf10.plot()
ax.hlines(y=[0.1,-0.1], xmin=df.close.index[0], xmax=df.close.index[-1], color='red')

From the numbers and the graph we can conclude that the 10th lag correlation might be significant: in more than 40% of the windows, the partial autocorrelation at lag 10 is at least 10% in absolute value.

Does the Bitcoin price really have a tendency to change its direction every ten days?

(Are you as surprised as I am? 😱😱😱😱)

Main Takeaways

Retrieve financial and sentiment data from yfinance and the SentiCrypt API;
Lags and leads are the most common features in a time series, but one should be careful with the scope of the data: you cannot use as a feature any data you will not have at the moment of prediction;
A variety of trends can be used as features. Nonlinear trends (such as max, min and ExponentialMovingWindow) can be especially useful to train ML models.

We will follow in the next article by discussing seasonality, multi-step models, and why you do not want trends to be in the training data.

Try It Yourself

You can run this code in the Jupyter Notebook (Google Colab) here:

https://colab.research.google.com/drive/1-bOX8l89_HfflnoY9owD9JOJBlhQILen?usp=sharing

👉 Recommended Tutorial: Check out Part II of this series on the Bitcoin pricing prediction data.