Python Time Series Forecast on Bitcoin Data (Part II)

A Time Series is essentially a tabular data with the special feature of having a time index. The common forecast taks is ‘knowing the past (and sometimes the present), predict the future’. This task, taken as a principle, reveals itself in several ways: in how to interpret your problem, in feature engineering and in which forecast strategy to take.

This is the second article in our series. In the first article we discussed how to create features out of a time series using lags and trends. Today we follow the opposite direction by highlighting trends as something you want directly deducted from your model.

Reason is, Machine Learning models work in different ways. Some are good with subtractions, others are not.

For example, for any feature you include in a Linear Regression, the model will automatically detect whether to deduce it from the actual data or not. A Tree Regressor (and its variants) will not behave in the same way and usually will ignore a trend in the data.

Therefore, whenever using the latter type of models, one usually calls for a hybrid model, meaning, we use a Linear(ish) first model to detect global periodic patterns and then apply a second Machine Learning model to infer more sophisticated behavior.

We use the Bitcoin Sentiment Analysis data we captured in the last article as a proof of concept.

The hybrid model part of this article is heavily based on Kaggle’s Time Series Crash Course, however, we intend to automate the process and discuss more in-depth the DeterministicProcess class.

Trends, as something you don’t want to have

(Or that you want it deducted from your model)

An aerodynamic way to deal with trends and seasonality is using, respectively, DeterministicProcess and CalendarFourier from statsmodel. Let us start with the former.

DeterministicProcess aims at creating features to be used in a Regression model to determine trend and periodicity. It takes your DatetimeIndex and a few other parameters and returns a DataFrame full of features for your ML model.

A usual instance of the class will read like the one below. We use the sentic_mean column to illustrate.

from statsmodels.tsa.deterministic import DeterministicProcess

y = dataset['sentic_mean'].copy()

dp = DeterministicProcess(
index=y.index, 
constant=True, 
order=2
)

X = dp.in_sample()

X

We can use X and y as features and target to train a LinearRegression model. In this way, the LinearRegression will learn whatever characteristics from y can be inferred (in our case) solely out of:

the number of elapsed time intervals (trend column);
the last number squared (trend_squared); and
a bias term (const).

Check out the result:

from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X,y)

predictions = pd.DataFrame(
                    model.predict(X),
                    index=X.index,
                    columns=['Deterministic Curve']
)

Comparing predictions and actual values gives:

import matplotlib.pyplot as plt

plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
plt.show()

Even the quadratic term seems ignorable here. The DeterministicProcess class also helps us with future predictions since it carries a method that provides the appropriate future form of the chosen features.

Specifically, the out_of_sample method of dp takes the number of time intervals we want to predict as input and generates the needed features for you.

We use 60 days below as an example:

X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
                        model.predict(X_out),
                        index=X_out.index,
                        columns=['Future Predictions']
)



plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()

Let us repeat the process with sentic_count to have a feeling of a higher-order trend.

👍 As a rule of thumb, the order should be one plus the total number of (trending) hills + peaks in the graph, but not much more than that.

We choose 3 for sentic_count and compare the output with the order=2 result (we do not write the code twice, though).

y = dataset['sentic_count'].copy()

from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier


dp = DeterministicProcess(
    index=y.index, constant=True, order=3
)
X = dp.in_sample()


model = LinearRegression().fit(X,y)

predictions = pd.DataFrame(
                    model.predict(X),
                    index=X.index,
                    columns=['Deterministic Curve']
)


X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
                        model.predict(X_out),
                        index=X_out.index,
                        columns=['Future Predictions']
)



plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()

Although the order-three polynomial fits the data better, use discretion in deciding whether the sentiment count will decrease so drastically in the next 60 days or not. Usually, trust short-time predictions rather than long ones.

DeterministicProcess accepts other parameters, making it a very interesting tool. Find a description of the almost full list below.

dp = DeterministicProcess(
    index, # the DatetimeIndex of your data
    period: int or None, # in case the data shows some periodicity, include the size of the periodic cycle here: 7 would mean 7 days in our case
    constant: bool, # includes a constant feature in the returned DataFrame, i.e., a feature with the same value for everyone. It returns the equivalent of a bias term in Linear Regression 
    order: int, # order of the polynomial that you think better approximates your trend: the simplest the better
    seasonal: bool, # make it True if you think the data has some periodicity. If you make it True and do not specify the period, the dp will try to infer the period out of the index
    additional_terms: tuple of statsmodel's DeterministicTerms, # we come back to this next
    drop: bool # drops resulting features which are collinear to others. If you will use a linear model, make it True
)

Seasonality

As a hardened Mathematician, seasonality is my favorite part because it deals with Fourier analysis (and wave functions are just… cool!):

Do you remember your first ML course when you heard Linear Regression can fit arbitrary functions, not only lines? So, why not a wave function? We just did it for polynomials and didn’t even feel like it 😉

In general, for any expression f which is a function of a feature or of your DatetimeIndex, you can create a feature column whose ith row is the value of f corresponding to the ith index.

Then linear regression finds the constant coefficient multiplying f that best fits your data. Again, this procedure works in general, not only with Datetime indexes – the trend_squared term above is an example of it.

For seasonality, we use a second statsmodel‘s amazing class: CalendarFourier. It is another statsmodel‘s DeterministicTerm class (i.e., with the in_sample and out_of_sample methods) and instantiates with two parameters, 'frequency' and 'order'.

As a 'frequency', the class expects a string such as ‘D’, ‘W’, ‘M’ for day, week or month, respectively, or any of the quite comprehensive Pandas Datetime offset aliases.

The 'order' is the Fourier expansion order which should be understood as the number of waves you are expecting in your chosen frequency (count the number of ups and downs – one wave would be understood as one up and one down)

CalendarFourier integrates swiftly with DeterministicProcess by including an instance of it in the list of additional_terms.

Here is the full code for sentic_mean:

from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier

y = dataset['sentic_mean'].copy()

fourier = CalendarFourier(freq='A',order=2)

dp = DeterministicProcess(
    index=y.index, constant=True, order=2, seasonal=False, additional_terms=[fourier], drop=True
)
X = dp.in_sample()


from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(X,y)

predictions = pd.DataFrame(
                    model.predict(X),
                    index=X.index,
                    columns=['Prediction']
)


X_out = dp.out_of_sample(60)

predictions_out = pd.DataFrame(
                        model.predict(X_out),
                        index=X_out.index,
                        columns=['Prediction']
)



plt.figure()
ax = plt.subplot()
y.plot(ax=ax, legend=True)
predictions.plot(ax=ax)
predictions_out.plot(ax=ax, color='red')
plt.show()

If we take seasonal=True inside DeterministicProcess, we get a crispier line:

Including ax.set_xlim(('2022-08-01', '2022-10-01')) before plt.show() zooms the graph in:

Although I suggest using the seasonal=True parameter with care, it does find interesting patterns (with huge RMSE error, though).

For instance, look at this BTC percentage change zoomed chart:

Here period is set to 30 and seasonal=True. I also manually rescaled the predictions to be better visible in the graphic. Although the predictions are far away from truth, thinking as a trader, isn’t it impressive how many peaks and hills it gets right? At least for this zoomed month…

To maintain the workflow promise, I prepared a code that does everything so far in one shot:

def deseasonalize(df: pd.Series, season_freq='A', fourier_order=0, 
                    constant=True, dp_order=1,  dp_drop=True,
                    model=LinearRegression(),
                    fourier=None,
                    dp=None,
                    **DeterministicProcesskwargs)->(pd.Series, plt.Axes, pd.DataFrame):
    """
    Returns a deseasonalized and detrended df, a seasonal plot, and the fitted DeterministicProcess instance.
    """
    
    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)
    
    if dp is None:
        dp = DeterministicProcess(
        index=df.index,
        constant=True,
        order=dp_order,
        additional_terms=[fourier],
        drop=dp_drop,
        **DeterministicProcesskwargs
        )
    
    X = dp.in_sample()
    model = LinearRegression().fit(X, df)
    y_pred = pd.Series(
                        model.predict(X),
                        index=X.index,
                        name=df.name+'_pred'
    )

    ax = plt.subplot()
    y.plot(ax=ax, legend=True)
    predictions.plot(ax=ax)
    
    y_pred.columns = df.name
    y_deseason = df - y_pred
    y_deseason.name = df.name +'_deseasoned'
    return y_deseason, ax, dp


The sentic_mean analyses get reduced to:

y_deseason, ax, dp=    deseasonalize(y, 
        season_freq='A',
        fourier_order=2,
        constant=True,
        dp_order=2,
        dp_drop=True,
        model=LinearRegression() )

Cycles and Hybrid Models

Let us move on to a complete Machine Learning prediction. We use XGBRegressor and compare its performance among three instances:

Predict sentic_mean directly using lags;
Same prediction adding the seasonal/trending with a DeterministicProcess;
A hybrid model, using LinearRegression to infer and remove seasons/trends, and then apply a XGBRegressor.

The first part will be the bulkier since the other two follow from simple modifications in the resulting code.

Preparing the data

Before any analysis, we split the data in train and test sets. Since we are dealing with time series, this means we set the ‘present date’ as a point in the past and try to predict its respective ‘future’. Here we pick 22 days in the past.

s = dataset['sentic_mean']

s_train = s[:'2022-09-01']

We made this first split in order to not leak data while doing any analysis.

Next, we prepare target and feature sets. Recall our SentiCrypto’s data was set to be available everyday at 8AM. Imagine we are doing the prediction by 9AM.

In this case, anything until the present data (the ‘lag_0‘) can be used as features, and our target is s_train‘s first lead (which we define as a -1 lag). To choose other lags as features, we examine theirs statsmodel’s partial auto-correlation plot:

from statsmodels.graphics.tsaplots import plot_pacf

plot_pacf(s_train, lags=20)

We use the first four for sentic_mean and the first seven + the 11th for sentic_count (you can easily test different combinations with the code below.)

Now we finish choosing features, we go back to the full series for engineering. We apply to s_maen and s_count the make_lags function we defined in the last article (which we transcribe here for convenience).

def make_lags(df, n_lags=1, lead_time=1):
    """
    Compute lags of a pandas.Series from lead_time to lead_time + n_lags. Alternatively, a list can be passed as n_lags.
    Returns a pd.DataFrame whose ith column is either the i+lead_time lag or the ith element of n_lags.
    """
    if isinstance(n_lags,int):
        lag_list = list(range(lead_time, n_lags+lead_time))
    else:
        lag_list = n_lags
    lags ={
        f'{df.name}_lag_{i}': df.shift(i) for i in lag_list
        }
    
    return  pd.concat(lags,axis=1)

X = make_lags(s, [0,1,2,3,4])

y = make_lags(s, [-1])

display(X)
y

Now a train-test split with sklearn is convenient (Notice the shuffle=False parameter, that is key for time series):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=22, shuffle=False)

X_train

(Observe that the final date is set correctly, in accordance with our analysis’ split.)

Applying the regressor:

xgb = XGBRegressor(n_estimators=50)

xgb.fit(X_train,y_train)

predictions_train = pd.DataFrame(
                    xgb.predict(X_train),
                    index=X_train.index,
                    columns=['Prediction']
)


predictions_test = pd.DataFrame(
                    xgb.predict(X_test),
                    index=X_test.index,
                    columns=['Prediction']
)

print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}')

plt.figure()
ax = plt.subplot()
y_train.plot(ax=ax, legend=True)
predictions_train.plot(ax=ax)
plt.show()

plt.figure()
ax = plt.subplot()
y_test.plot(ax=ax, legend=True)
predictions_test.plot(ax=ax)
plt.show()

print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}')

You can reduce overfitness by reducing the number of estimators, but the R2 test score maintains negative.

We can replicate the process for sentic_count (or whatever you want). Below is a function to automate it.

from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from statsmodels.tsa.stattools import pacf




def apply_univariate_prediction(series,  test_size, to_predict=1, nlags=20, minimal_pacf=0.1, model=XGBRegressor(n_estimators=50)):
    '''
    Starting from series, breaks it in train and test subsets; 
    chooses which lags to use based on pacf > minimal_pacf; 
    and applies the given sklearn-type model. 
    Returns the resulting features and targets and the trained model. 
    It plots the graph of the training and prediction, together with their r2_score.
    '''
    s = series.iloc[:-test_size]
    
    if isinstance(to_predict,int):
        to_predict = [to_predict]
    
    from statsmodels.tsa.stattools import pacf

    s_pacf = pd.Series(pacf(s, nlags=nlags))
    
    column_list = s_pacf[s_pacf>minimal_pacf].index
    
    X = make_lags(series, n_lags=column_list).dropna()
    
    y = make_lags(series,n_lags=[-x for x in to_predict]).loc[X.index]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)
    
    model.fit(X_train,y_train)

    predictions_train = pd.DataFrame(
                        model.predict(X_train),
                        index=X_train.index,
                        columns=['Train Predictions']
    )


    predictions_test = pd.DataFrame(
                        model.predict(X_test),
                        index=X_test.index,
                        columns=['Test Predictions']
    )


    fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), sharey=True)

    y_train.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')

    y_test.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()


    print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}')

    print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}')
    
    return X, y, model

apply_univariate_prediction(dataset['sentic_count'],22)

apply_univariate_prediction(dataset['BTC-USD'], 22)

Predicting with Seasons

Since the features created by DeterministicProcess are only time-dependent, we can add them harmlessly to the feature DataFrame we automated get from our univariate predictions.

The predictions, though, are still univariate. We use the deseasonalize function to obtain the season features. The data preparation is as follows:

s = dataset['sentic_mean']

X, y, _ = apply_univariate_prediction(s,22);

s_deseason, _, dp = deseasonalize(s, 
        season_freq='A',
        fourier_order=2,
        constant=True,
        dp_order=2,
        dp_drop=True,
        model=LinearRegression() );
X_f = dp.in_sample().shift(-1)

X = pd.concat([X,X_f], axis=1, join='inner').dropna()

With a bit of copy and paste, we arrive at:

And we actually perform way worse! 😱

Deseasonalizing

Nevertheless, the right-hand graphic illustrates the inability of grasping trends. Our last shot is a hybrid model.

Here we follow three steps:

We use the LinearRegression to capture the seasons and trends, rendering the series y_s. Then we acquire a deseasonalized target y_ds = y-y_s;
Train an XGBRegressor on y_ds and the lagged features, resulting in deseasonalized predictions y_pred;
Finally, we incorporate y_s back to y_pred to compare the final result.

Although Bitcoin-related data are hard to predict, there was a huge improvement on the r2_score (finally something positive!). We define the used function below.

get_hybrid_univariate_prediction(dataset['sentic_mean'], 22,
                                     season_freq='A', 
                                     fourier_order=2, 
                                     constant=True, 
                                     dp_order=2,  
                                     dp_drop=True,
                                     model1=LinearRegression(),
                                     fourier=None, is_seasonal=True, season_period=7,
                                     dp=None, 
                                     to_predict=1, 
                                     nlags=20, 
                                     minimal_pacf=0.1, 
                                     model2=XGBRegressor(n_estimators=50)
                                         
                    )

Instead of going through every detail, we will also automate this code. In order to get the code running smoothly, we revisit the deseasonalize and the apply_univariate_prediction functions in order to remove the plotting part of them.

The final function only plots graphs and returns nothing. It intends to give you a baseline for a hybrid model score. Change the function at will to make it return whatever you need.

def get_season(series: pd.Series, 
               test_size,
               season_freq='A', 
               fourier_order=0, 
               constant=True, 
               dp_order=1,  
               dp_drop=True,
               model1=LinearRegression(),
               fourier=None,
               is_seasonal=False,
               season_period=None,
               dp=None):
    """
    Decompose series in a deseasonalized and a seasonal part. The parameters are relative to the fourier and DeterministicProcess used. 
    Returns y_ds and y_s.

    """

    se = series.iloc[:-test_size]
    
    if fourier is None:
        fourier = CalendarFourier(freq=season_freq, order=fourier_order)

    if dp is None:
        dp = DeterministicProcess(
        index=se.index,
        constant=True,
        order=dp_order,
        additional_terms=[fourier],
        drop=dp_drop,
        seasonal=is_seasonal,
        period=season_period
        )

    X_in = dp.in_sample()
    X_out = dp.out_of_sample(test_size)
    
    model1 = model1.fit(X_in, se)
    
    X = pd.concat([X_in,X_out],axis=0)

    
    y_s = pd.Series(
                        model1.predict(X),
                        index=X.index,
                        name=series.name+'_pred'
    )

    y_s.name = series.name
    y_ds = series - y_s
    y_ds.name = series.name +'_deseasoned'
    return y_ds, y_s




def prepare_data(series, 
              test_size, 
              to_predict=1, 
              nlags=20,
              minimal_pacf=0.1):
    '''
    Creates a feature dataframe by making lags and a target series by a negative to_predict-shift. 
    Returns X, y.
    '''
    s = series.iloc[:-test_size]

    if isinstance(to_predict,int):
        to_predict = [to_predict]

    from statsmodels.tsa.stattools import pacf

    s_pacf = pd.Series(pacf(s,nlags=nlags))

    column_list = s_pacf[s_pacf>minimal_pacf].index

    X = make_lags(series, n_lags=column_list).dropna()

    y = make_lags(series,n_lags=[-x for x in to_predict]).loc[X.index].squeeze()


    return X, y
    
    
def get_hybrid_univariate_prediction(series: pd.Series, 
                                     test_size,
                                     season_freq='A', 
                                     fourier_order=0, 
                    constant=True, 
                                     dp_order=1,  
                                     dp_drop=True,
                    model1=LinearRegression(),
                    fourier=None,
                                     is_seasonal=False,
                                     season_period=None,
                                         dp=None, 
                                         to_predict=1, 
                                         nlags=20, 
                                         minimal_pacf=0.1, 
                                         model2=XGBRegressor(n_estimators=50)
                                         
                    ):
    """
    Apply the hybrid model method by deseasonalizing/detrending a time series with model1 and investigating the resulting series with model2. It plots the respective graphs and computes r2_scores. 
    """
    
    
    
    y_ds, y_s = get_season(series, test_size, season_freq=season_freq, fourier_order=fourier_order, constant=constant, dp_order=dp_order, dp_drop=dp_drop, model1=model1, fourier=fourier, dp=dp, is_seasonal=is_seasonal, season_period=season_period)
    
    X, y_ds = prepare_data(y_ds,test_size=test_size)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y_ds, test_size=test_size, shuffle=False)
    
    y = y_s.squeeze() + y_ds.squeeze()
        
    model2 = model2.fit(X_train,y_train)

    predictions_train = pd.Series(
                        model2.predict(X_train),
                        index=X_train.index,
                        name='Prediction'
    )+y_s[X_train.index]


    predictions_test = pd.Series(
                        model2.predict(X_test),
                        index=X_test.index,
                        name='Prediction'
    )+y_s[X_test.index]

    fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), sharey=True)

    y_train_ps = y.loc[y_train.index]
    y_test_ps = y.loc[y_test.index]

    y_train_ps.plot(ax=ax1, legend=True)
    predictions_train.plot(ax=ax1)
    ax1.set_title('Train Predictions')

    y_test_ps.plot(ax=ax2, legend=True)
    predictions_test.plot(ax=ax2)
    ax2.set_title('Test Predictions')
    plt.show()


    print(f'R2 train score: {r2_score(y_train_ps[:-to_predict],predictions_train[:-to_predict])}')

    print(f'R2 test score: {r2_score(y_test_ps[:-to_predict],predictions_test[:-to_predict])}')

A note of warning: if you do not expect your data to follow time patterns, do focus on cycles! The hybrid model succeeds well for many tasks, but it actually decreases the R2 score of our previous Bitcoin prediction:

get_hybrid_univariate_prediction(dataset['BTC-USD'], 22,
                                     season_freq='A', 
                                     fourier_order=4, 
                                     constant=True, 
                                     dp_order=5,  
                                     dp_drop=True,
                                     model1=LinearRegression(),
                                     fourier=None, is_seasonal=True, season_period=30,
                                     dp=None, 
                                     to_predict=1, 
                                     nlags=20, 
                                     minimal_pacf=0.05, 
                                     model2=XGBRegressor(n_estimators=20)
                                         
                    )

The former score was around 0.31.

Conclusion

This article aims at presenting functions for your time series workflow, specially for lags and deseasonalization. Use them with care, though: apply them to have baseline scores before delving into more sophisticated models.

In future articles we will bring forth multi-step predictions (predict more than one day ahead) and compare performance of different models, both univariate and multivariate.

If you want to learn more about forecasting and how to integrate your models in a real dashboard app, check out our tutorial:

👉 How I Created a Forecasting App Using Streamlit