A Time Series is essentially a tabular data with the special feature of having a time index. The common forecast taks is *‘knowing the past (and sometimes the present), predict the future’*. This task, taken as a principle, reveals itself in several ways: in how to interpret your problem, in feature engineering and in which forecast strategy to take.

This is the second article in our series. In the first article we discussed how to create features out of a time series using lags and trends. Today we follow the opposite direction by highlighting trends as something you want directly deducted from your model.

Reason is, Machine Learning models work in different ways. Some are good with subtractions, others are not.

For example, for any feature you include in a Linear Regression, the model will automatically detect whether to deduce it from the actual data or not. A Tree Regressor (and its variants) will not behave in the same way and usually will ignore a trend in the data.

Therefore, whenever using the latter type of models, one usually calls for a *hybrid model*, meaning, we use a Linear(ish) first model to detect global periodic patterns and then apply a second Machine Learning model to infer more sophisticated behavior.

We use the Bitcoin Sentiment Analysis data we captured in the last article as a proof of concept.

The hybrid model part of this article is heavily based on Kaggle’s Time Series Crash Course, however, we intend to automate the process and discuss more in-depth the `DeterministicProcess`

class.

## Trends, as something you don’t want to have

(Or that you want it deducted from your model)

An aerodynamic way to deal with trends and seasonality is using, respectively, `DeterministicProcess`

and `CalendarFourier`

from `statsmodel`

. Let us start with the former.

`DeterministicProcess`

aims at creating features to be used in a Regression model to determine trend and periodicity. It takes your `DatetimeIndex`

and a few other parameters and returns a DataFrame full of features for your ML model.

A usual instance of the class will read like the one below. We use the `sentic_mean`

column to illustrate.

from statsmodels.tsa.deterministic import DeterministicProcess y = dataset['sentic_mean'].copy() dp = DeterministicProcess( index=y.index, constant=True, order=2 ) X = dp.in_sample() X

We can use `X`

and `y`

as features and target to train a `LinearRegression`

model. In this way, the `LinearRegression`

will learn whatever characteristics from `y`

can be inferred (in our case) solely out of:

- the number of elapsed time intervals (
`trend`

column); - the last number squared (
`trend_squared`

); and - a bias term (
`const`

).

Check out the result:

from sklearn.linear_model import LinearRegression model = LinearRegression().fit(X,y) predictions = pd.DataFrame( model.predict(X), index=X.index, columns=['Deterministic Curve'] )

Comparing predictions and actual values gives:

import matplotlib.pyplot as plt plt.figure() ax = plt.subplot() y.plot(ax=ax, legend=True) predictions.plot(ax=ax) plt.show()

Even the quadratic term seems ignorable here. The `DeterministicProcess`

class also helps us with future predictions since it carries a method that provides the appropriate future form of the chosen features.

Specifically, the `out_of_sample`

method of `dp`

takes the number of time intervals we want to predict as input and generates the needed features for you.

We use 60 days below as an example:

X_out = dp.out_of_sample(60) predictions_out = pd.DataFrame( model.predict(X_out), index=X_out.index, columns=['Future Predictions'] ) plt.figure() ax = plt.subplot() y.plot(ax=ax, legend=True) predictions.plot(ax=ax) predictions_out.plot(ax=ax, color='red') plt.show()

Let us repeat the process with `sentic_count`

to have a feeling of a higher-order trend.

👍 **As a rule of thumb, the order should be one plus the total number of (trending) hills + peaks in the graph, but not much more than that.**

We choose 3 for `sentic_count`

and compare the output with the `order=2`

result (we do not write the code twice, though).

y = dataset['sentic_count'].copy() from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier dp = DeterministicProcess( index=y.index, constant=True, order=3 ) X = dp.in_sample() model = LinearRegression().fit(X,y) predictions = pd.DataFrame( model.predict(X), index=X.index, columns=['Deterministic Curve'] ) X_out = dp.out_of_sample(60) predictions_out = pd.DataFrame( model.predict(X_out), index=X_out.index, columns=['Future Predictions'] ) plt.figure() ax = plt.subplot() y.plot(ax=ax, legend=True) predictions.plot(ax=ax) predictions_out.plot(ax=ax, color='red') plt.show()

Although the order-three polynomial fits the data better, use discretion in deciding whether the sentiment count will decrease so drastically in the next 60 days or not. Usually, trust short-time predictions rather than long ones.

`DeterministicProcess`

accepts other parameters, making it a very interesting tool. Find a description of the almost full list below.

dp = DeterministicProcess( index, # the DatetimeIndex of your data period: int or None, # in case the data shows some periodicity, include the size of the periodic cycle here: 7 would mean 7 days in our case constant: bool, # includes a constant feature in the returned DataFrame, i.e., a feature with the same value for everyone. It returns the equivalent of a bias term in Linear Regression order: int, # order of the polynomial that you think better approximates your trend: the simplest the better seasonal: bool, # make it True if you think the data has some periodicity. If you make it True and do not specify the period, the dp will try to infer the period out of the index additional_terms: tuple of statsmodel's DeterministicTerms, # we come back to this next drop: bool # drops resulting features which are collinear to others. If you will use a linear model, make it True )

## Seasonality

As a hardened Mathematician, seasonality is my favorite part because it deals with Fourier analysis (and wave functions are just… cool!):

Do you remember your first ML course when you heard Linear Regression can fit arbitrary functions, not only lines? So, why not a wave function? We just did it for polynomials and didn’t even feel like it 😉

In general, for any expression `f`

which is a function of a feature or of your `DatetimeIndex`

, you can create a feature column whose ith row is the value of `f`

corresponding to the ith index.

Then linear regression finds the constant coefficient multiplying `f`

that best fits your data. Again, this procedure works in general, not only with Datetime indexes – the `trend_squared`

term above is an example of it.

For seasonality, we use a second `statsmodel`

‘s amazing class: `CalendarFourier`

. It is another `statsmodel`

‘s `DeterministicTerm`

class (i.e., with the `in_sample`

and `out_of_sample`

methods) and instantiates with two parameters, `'frequency'`

and `'order'`

.

As a `'frequency'`

, the class expects a string such as ‘D’, ‘W’, ‘M’ for day, week or month, respectively, or any of the quite comprehensive Pandas Datetime offset aliases.

The `'order'`

is the Fourier expansion order which should be understood as the number of waves you are expecting in your chosen frequency (count the number of ups and downs – one wave would be understood as one up and one down)

`CalendarFourier`

integrates swiftly with `DeterministicProcess`

by including an instance of it in the list of `additional_terms`

.

Here is the full code for `sentic_mean`

:

from statsmodels.tsa.deterministic import DeterministicProcess, CalendarFourier y = dataset['sentic_mean'].copy() fourier = CalendarFourier(freq='A',order=2) dp = DeterministicProcess( index=y.index, constant=True, order=2, seasonal=False, additional_terms=[fourier], drop=True ) X = dp.in_sample() from sklearn.linear_model import LinearRegression model = LinearRegression().fit(X,y) predictions = pd.DataFrame( model.predict(X), index=X.index, columns=['Prediction'] ) X_out = dp.out_of_sample(60) predictions_out = pd.DataFrame( model.predict(X_out), index=X_out.index, columns=['Prediction'] ) plt.figure() ax = plt.subplot() y.plot(ax=ax, legend=True) predictions.plot(ax=ax) predictions_out.plot(ax=ax, color='red') plt.show()

If we take `seasonal=True`

inside `DeterministicProcess`

, we get a crispier line:

Including `ax.set_xlim(('2022-08-01', '2022-10-01'))`

before `plt.show()`

zooms the graph in:

Although I suggest using the `seasonal=True`

parameter with care, it does find interesting patterns (with huge RMSE error, though).

For instance, look at this BTC percentage change zoomed chart:

Here period is set to 30 and `seasonal=True`

. I also manually rescaled the predictions to be better visible in the graphic. Although the predictions are far away from truth, thinking as a trader, isn’t it impressive how many peaks and hills it gets right? At least for this zoomed month…

To maintain the workflow promise, I prepared a code that does everything so far in one shot:

def deseasonalize(df: pd.Series, season_freq='A', fourier_order=0, constant=True, dp_order=1, dp_drop=True, model=LinearRegression(), fourier=None, dp=None, **DeterministicProcesskwargs)->(pd.Series, plt.Axes, pd.DataFrame): """ Returns a deseasonalized and detrended df, a seasonal plot, and the fitted DeterministicProcess instance. """ if fourier is None: fourier = CalendarFourier(freq=season_freq, order=fourier_order) if dp is None: dp = DeterministicProcess( index=df.index, constant=True, order=dp_order, additional_terms=[fourier], drop=dp_drop, **DeterministicProcesskwargs ) X = dp.in_sample() model = LinearRegression().fit(X, df) y_pred = pd.Series( model.predict(X), index=X.index, name=df.name+'_pred' ) ax = plt.subplot() y.plot(ax=ax, legend=True) predictions.plot(ax=ax) y_pred.columns = df.name y_deseason = df - y_pred y_deseason.name = df.name +'_deseasoned' return y_deseason, ax, dp The sentic_mean analyses get reduced to: y_deseason, ax, dp= deseasonalize(y, season_freq='A', fourier_order=2, constant=True, dp_order=2, dp_drop=True, model=LinearRegression() )

## Cycles and Hybrid Models

Let us move on to a complete Machine Learning prediction. We use `XGBRegressor`

and compare its performance among three instances:

- Predict
`sentic_mean`

directly using lags; - Same prediction adding the seasonal/trending with a
`DeterministicProcess`

; - A hybrid model, using
`LinearRegression`

to infer and remove seasons/trends, and then apply a`XGBRegressor`

.

The first part will be the bulkier since the other two follow from simple modifications in the resulting code.

### Preparing the data

Before any analysis, we split the data in train and test sets. Since we are dealing with time series, this means we set the ‘present date’ as a point in the past and try to predict its respective ‘future’. Here we pick 22 days in the past.

s = dataset['sentic_mean'] s_train = s[:'2022-09-01']

We made this first split in order to not leak data while doing any analysis.

Next, we prepare target and feature sets. Recall our SentiCrypto’s data was set to be available everyday at 8AM. Imagine we are doing the prediction by 9AM.

In this case, anything until the present data (the ‘`lag_0`

‘) can be used as features, and our target is `s_train`

‘s first lead (which we define as a -1 lag). To choose other lags as features, we examine theirs statsmodel’s partial auto-correlation plot:

from statsmodels.graphics.tsaplots import plot_pacf plot_pacf(s_train, lags=20)

We use the first four for `sentic_mean`

and the first seven + the 11th for `sentic_count`

(you can easily test different combinations with the code below.)

Now we finish choosing features, we go back to the full series for engineering. We apply to `s_maen`

and `s_count`

the `make_lags`

function we defined in the last article (which we transcribe here for convenience).

def make_lags(df, n_lags=1, lead_time=1): """ Compute lags of a pandas.Series from lead_time to lead_time + n_lags. Alternatively, a list can be passed as n_lags. Returns a pd.DataFrame whose ith column is either the i+lead_time lag or the ith element of n_lags. """ if isinstance(n_lags,int): lag_list = list(range(lead_time, n_lags+lead_time)) else: lag_list = n_lags lags ={ f'{df.name}_lag_{i}': df.shift(i) for i in lag_list } return pd.concat(lags,axis=1) X = make_lags(s, [0,1,2,3,4]) y = make_lags(s, [-1]) display(X) y

Now a train-test split with `sklearn`

is convenient (Notice the `shuffle=False`

parameter, that is key for time series):

from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=22, shuffle=False) X_train

(Observe that the final date is set correctly, in accordance with our analysis’ split.)

Applying the regressor:

xgb = XGBRegressor(n_estimators=50) xgb.fit(X_train,y_train) predictions_train = pd.DataFrame( xgb.predict(X_train), index=X_train.index, columns=['Prediction'] ) predictions_test = pd.DataFrame( xgb.predict(X_test), index=X_test.index, columns=['Prediction'] ) print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}') plt.figure() ax = plt.subplot() y_train.plot(ax=ax, legend=True) predictions_train.plot(ax=ax) plt.show() plt.figure() ax = plt.subplot() y_test.plot(ax=ax, legend=True) predictions_test.plot(ax=ax) plt.show() print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}')

You can reduce overfitness by reducing the number of estimators, but the R2 test score maintains negative.

We can replicate the process for `sentic_count`

(or whatever you want). Below is a function to automate it.

from xgboost import XGBRegressor from sklearn.model_selection import train_test_split from sklearn.metrics import r2_score from statsmodels.tsa.stattools import pacf def apply_univariate_prediction(series, test_size, to_predict=1, nlags=20, minimal_pacf=0.1, model=XGBRegressor(n_estimators=50)): ''' Starting from series, breaks it in train and test subsets; chooses which lags to use based on pacf > minimal_pacf; and applies the given sklearn-type model. Returns the resulting features and targets and the trained model. It plots the graph of the training and prediction, together with their r2_score. ''' s = series.iloc[:-test_size] if isinstance(to_predict,int): to_predict = [to_predict] from statsmodels.tsa.stattools import pacf s_pacf = pd.Series(pacf(s, nlags=nlags)) column_list = s_pacf[s_pacf>minimal_pacf].index X = make_lags(series, n_lags=column_list).dropna() y = make_lags(series,n_lags=[-x for x in to_predict]).loc[X.index] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False) model.fit(X_train,y_train) predictions_train = pd.DataFrame( model.predict(X_train), index=X_train.index, columns=['Train Predictions'] ) predictions_test = pd.DataFrame( model.predict(X_test), index=X_test.index, columns=['Test Predictions'] ) fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), sharey=True) y_train.plot(ax=ax1, legend=True) predictions_train.plot(ax=ax1) ax1.set_title('Train Predictions') y_test.plot(ax=ax2, legend=True) predictions_test.plot(ax=ax2) ax2.set_title('Test Predictions') plt.show() print(f'R2 train score: {r2_score(y_train[:-1],predictions_train[:-1])}') print(f'R2 test score: {r2_score(y_test[:-1],predictions_test[:-1])}') return X, y, model apply_univariate_prediction(dataset['sentic_count'],22)

apply_univariate_prediction(dataset['BTC-USD'], 22)

## Predicting with Seasons

Since the features created by `DeterministicProcess`

are only time-dependent, we can add them harmlessly to the feature DataFrame we automated get from our univariate predictions.

The predictions, though, are still univariate. We use the deseasonalize function to obtain the season features. The data preparation is as follows:

s = dataset['sentic_mean'] X, y, _ = apply_univariate_prediction(s,22); s_deseason, _, dp = deseasonalize(s, season_freq='A', fourier_order=2, constant=True, dp_order=2, dp_drop=True, model=LinearRegression() ); X_f = dp.in_sample().shift(-1) X = pd.concat([X,X_f], axis=1, join='inner').dropna()

With a bit of copy and paste, we arrive at:

And we actually perform way worse! 😱

## Deseasonalizing

Nevertheless, the right-hand graphic illustrates the inability of grasping trends. Our last shot is a hybrid model.

Here we follow three steps:

- We use the
`LinearRegression`

to capture the seasons and trends, rendering the series`y_s`

. Then we acquire a deseasonalized target`y_ds = y-y_s`

; - Train an
`XGBRegressor`

on`y_ds`

and the lagged features, resulting in deseasonalized predictions`y_pred`

; - Finally, we incorporate
`y_s`

back to`y_pred`

to compare the final result.

Although Bitcoin-related data are hard to predict, there was a huge improvement on the `r2_score`

(finally something positive!). We define the used function below.

get_hybrid_univariate_prediction(dataset['sentic_mean'], 22, season_freq='A', fourier_order=2, constant=True, dp_order=2, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=True, season_period=7, dp=None, to_predict=1, nlags=20, minimal_pacf=0.1, model2=XGBRegressor(n_estimators=50) )

Instead of going through every detail, we will also automate this code. In order to get the code running smoothly, we revisit the deseasonalize and the `apply_univariate_prediction`

functions in order to remove the plotting part of them.

The final function only plots graphs and returns nothing. It intends to give you a baseline for a hybrid model score. Change the function at will to make it return whatever you need.

def get_season(series: pd.Series, test_size, season_freq='A', fourier_order=0, constant=True, dp_order=1, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=False, season_period=None, dp=None): """ Decompose series in a deseasonalized and a seasonal part. The parameters are relative to the fourier and DeterministicProcess used. Returns y_ds and y_s. """ se = series.iloc[:-test_size] if fourier is None: fourier = CalendarFourier(freq=season_freq, order=fourier_order) if dp is None: dp = DeterministicProcess( index=se.index, constant=True, order=dp_order, additional_terms=[fourier], drop=dp_drop, seasonal=is_seasonal, period=season_period ) X_in = dp.in_sample() X_out = dp.out_of_sample(test_size) model1 = model1.fit(X_in, se) X = pd.concat([X_in,X_out],axis=0) y_s = pd.Series( model1.predict(X), index=X.index, name=series.name+'_pred' ) y_s.name = series.name y_ds = series - y_s y_ds.name = series.name +'_deseasoned' return y_ds, y_s def prepare_data(series, test_size, to_predict=1, nlags=20, minimal_pacf=0.1): ''' Creates a feature dataframe by making lags and a target series by a negative to_predict-shift. Returns X, y. ''' s = series.iloc[:-test_size] if isinstance(to_predict,int): to_predict = [to_predict] from statsmodels.tsa.stattools import pacf s_pacf = pd.Series(pacf(s,nlags=nlags)) column_list = s_pacf[s_pacf>minimal_pacf].index X = make_lags(series, n_lags=column_list).dropna() y = make_lags(series,n_lags=[-x for x in to_predict]).loc[X.index].squeeze() return X, y def get_hybrid_univariate_prediction(series: pd.Series, test_size, season_freq='A', fourier_order=0, constant=True, dp_order=1, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=False, season_period=None, dp=None, to_predict=1, nlags=20, minimal_pacf=0.1, model2=XGBRegressor(n_estimators=50) ): """ Apply the hybrid model method by deseasonalizing/detrending a time series with model1 and investigating the resulting series with model2. It plots the respective graphs and computes r2_scores. """ y_ds, y_s = get_season(series, test_size, season_freq=season_freq, fourier_order=fourier_order, constant=constant, dp_order=dp_order, dp_drop=dp_drop, model1=model1, fourier=fourier, dp=dp, is_seasonal=is_seasonal, season_period=season_period) X, y_ds = prepare_data(y_ds,test_size=test_size) X_train, X_test, y_train, y_test = train_test_split(X, y_ds, test_size=test_size, shuffle=False) y = y_s.squeeze() + y_ds.squeeze() model2 = model2.fit(X_train,y_train) predictions_train = pd.Series( model2.predict(X_train), index=X_train.index, name='Prediction' )+y_s[X_train.index] predictions_test = pd.Series( model2.predict(X_test), index=X_test.index, name='Prediction' )+y_s[X_test.index] fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,5), sharey=True) y_train_ps = y.loc[y_train.index] y_test_ps = y.loc[y_test.index] y_train_ps.plot(ax=ax1, legend=True) predictions_train.plot(ax=ax1) ax1.set_title('Train Predictions') y_test_ps.plot(ax=ax2, legend=True) predictions_test.plot(ax=ax2) ax2.set_title('Test Predictions') plt.show() print(f'R2 train score: {r2_score(y_train_ps[:-to_predict],predictions_train[:-to_predict])}') print(f'R2 test score: {r2_score(y_test_ps[:-to_predict],predictions_test[:-to_predict])}')

**A note of warning:** if you do not expect your data to follow time patterns, do focus on cycles! The hybrid model succeeds well for many tasks, but it actually decreases the R2 score of our previous Bitcoin prediction:

get_hybrid_univariate_prediction(dataset['BTC-USD'], 22, season_freq='A', fourier_order=4, constant=True, dp_order=5, dp_drop=True, model1=LinearRegression(), fourier=None, is_seasonal=True, season_period=30, dp=None, to_predict=1, nlags=20, minimal_pacf=0.05, model2=XGBRegressor(n_estimators=20) )

The former score was around 0.31.

## Conclusion

This article aims at presenting functions for your time series workflow, specially for lags and deseasonalization. Use them with care, though: apply them to have baseline scores before delving into more sophisticated models.

In future articles we will bring forth multi-step predictions (predict more than one day ahead) and compare performance of different models, both univariate and multivariate.