Although Neural Networks do a tremendous job learning rules in tabular, structured data, it leaves a great deal to be desired in terms of ‘unstructured’ data. And there we come to a new concept: Recurrent Neural Networks.
Recurrent Neural Network
A Recurrent Neural Network is to a Feedforward Neural Network as a single object is to a list: it may be thought as a set of interrelated feedforward networks, or a looped network.
It is specialized in picking up and highlighting the main characteristics of your data (more on that in Andrej Karpathy’s Blog). They are often followed by a Feed Forward (Dense) Layer which will weigh the output.
Long Short-Term Memory
Long Short-Term Memory (LSTM) clusters have the extra special ability to deal with time (more on it can be found in Colah’s article).
As the term memory suggests, its greatest promise is to understand correlations between past and present events. In particular, they fit naturally in time series forecasts.
Here we aim at a hands-on introduction to several LSTM-based architectures (and more is to come π).
Article Overview
We use Bitcoin daily closing price as a case study. Specifically, we use the Bitcoin price and sentiment analysis we have gathered in a previous article. We use TensorFlow‘s Keras API for the implementation.
In this article will aim at the following architectures:
- ‘Vanilla’ LSTM
- Stacked LSTM
- Bidirectional LSTM
- Encoder-Decoder LSTM-LSTM
- Encoder-Decoder CNN-LSTM
The last one being the more convoluted (pun intended).
There is one main issue dealing with time series, which is the implementation of the problem. Are common situation both having only the historical target value alone (univariate problem) or together with other information (multivariate problem).
Moreover, you might be interested in one-step prediction or a multi-step prediction, i.e., predicting only the next day or, say, all days in the next week. Although it doesn’t sound so, you have to adjust your model to whatever situation you are facing.Β
Think of how you would deal with a multivariate multi-step problem: should you train a one-step model and forecast all features in order to feed your model to predict the following days? That would be a crazy!
Kaggle’s time series course does a good job introducing the several strategies present to deal with multi-step prediction. Fortunately, setting an LSTM network for a multi-step multivariate problem is as easy as setting it for a univariate one-step problem – you just need to change two numbers.
This is another advantage of Neural Networks, apart from its capacity of memory.Β
Of course, the architecture list above is not exhaustive. For instance, a new Attention layer was recently introduced, which has been working wonders. We shall come back to it in a next article, where we will walk through a hybrid Attention-CLX model.
Credits to ML Mastery blog for part of the code.Β
π« Disclaimer: This article is a programming/data analysis tutorial only and is not intended to be any kind of investment advice.
How to Prepare the Data for LSTM?
We will use two sources of data, both explicit in our previous article: the SentiCrypt‘s Bitcoin sentiment analysis and Bitcoin’s daily closing price (by following the steps in the previous article, you can do it differently, using a minute-base data, for example).
Let us load the already-saved sentiment analysis and download the Bitcoin price:
import pandas as pd import yfinance as yf sentic = pd.read_csv('sentic.csv', index_col=0, parse_dates=True) sentic.index.freq='D' btc = yf.download('BTC-USD', start='2020-02-14', end='2022-09-23', period='1d')[['Close']] btc.columns = ['btc'] data = pd.concat([sentic,btc], axis=1) data
The LSTM layer expects a 3D array as input whose shape represents:
(data_size, timesteps, number_of_features)
.
Meaning, the first and last elements are the number of rows and columns from the input data, respectively. The timestep argument is the size of the time chunk you want your LSTM to process at a time. This will be the time frame the LSTM will look for relations between past and present. It is essentially the size of its (long short-term) memory.
To decide how many time-steps, we recall our first time series article where we explored partial auto-correlations of Bitcoin price’s lags.
That is easily achieved through statsmodels:
from statsmodels.graphics.tsaplots import plot_pacf import matplotlib.pyplot as plt plot_pacf(data.btc, lags=20) plt.show()
If you were there, in the first article, with me, you might remember our curious 10-lags correlation. Here we use this magic number and feed the model with a 10 days frame and to make a 5 days prediction. I found the results with 10 days better than for 6 or 20 days (for most cases – see below for more about this). We also assume we have today’s data and try to forecast the next 5 days.
An easy way to accomplish the reshaping of the data is through (a slight modification) of our make_lags
function together with NumPy’s reshape()
method.
So, instead of a Series, we will take a DataFrame as input and will output a concatenation of the original frame with its respective lags. We use negative lags to prepare the target DataFrame. We will ignore observations with the produced NaN values and will use the align method to align their indexes.Β
def make_lags(df, n_lags=1, lead_time=1): """ Compute lags of a pandas.DataFrame from lead_time to lead_time + n_lags. Alternatively, a list can be passed as n_lags. Returns a pd.DataFrame resulting from the concatenation of df's shifts. """ if isinstance(n_lags,int): lag_list = range(lead_time, n_lags+lead_time) else: lag_list = n_lags lags=list() for i in lag_list: df_lag = df.shift(i) if i!=0: df_lag.columns = [f'{col}_lag_{i}' for col in df.columns] lags.append(df_lag) return pd.concat(lags, axis=1) X = make_lags(data, n_lags=20, lead_time=0).dropna() y = make_lags(data[['btc']], n_lags=range(-5,0)).dropna() X, y = X.align(y, join='inner', axis=0)
Next, we train-test split the data with sklearn, taking 10% as test size. As usual for time series, we include shuffle=False
as a parameter.
from sklearn.model_selection import train_test_split X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.1, shuffle=False)
Before proceeding, it is good practice to normalize the data before feeding it into a Neural Network. We do it now, before things get 3D.
from sklearn.preprocessing import MinMaxScaler mms = MinMaxScaler().fit(X_train) X_train, X_val = mms.transform(X_train), mms.transform(X_val)
Finally, we use NumPy to reshape everything to 3D arrays. Observe that there is not such a thing as a 3D pd.DataFrame
.
import numpy as np def add_dim(df, timesteps=5): """ Transforms a pd.DataFrame into a 3D np.array with shape (n_samples, timesteps, n_features) """ df = np.array(df) array_3d = df.reshape(df.shape[0],timesteps ,df.shape[1]//timesteps) return array_3d X_train, X_val = map(add_dim, [X_train, X_val], [timesteps]*2)
Of course, you can always prepare a function to do everything in one shot:
def prepare_data(df, target_name, n_lags, n_steps, lead_time, test_size, normalize=True): ''' Prepare data for LSTM. ''' if isinstance(n_steps,int): n_steps = range(1,n_steps+1) n_steps = [-x for x in list(n_steps)] X = make_lags(df, n_lags=n_lags, lead_time=lead_time).dropna() y = make_lags(df[[target_name]], n_lags=n_steps).dropna() X, y = X.align(y, join='inner', axis=0) from sklearn.model_selection import train_test_split X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=test_size, shuffle=False) if normalize: from sklearn.preprocessing import MinMaxScaler mms = MinMaxScaler().fit(X_train) X_train, X_val = mms.transform(X_train), mms.transform(X_val) if isinstance(n_lags,int): timesteps = n_lags else: timesteps = len(n_lags) return add_dim(X_train,timesteps), add_dim(X_val,timesteps), y_train, y_val
Note that one should give positive values to n_steps
to have the right negative shifts. Fortunately, y_train
, y_val
are not reshaped, which makes life easier when comparing predictions with reality.
All set, let’s start with the most basic Vanilla model.
π‘ Side note: We are keeping things simple here, but in a future post, we will prepare our own batches and explore better the stateful parameter of an LSTM layer. More on its input and output can be found in Mohammad’s Git.
How to Implement Vanilla LSTM with Keras?
A model is called Vanilla when it has no additional structure apart from the output layer.
To implement it we add an LSTM and a Dense layer. We must pass the number of units of each and the input shape for the LSTM layer.
The input shape is exactly (n_timesteps, n_features)
which can be inferred from X_train.shape
. The number of units for the LSTM layer is a hyperparameter and shall be tuned, for the Dense layer it is the number of outputs we want. Therefore 5.
Next follows a hypertuning-friendly code, specifying the main parameters in advance.Β
from keras.models import Sequential from keras.layers import Dense, LSTM # Data preparation n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2 # hyperparameters epochs, batch_size, verbose = 50, 72, 0 model_params = {} # preparing data X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size) # model architecture vanilla = Sequential() vanilla.add(LSTM(units=200, activation='relu', input_shape=(X_train.shape[1],X_train.shape[2]) )) vanilla.add(Dense(n_steps))
The model_params
dictionary will be useful for including additional parameters to the compile method, such as an EarlyStopping
callback.Β
We also write a function that fits the model, plot and assess predictions. The present code does not output anything, so, feel free to change it in order to do so. We fix the optimizer as Adam and the loss metric as Mean Squared Error.
def fit_model(model, learning_rate=0.001, time_distributed=False, epochs=epochs, batch_size=batch_size, verbose=verbose): y_ind = y_val.index if time_distributed: y_train_0 = y_train.to_numpy().reshape((y_train.shape[0], y_train.shape[1],1)) y_val_0 = y_val.to_numpy().reshape((y_val.shape[0], y_val.shape[1],1)) else: y_train_0 = y_train y_val_0 = y_val # fit network from keras.optimizers import Adam adam = Adam(learning_rate=learning_rate) model.compile(loss='mse', optimizer='adam') history = model.fit(X_train, y_train_0, epochs=epochs, batch_size=batch_size, verbose=verbose, **model_params, validation_data=(X_val, y_val_0), shuffle=False) # make a prediction if time_distributed: predictions = model.predict(X_val)[:,:,0] else: predictions = model.predict(X_val) yhat = pd.DataFrame(predictions, index=y_ind, columns=[f'pred_lag_{i}' for i in range(-n_steps,0)]) yhat_shifted = pd.concat([yhat.iloc[:,i].shift(-n_steps+i) for i in range(len(yhat.columns))], axis=1) # calculate RMSE from sklearn.metrics import mean_squared_error, r2_score rmse = np.sqrt(mean_squared_error(y_val, yhat)) import matplotlib.pyplot as plt fig, (ax1,ax2) = plt.subplots(2,1,figsize=(14,14)) y_val.iloc[:,0].plot(ax=ax2,legend=True) yhat_shifted.plot(ax=ax2) ax2.set_title('Prediction comparison') ax2.annotate(f'RMSE: {rmse:.5f} \n R2 score: {r2_score(yhat,y_val):.5f}', xy=(.68,.93), xycoords='axes fraction') ax1.plot(history.history['loss'], label='train') ax1.plot(history.history['val_loss'], label='test') ax1.legend() plt.show()
The time_distributed
parameter will be used in the last two architectures.
I opted to set a manual learning_rate
since once the Stacked LSTM’s output was an array of NaNs. After figuring out that the gradient descent was not converging, that was fixed by decreasing Adam’s learning rate.
Use verbose=1
as a global parameter to debug your network.
Without further ado:
fit_model(vanilla)
The performance is comparable to our XGBoost 1-day prediction in the last article:
Moreover, we are predicting 5 days, not only one, making the r2 score more impressive.
What bothers me, on the other hand, is the fact the predictions for all five days look identical. It requires further analysis to understand why that is happening, which we will not do here.
How to Build a Stacked LSTM?
We also can queue two LSTM layers.
To this aim, we need to be careful to give a 3D input to the second LSTM layer and that is the role the parameter return_sequences
plays. We gain a slight increase in the training score in this case.
# model architecture stacked = Sequential() stacked.add(LSTM(100, activation='relu', return_sequences=True, input_shape=(X_train.shape[1],X_train.shape[2]))) stacked.add(LSTM(100, activation='relu')) stacked.add(Dense(n_steps)) fit_model(stacked)
What is a Bidirectional LSTM Layer?
In general, any RNN within minimal requirements can be made bidirectional through Keras’ Bidirectional layer. It stacks two copies of your RNN layer, making one backward.Β
You can either specify the backward_layer
as a second RNN layer or just wrap a single one, which will make the Bidirectional instance use a copy as the backward model. An implementation can be found below.
The score is comparable to the Stacked LSTM.
from keras.layers import Bidirectional bilstm = Sequential() bilstm.add(Bidirectional(LSTM(100, activation='relu'), input_shape=(X_train.shape[1], X_train.shape[2]))) bilstm.add(Dense(n_steps)) fit_model(bilstm)
Encoder-Decoder LSTM
An Encoder-Decoder structure is designed in a way you have one network dedicated to feature selection and a second one to the actual forecast. The architectures used can be of different types; even of recurrent-non recurrent pairs are allowed.
Here we explore two pairs: LSTM-LSTM and CNN-LSTM.Β
Compared to the previous presented architectures, the main difference is the inclusion of the RepeatVector
layer and the wrapper TimeDistributed
.
Although the RepeatVector
is smoothly included, the TimeDistributed
layer needs some care. It wraps a layer object and has the duty to apply a copy of each to each temporal slice imputed into it. It considers the .shape[1]
of the first input as the temporal dimension (our prepare_data
is in accordance to that).
Moreover, one has to watch out since it outputs a 3D array, in particular our model will output 3D predictions.
For this reason, we have to feed the model with reshaped y_val
, y_train
so that the loss functions can be computed. Fortunately, we already included the time_distributed
parameter in the fit_model
to deal with the reshaping.
We also increase the number of Epochs since these networks seem to take longer to find a minimum. We include an EarlyStopping
though. It already gives an astonishing score!
from keras.layers import RepeatVector, TimeDistributed # Data preparation n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2 # hyperparameters epochs, batch_size, verbose = 300, 32, 0 model_params = {'callbacks':[EarlyStopping( monitor="val_loss", patience=20, mode="auto")]} # preparing data X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size, normalize=True) # Encoder lstmlstm = Sequential() lstmlstm.add(LSTM(100, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2]))) lstmlstm.add(RepeatVector(n_steps)) # Decoder lstmlstm.add(LSTM(100, activation='relu', return_sequences=True)) lstmlstm.add(TimeDistributed(Dense(n_steps))) fit_model(lstmlstm, time_distributed=True)
This is the first time the steps outputs are visibly different from each other.
Nevertheless, it seems to be following some trend. In theory, the NN should be so powerful that it can capture trends as well. However, in practice detrending often gives better results. Nevertheless, 0.82 is a massive increase from our 0.32 XGBoost.Β
Encoder-Decoder CNN-LSTM Network
The last architecture we present is the CNN-LSTM one.
Here a Convolutional Neural Network is used as a feature selector, being well-known to perform well in this role for photos and videos.
The main reason they are so useful in this case is mathematical: the convolutional part of CNN’s name refers to the convolution operation in mathematics, which is used to emphasize translation-invariant features.
That makes complete sense when you have a photo, since you want your mobile phone to recoginze Toto as a dog, independent if it is in the lower-left corner or in the upper-center of the picture (of course your dog’s name is Toto, right?). You may recognize the CNN action as the smoothed lines in the graph.Β
from keras.layers import RepeatVector, TimeDistributed, Conv1D, MaxPooling1D, Flatten # Data preparation n_lags, n_steps, lead_time, test_size = 10, 5, 0, .2 # hyperparameters epochs, batch_size, verbose = 300, 32, 0 model_params = {'callbacks':[EarlyStopping( monitor="val_loss", patience=20, mode="auto")]} # preparing data X_train, X_val, y_train, y_val = prepare_data(data, 'btc', n_lags, n_steps, lead_time, test_size) # Encoder cnn_lstm = Sequential() cnn_lstm.add(Conv1D(filters=64, kernel_size=3, activation='relu', input_shape=(X_train.shape[1], X_train.shape[2]))) cnn_lstm.add(Conv1D(filters=64, kernel_size=3, activation='relu')) cnn_lstm.add(MaxPooling1D(pool_size=2)) cnn_lstm.add(Flatten()) cnn_lstm.add(RepeatVector(n_steps)) # Decoder cnn_lstm.add(LSTM(200, activation='relu', return_sequences=True)) cnn_lstm.add(TimeDistributed(Dense(100, activation='relu'))) cnn_lstm.add(TimeDistributed(Dense(n_steps))) fit_model(cnn_lstm, time_distributed=True)
Extra Perks
For the sake of completion, we tweaked the code around a bit.
Do you remember the seemly significant correlation popped up in the 20-days lags? Well, increasing from 10 to 20 timesteps actually increases the R2 score in the last model:
Funnily enough, it increases even more if you use unnormalized data, making a stellar ~.94 score!
The last thing worth mentioning is the choice of the activation function. If you got the Warning below and wonder why, the Keras’ LSTM documentation provides an answer.
π WARNING: tensorflow:Layer lstm_70
will not use cuDNN kernels since it doesn’t meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
(No, I did not loaded 70 LSTM layers. I loaded around 210 π΅βπ«)
The documentation says:
“The requirements to use the cuDNN implementation are:
- activation == tanh
- recurrent_activation == sigmoid
- recurrent_dropout == 0
- unroll is False
- use_bias is True
- Inputs, if use masking, are strictly right-padded.
- Eager execution is enabled in the outermost context.”
Changing the activation to ‘tanh‘ is enough in our case to use cuDNN, and they are incredibly faster! However tanh fits poorly into our problem:
fit_model(cnn_lstm, time_distributed=True, learning_rate=1)
(You saw it right, the learning rate is 1000x larger than the default. Otherwise the loss curve does not even change.)
Main Takeaways
There are a few points we have to keep in mind about LSTM:
- The shape of their inputΒ
- What are time steps
- The shape of the layer’s output, especially when using
return_sequences
- Hyperparameters tunning is worth your time. For instance, the activation functions relu and tanh have their own pros and cons.
- There are different architectures to play with (and many more to come – we will deal with Attention blocks and Multi-headed networks soon). Consider using them. I’ve become specially inclined towards the Encoder-Decoders
Feel free to use and edit the code here.Β