Principle:Sktime Pytorch forecasting xLSTM Architecture
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning, Recurrent_Neural_Networks |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
Extended LSTM (xLSTM) architecture for long-term time series forecasting, combining two novel LSTM variants -- mLSTM (matrix LSTM with matrix memory and exponential gating) and sLSTM (stabilized LSTM with normalized exponential gating) -- with series decomposition and normalization for robust multi-horizon predictions.
Description
xLSTMTime extends the classical LSTM by introducing two complementary cell designs that address the limited memory capacity and gradient stability issues of standard LSTMs:
mLSTM (Matrix LSTM): Replaces the scalar cell state with a matrix memory mechanism. The cell computes query, key, and value vectors from the input, forms an outer-product key-value interaction matrix, and accumulates this into the cell state via exponential gating. The hidden state is derived by multiplying the output gate with the tanh of the cell state scaled by a normalized accumulator. This gives the cell a richer, higher-capacity memory than the scalar state of a standard LSTM. The mLSTM cell maintains three states: hidden state $h_t$, cell state $C_t$, and normalizer state $n_t$.
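To make the matrix-memory idea concrete, here is a minimal pure-Python sketch of a single mLSTM step. It follows the recurrence of the original xLSTM formulation, where the readout is divided by max(|n . q|, 1), and it is not the tanh-based readout described above. The function name `mlstm_step`, the scalar input and forget gates, and the assumption that the gates arrive already activated are all illustrative choices, not details of the implementation.

```python
def mlstm_step(C, n, q, k, v, i_gate, f_gate, o_gate):
    """One mLSTM step with scalar input/forget gates and a vector output gate.

    Illustrative only: gates are assumed to be already activated.
    """
    d = len(k)
    # Matrix memory: decay the old memory and add the outer product v k^T.
    C_new = [
        [f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d)]
        for r in range(d)
    ]
    # Normalizer: the same gated accumulation applied to the keys alone.
    n_new = [f_gate * n[c] + i_gate * k[c] for c in range(d)]
    # Readout: query the memory, then divide by max(|n . q|, 1).
    denom = max(abs(sum(nc * qc for nc, qc in zip(n_new, q))), 1.0)
    h = [
        o_gate[r] * sum(C_new[r][c] * q[c] for c in range(d)) / denom
        for r in range(d)
    ]
    return h, C_new, n_new


zeros = [[0.0, 0.0], [0.0, 0.0]]
h, C, n = mlstm_step(
    zeros, [0.0, 0.0],
    q=[1.0, 0.0], k=[1.0, 0.0], v=[2.0, 3.0],
    i_gate=1.0, f_gate=1.0, o_gate=[1.0, 1.0],
)
# h == [2.0, 3.0]: querying with the stored key retrieves the stored value.
```

The example stores one key-value pair in an empty memory and reads it back with a matching query, which is the sense in which the matrix state acts as a higher-capacity associative memory.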
sLSTM (Stabilized LSTM): Retains the scalar memory structure of a traditional LSTM but introduces normalized exponential gating for the input and forget gates. Instead of sigmoid activations, each gate uses a centered exponential function followed by normalization so that the input and forget gates sum to a stable partition. This prevents the vanishing/exploding gate problem while maintaining the simplicity of scalar-memory LSTMs. Layer normalization is applied throughout the cell for training stability.
xLSTMTime model architecture:
- Series decomposition: The input encoder sequence is decomposed into trend and seasonal components using a moving-average kernel of configurable size.
- Concatenation and projection: Trend and seasonal components are concatenated along the feature dimension and linearly projected to the hidden size.
- Batch normalization: Applied across the hidden dimension for stable training.
- Recurrent processing: The projected sequence is fed through a stacked mLSTM or sLSTM network (selectable via the xlstm_type parameter). Each network consists of multiple recurrent layers with residual connections, layer normalization, and dropout.
- Output projection: The final hidden state is linearly projected to the forecast horizon, followed by instance normalization.
Both the mLSTM and sLSTM variants are organized in a three-tier hierarchy: Cell (single time-step computation), Layer (multiple stacked cells over a sequence), and Network (layers plus a fully connected output head).
Usage
Use xLSTMTime when a recurrent architecture is preferred over a transformer for time series forecasting, particularly when: (1) the series exhibits strong sequential dependencies that benefit from recurrent inductive biases, (2) computational resources are limited relative to the sequence length (recurrent models avoid the quadratic cost of self-attention), and (3) the time series contains both trend and seasonal components. Choose mlstm for tasks that need higher memory capacity and slstm for simpler series where training stability matters most.
Theoretical Basis
mLSTM cell update equations:
where $\otimes$ denotes the outer product, $\odot$ denotes element-wise multiplication, and $n_t$ is the normalizer state that stabilizes the memory readout.
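For reference, the mLSTM recurrence as given in the original xLSTM formulation can be written as below. This is offered as a point of comparison rather than a transcription of this implementation, which (per the Description) applies a tanh in the readout and may differ in other details. Here $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, and $d$ is the key dimension.

$$
\begin{aligned}
q_t &= W_q x_t + b_q, \qquad k_t = \tfrac{1}{\sqrt{d}} W_k x_t + b_k, \qquad v_t = W_v x_t + b_v \\
C_t &= f_t \, C_{t-1} + i_t \, (v_t \otimes k_t) \\
n_t &= f_t \, n_{t-1} + i_t \, k_t \\
h_t &= o_t \odot \frac{C_t \, q_t}{\max\left(\lvert n_t^{\top} q_t \rvert,\, 1\right)}
\end{aligned}
$$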
sLSTM cell update equations:
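Read literally, the Description suggests an update of the following shape. This is a sketch under assumptions: the tanh candidate $z_t$ and the tanh readout are taken from the standard LSTM, and the pre-activations $\tilde{z}_t$, $\tilde{i}_t$, $\tilde{f}_t$, $\tilde{o}_t$ are assumed to be affine functions of $x_t$ and $h_{t-1}$. The gates $i_t$ and $f_t$ are the normalized exponential gates rather than sigmoids.

$$
\begin{aligned}
z_t &= \tanh(\tilde{z}_t), \qquad o_t = \sigma(\tilde{o}_t) \\
c_t &= f_t \odot c_{t-1} + i_t \odot z_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$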
Normalized exponential gating:
where the centering and clamping of pre-gate values (within [-5, 5]) ensures numerical stability.
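The gating step can be sketched in a few lines of plain Python. Each pre-gate vector is centered, clamped to [-5, 5], and exponentiated, and the two results are then divided by their sum so that the input and forget gates add up to roughly one per unit. Centering by the mean over the hidden dimension, the small `eps` term, and the function name `normalized_exp_gates` are assumptions made for illustration.

```python
import math


def normalized_exp_gates(i_pre, f_pre, clamp=5.0, eps=1e-8):
    """Return normalized (input, forget) gates for one hidden vector."""

    def centered_exp(pre):
        # Assumption: center by the mean over the hidden dimension.
        mean = sum(pre) / len(pre)
        # Clamp to [-clamp, clamp] before exponentiating to avoid overflow.
        return [math.exp(max(-clamp, min(clamp, p - mean))) for p in pre]

    i_exp, f_exp = centered_exp(i_pre), centered_exp(f_pre)
    # Normalize so that each (input, forget) pair sums to about one.
    totals = [a + b + eps for a, b in zip(i_exp, f_exp)]
    i_gate = [a / s for a, s in zip(i_exp, totals)]
    f_gate = [b / s for b, s in zip(f_exp, totals)]
    return i_gate, f_gate


# Extreme pre-gate values stay finite because of the clamp.
i_gate, f_gate = normalized_exp_gates([100.0, -100.0], [0.0, 0.0])
```

Even with pre-gate values of plus or minus 100, every gate stays strictly between 0 and 1, which is the stability property the clamp and the normalization are meant to provide.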
Series decomposition:
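A common way to realize this step is to take the trend as a moving average of the padded input and the seasonal part as the residual, $x - \text{trend}$. The sketch below assumes that scheme. The helper name `decompose`, the edge-replication padding, and the restriction to odd kernel sizes are assumptions for illustration, not confirmed details of the implementation.

```python
def decompose(series, kernel_size=25):
    """Split a 1-D series into (seasonal, trend) with a moving average."""
    if not series:
        raise ValueError("series must not be empty")
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("this sketch expects a positive odd kernel_size")
    half = (kernel_size - 1) // 2
    # Replicate the edge values so the trend keeps the input length.
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [
        sum(padded[t : t + kernel_size]) / kernel_size
        for t in range(len(series))
    ]
    # The seasonal component is whatever the moving average leaves behind.
    seasonal = [x - m for x, m in zip(series, trend)]
    return seasonal, trend


seasonal, trend = decompose([1.0, 2.0, 3.0, 4.0, 5.0], kernel_size=3)
```

By construction the two components add back up to the original series, so the decomposition loses no information before the concatenation and projection steps.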
Overall xLSTMTime forward pass:
seasonal, trend = SeriesDecomposition(encoder_input)
x = Linear(concat(trend, seasonal))
x = BatchNorm(x)
output, hidden = xLSTM_Network(x) # mLSTM or sLSTM
prediction = InstanceNorm(Linear(output))
Key hyperparameters:
- xlstm_type -- "slstm" or "mlstm" variant selection
- hidden_size -- recurrent hidden dimension
- num_layers -- number of stacked recurrent layers
- decomposition_kernel -- moving-average kernel size (default: 25)
- dropout -- recurrent dropout rate (default: 0.1)
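One way to keep these settings together and catch invalid values early is a small configuration object. The class `XLSTMTimeConfig` below is hypothetical and not part of the library. Only the defaults for decomposition_kernel and dropout come from the list above; the other default values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XLSTMTimeConfig:
    """Hypothetical container for the hyperparameters listed above."""

    xlstm_type: str = "slstm"       # "slstm" or "mlstm"; default is a placeholder
    hidden_size: int = 64           # placeholder value
    num_layers: int = 2             # placeholder value
    decomposition_kernel: int = 25  # default stated in the list above
    dropout: float = 0.1            # default stated in the list above

    def __post_init__(self):
        # Reject values that could not describe a valid model.
        if self.xlstm_type not in ("slstm", "mlstm"):
            raise ValueError(f"unknown xlstm_type: {self.xlstm_type!r}")
        if self.hidden_size < 1 or self.num_layers < 1:
            raise ValueError("hidden_size and num_layers must be positive")
        if self.decomposition_kernel < 1:
            raise ValueError("decomposition_kernel must be positive")
        if not 0.0 <= self.dropout < 1.0:
            raise ValueError("dropout must be in [0, 1)")


config = XLSTMTimeConfig(xlstm_type="mlstm", hidden_size=128)
```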
Related Pages
Implemented By
- Implementation:Sktime_Pytorch_forecasting_xLSTMTime
- Implementation:Sktime_Pytorch_forecasting_mLSTMCell
- Implementation:Sktime_Pytorch_forecasting_mLSTMLayer
- Implementation:Sktime_Pytorch_forecasting_mLSTMNetwork
- Implementation:Sktime_Pytorch_forecasting_sLSTMCell
- Implementation:Sktime_Pytorch_forecasting_sLSTMLayer
- Implementation:Sktime_Pytorch_forecasting_sLSTMNetwork