Principle:Sktime Pytorch forecasting TFT V2 Architecture

Knowledge Sources	Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting; pytorch-forecasting
Domains	Time_Series, Forecasting, Deep_Learning, Attention_Mechanisms, Transformer_Models
Last Updated	2026-02-08 09:00 GMT

Overview

TFT V2 is the v2 API implementation of the Temporal Fusion Transformer (TFT), built on the BaseModel pipeline. It combines variable selection networks, encoder-decoder LSTM sequence processing, multi-head self-attention over the combined temporal sequence, and static context integration for multi-horizon forecasting.

Description

This is an experimental re-implementation of the Temporal Fusion Transformer designed for the v2 data pipeline of pytorch-forecasting. Unlike the original v1 TFT (which inherits from BaseModelWithCovariates and uses the TimeSeriesDataSet), the v2 TFT inherits from BaseModel and consumes metadata dictionaries describing encoder/decoder continuous and categorical feature dimensions, static features, and sequence lengths.

Architecture components:

1. Variable Selection Networks: Separate variable selection modules for encoder and decoder inputs. Each consists of a two-layer MLP (Linear-ReLU-Linear-Sigmoid) that produces element-wise gating weights. These weights are multiplied with the raw input features to suppress irrelevant variables before further processing.

2. Static Context Processing: When static features (categorical or continuous) are present, a linear layer projects them into the hidden space. The resulting static context vector is broadcast and added to both encoder and decoder LSTM outputs, enabling the model to condition temporal processing on time-invariant information.

3. Encoder-Decoder LSTM: A multi-layer LSTM encodes the historical (encoder) input. Its final hidden and cell states initialize a separate multi-layer LSTM decoder that processes the future (decoder) inputs. This sequence-to-sequence design provides temporal abstraction of the input and bridges the encoder and decoder windows.

4. Multi-Head Self-Attention: The encoder and decoder LSTM outputs are concatenated along the time axis into a single temporal sequence. Multi-head self-attention (PyTorch native nn.MultiheadAttention) is applied over this combined sequence, allowing each decoder time step to attend to the full encoder history and all decoder positions. When static context is available, it is added to the query input to enrich the attention computation.

5. Output Layers: The attended decoder portion is passed through a ReLU-activated pre-output linear layer and then a final linear projection to produce predictions of dimension output_size per forecast step.

Usage

Use TFT V2 when working with the v2 data pipeline (BaseModel / metadata-based configuration) and when the forecasting task involves: (1) mixed covariate types (continuous and categorical, both encoder and decoder), (2) static features that condition the forecast, (3) multi-horizon predictions where attention over the full temporal context is beneficial. This implementation is marked experimental; for production usage with the v1 TimeSeriesDataSet pipeline, refer to the original TFT implementation.

Theoretical Basis

Variable selection gating:

$w = \sigma\left(W_{2}\,\mathrm{ReLU}(W_{1} \xi + b_{1}) + b_{2}\right)$

$\tilde{\xi} = w \odot \xi$

where $\xi$ is the concatenated input features and $w$ is the learned per-variable gate.
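The gating equations above can be sketched in PyTorch as follows. This is a minimal illustration, not the library's code: the class name `GatingSketch`, the tensor shapes, and the choice of `hidden_size` as the width of the inner layer are assumptions made for the example.

```python
import torch
from torch import nn


class GatingSketch(nn.Module):
    """Hypothetical stand-in for the Linear-ReLU-Linear-Sigmoid gate."""

    def __init__(self, n_features: int, hidden_size: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden_size),  # W_1, b_1
            nn.ReLU(),
            nn.Linear(hidden_size, n_features),  # W_2, b_2
            nn.Sigmoid(),                        # squashes each gate into [0, 1]
        )

    def forward(self, xi: torch.Tensor):
        w = self.net(xi)   # one gate value per input feature and time step
        return w * xi, w   # element-wise product, i.e. w ⊙ xi


torch.manual_seed(0)
xi = torch.randn(8, 24, 5)  # assumed layout: (batch, time, features)
gate = GatingSketch(n_features=5)
selected, w = gate(xi)
```

Because every gate value lies between 0 and 1, the gated features can only be shrunk relative to the raw input, never amplified.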

Encoder-decoder LSTM:

encoder_output, (h_n, c_n) = LSTM_encoder(selected_encoder_input)
decoder_output, _           = LSTM_decoder(selected_decoder_input, (h_n, c_n))
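A runnable version of the two lines above, using `nn.LSTM` directly, might look like the sketch below. The feature counts, batch size, and sequence lengths are assumed values for illustration; only `hidden_size`, `num_layers`, and `dropout` follow the defaults listed on this page.

```python
import torch
from torch import nn

hidden_size, num_layers = 64, 2
enc_features, dec_features = 5, 3  # assumed feature counts

lstm_encoder = nn.LSTM(enc_features, hidden_size, num_layers,
                       batch_first=True, dropout=0.1)
lstm_decoder = nn.LSTM(dec_features, hidden_size, num_layers,
                       batch_first=True, dropout=0.1)

torch.manual_seed(0)
selected_encoder_input = torch.randn(8, 24, enc_features)  # (batch, T, features)
selected_decoder_input = torch.randn(8, 6, dec_features)   # (batch, H, features)

# The encoder's final hidden and cell states seed the decoder.
encoder_output, (h_n, c_n) = lstm_encoder(selected_encoder_input)
decoder_output, _ = lstm_decoder(selected_decoder_input, (h_n, c_n))
```

The state handoff only works when both LSTMs share `hidden_size` and `num_layers`, since `h_n` and `c_n` have shape `(num_layers, batch, hidden_size)` regardless of `batch_first`. The two LSTMs may still take different numbers of input features.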

Static context enrichment:

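${\tilde{e}}_{t} = e_{t} + c_{s}, \quad {\tilde{d}}_{t} = d_{t} + c_{s}$

where $c_{s} = W_{s} [s_{\mathrm{cat}}; s_{\mathrm{cont}}]$ is the projected static context, with $s_{\mathrm{cat}}$ and $s_{\mathrm{cont}}$ denoting the static categorical and continuous features (static_cat and static_cont), and $e_{t}$, $d_{t}$ the encoder and decoder LSTM outputs.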

Temporal self-attention:

$S = [{\tilde{e}}_{1}, \dots, {\tilde{e}}_{T}, {\tilde{d}}_{1}, \dots, {\tilde{d}}_{H}]$

$A = \mathrm{MultiHeadAttention}(Q = S + c_{s},\ K = S,\ V = S)$
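The static enrichment and attention steps can be traced with `nn.MultiheadAttention` as in the sketch below. The random tensors stand in for real LSTM outputs and static features, and `n_static`, the batch size, and the sequence lengths are assumptions for the example.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, T, H, hidden_size, n_static = 8, 24, 6, 64, 4  # assumed sizes

static_proj = nn.Linear(n_static, hidden_size)  # plays the role of W_s
attention = nn.MultiheadAttention(hidden_size, num_heads=4,
                                  dropout=0.1, batch_first=True)
attention.eval()  # disable attention dropout so the weights are deterministic

e = torch.randn(batch, T, hidden_size)  # stand-in for encoder LSTM output
d = torch.randn(batch, H, hidden_size)  # stand-in for decoder LSTM output
static = torch.randn(batch, n_static)   # stand-in for [static_cat; static_cont]

c_s = static_proj(static).unsqueeze(1)    # (batch, 1, hidden), broadcast over time
S = torch.cat([e + c_s, d + c_s], dim=1)  # (batch, T + H, hidden)

# Query carries the extra static term; keys and values are the sequence itself.
A, attn_weights = attention(query=S + c_s, key=S, value=S)
decoder_part = A[:, -H:, :]  # last H positions correspond to forecast steps
```

No attention mask is passed here, so each position, including every decoder step, attends to all `T + H` positions, which matches the description of attention over the combined sequence.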

Prediction output:

${\hat{y}}_{h} = W_{\mathrm{out}}\,\mathrm{ReLU}(W_{\mathrm{pre}} A_{T + h}) \quad \text{for } h = 1, \dots, H$
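As a shape check on the prediction formula, the sketch below applies the two output layers to the decoder slice of an attention output. The layer names and the hidden-to-hidden width of the pre-output layer are assumptions; the source only states that a ReLU-activated linear layer precedes the final projection.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, T, H, hidden_size, output_size = 8, 24, 6, 64, 1  # assumed sizes

pre_output = nn.Linear(hidden_size, hidden_size)    # W_pre (width is assumed)
output_layer = nn.Linear(hidden_size, output_size)  # W_out

A = torch.randn(batch, T + H, hidden_size)  # stand-in for the attention output

# Keep only positions T+1 .. T+H, then apply Linear -> ReLU -> Linear.
y_hat = output_layer(torch.relu(pre_output(A[:, T:, :])))
```

The result has one vector of size `output_size` per forecast step, so `y_hat` has shape `(batch, H, output_size)`.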

Key hyperparameters:

hidden_size -- LSTM and attention hidden dimension (default: 64)
num_layers -- number of LSTM layers (default: 2)
attention_head_size -- number of attention heads (default: 4)
dropout -- regularization rate (default: 0.1)
output_size -- prediction output dimension per time step (default: 1)
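These values constrain one another. In particular, `nn.MultiheadAttention` requires its embedding dimension to be divisible by the number of heads, so `hidden_size` must be a multiple of `attention_head_size` if that parameter is passed through as the head count, as the description above suggests. The helper below is a hypothetical pre-flight check, not part of pytorch-forecasting.

```python
def check_tft_v2_hparams(hidden_size=64, num_layers=2, attention_head_size=4,
                         dropout=0.1, output_size=1):
    """Hypothetical helper: return a list of problems, empty if values are compatible."""
    problems = []
    if attention_head_size < 1 or hidden_size % attention_head_size != 0:
        # nn.MultiheadAttention splits the hidden dimension evenly across heads.
        problems.append("hidden_size must be divisible by attention_head_size")
    if num_layers < 1:
        problems.append("num_layers must be at least 1")
    if not 0.0 <= dropout <= 1.0:
        problems.append("dropout must lie in [0, 1]")
    if output_size < 1:
        problems.append("output_size must be at least 1")
    return problems


default_problems = check_tft_v2_hparams()
bad_problems = check_tft_v2_hparams(hidden_size=66)  # 66 is not a multiple of 4
```

With the defaults listed above, 64 / 4 gives 16 dimensions per head, so the check returns no problems.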

Related Pages

Implemented By

Implementation:Sktime_Pytorch_forecasting_TFT_V2
