Principle: Sktime PyTorch Forecasting TFT Model Instantiation
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Deep_Learning, Attention_Mechanisms |
| Last Updated | 2026-02-08 07:00 GMT |
Overview
Technique for instantiating the Temporal Fusion Transformer model with architecture parameters automatically inferred from dataset metadata and user-specified hyperparameters.
Description
The Temporal Fusion Transformer (TFT) is an attention-based architecture designed for multi-horizon time series forecasting with mixed inputs (static, known future, and observed covariates). The TFT uses variable selection networks to identify relevant features, gated residual networks for non-linear processing, temporal self-attention for long-range dependencies, and a multi-horizon quantile output for probabilistic forecasts. Model instantiation via `from_dataset` automatically configures embedding sizes, variable lists, and encoder length from the training dataset, ensuring consistency between data and model architecture.
Usage
Use this principle when building a multi-horizon demand forecasting model that requires interpretable attention weights and variable importance scores. TFT is the flagship model of pytorch-forecasting and is appropriate when: (1) you have multiple covariates of different types, (2) you need multi-step-ahead probabilistic forecasts, and (3) model interpretability (feature importance, temporal attention patterns) is valuable.
Theoretical Basis
The TFT architecture consists of:
1. Variable Selection Networks: gate irrelevant features with a GRN followed by a softmax over variables, v_t = softmax(GRN_v(Xi_t)), where Xi_t is the flattened embedding of all inputs at time t.
2. Gated Residual Networks (GRN): non-linear processing with gating and a skip connection, GRN(a) = LayerNorm(a + GLU(W_1 eta + b_1)) with eta = ELU(W_2 a + b_2).
3. Interpretable Multi-Head Attention: Self-attention over temporal dimension with shared value weights across heads for interpretability.
4. Quantile Output: Produces multiple quantiles (e.g., 0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98) per forecast horizon.
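The GRN building block above can be sketched in plain PyTorch. This is a simplified single-input variant (no context vector) under the stated formula, not the library's internal implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualNetwork(nn.Module):
    """Sketch of a GRN: GRN(a) = LayerNorm(a + GLU(W1*eta + b1)),
    eta = ELU(W2*a + b2). Simplified; omits the optional context input."""

    def __init__(self, d: int, dropout: float = 0.1):
        super().__init__()
        self.fc_in = nn.Linear(d, d)       # W2, b2
        self.fc_out = nn.Linear(d, d)      # W1, b1
        self.gate = nn.Linear(d, 2 * d)    # GLU projection (halved by glu)
        self.norm = nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        eta = F.elu(self.fc_in(a))
        eta = self.drop(self.fc_out(eta))
        gated = F.glu(self.gate(eta), dim=-1)  # gating suppresses the block
        return self.norm(a + gated)            # residual skip connection

grn = GatedResidualNetwork(d=8)
out = grn(torch.randn(4, 8))  # shape preserved: (4, 8)
```

The gating lets the network learn to bypass the non-linear path entirely, which is what allows variable selection to suppress irrelevant inputs.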
Key hyperparameters:
- hidden_size — main capacity control (8-512, typical: 16-64)
- attention_head_size — number of attention heads (typical: 4)
- dropout — regularization (typical: 0.1-0.3)
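The quantile output described above is trained with the pinball (quantile) loss, which the per-quantile forecasts minimize. A minimal NumPy sketch of that loss for a single quantile level:

```python
import numpy as np

def pinball_loss(y: np.ndarray, y_hat: np.ndarray, q: float) -> float:
    """Pinball loss for quantile level q: penalizes under-prediction
    with weight q and over-prediction with weight (1 - q)."""
    e = y - y_hat
    return float(np.mean(np.maximum(q * e, (q - 1) * e)))

# At q = 0.5 the pinball loss is half the mean absolute error
loss = pinball_loss(np.array([1.0, 3.0]), np.array([2.0, 2.0]), q=0.5)  # 0.5
```

Averaging this loss over all configured quantile levels (e.g., 0.02 through 0.98) yields the multi-quantile training objective.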