Principle:Sktime Pytorch forecasting Samformer Architecture
| Knowledge Sources | |
|---|---|
| Domains | Time_Series, Forecasting, Deep_Learning, Transformer_Models, Optimization |
| Last Updated | 2026-02-08 09:00 GMT |
Overview
SAMFormer is a lightweight transformer for multivariate time series forecasting that uses channel-wise attention. It combines Sharpness-Aware Minimization (SAM) and Reversible Instance Normalization (RevIN) to reach competitive accuracy with a simple single-layer attention architecture.
Description
SAMFormer addresses the observation that simple linear models often outperform complex transformers on multivariate time series forecasting benchmarks. The paper attributes this to transformers converging to sharp minima that generalize poorly. SAMFormer resolves this by combining a minimal transformer architecture with SAM optimization, which explicitly seeks flat minima in the loss landscape.
Architecture:
The model operates channel-wise: each input variable (channel) becomes one token. The input is transposed so that channels form the sequence dimension and time steps form the feature dimension, so each token is that channel's full encoder context of length L.
- Reversible Instance Normalization (RevIN): The input tensor is normalized per instance and then rescaled with learnable affine parameters. This removes non-stationary distribution shifts during the forward pass; the original statistics are restored in the prediction output.
- Channel-wise attention: Three linear projections (queries, keys, values) map the transposed input from the time dimension into the attention space. Scaled dot-product attention is computed over the channel dimension, capturing inter-variable relationships.
- Residual connection: The attention output is added back to the normalized input.
- Linear forecaster: A single linear layer projects from the encoder length L to the prediction length H.
The resulting architecture has far fewer parameters than typical deep transformers, making it fast to train and resistant to overfitting on small datasets.
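The components listed above can be sketched in PyTorch as follows. This is a minimal illustration, not the library's implementation: the class name `TinySAMFormerSketch` and its attribute names are hypothetical, RevIN is left out for brevity, and the value projection is assumed to keep length L so that the residual sum is shape-compatible.

```python
import torch
import torch.nn as nn


class TinySAMFormerSketch(nn.Module):
    """Illustrative sketch only; names are hypothetical, not library classes."""

    def __init__(self, context_length: int, prediction_length: int, hidden_size: int = 16):
        super().__init__()
        # Query/key projections map each channel token from length L to hidden size r.
        self.to_q = nn.Linear(context_length, hidden_size)
        self.to_k = nn.Linear(context_length, hidden_size)
        # Assumption: values keep length L so they can be added to the input.
        self.to_v = nn.Linear(context_length, context_length)
        # Linear forecaster: encoder length L -> prediction length H.
        self.forecaster = nn.Linear(context_length, prediction_length)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels) -> tokens: (batch, channels, time)
        tokens = x.transpose(1, 2)
        q, k, v = self.to_q(tokens), self.to_k(tokens), self.to_v(tokens)
        # Scaled dot-product scores over channels: (batch, channels, channels).
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        attn_out = torch.softmax(scores, dim=-1) @ v
        hidden = tokens + attn_out          # residual connection
        out = self.forecaster(hidden)       # (batch, channels, H)
        return out.transpose(1, 2)          # (batch, H, channels)


model = TinySAMFormerSketch(context_length=24, prediction_length=6, hidden_size=8)
forecast = model(torch.randn(4, 24, 3))    # shape: (4, 6, 3)
```

Because the attention matrix is only channels by channels, its cost depends on the number of variables rather than on the context length.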
SAM optimization: SAMFormer is designed to be trained with Sharpness-Aware Minimization, which performs a two-step gradient update. First, it perturbs the weights in the direction of steepest ascent within a neighborhood of radius $\rho$. Then, it computes the gradient at the perturbed point and applies that gradient to the original weights. This encourages convergence to flat minima that generalize better.
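The two-step update can be sketched as a single training step wrapped around an ordinary PyTorch optimizer. The helper name `sam_step` and the default `rho=0.05` are assumptions made for illustration; this is not the API of any particular SAM package.

```python
import torch
import torch.nn as nn


def sam_step(model, loss_fn, x, y, base_optimizer, rho=0.05):
    """One SAM update (hypothetical helper, illustrative only)."""
    # Step 1: gradient at the current weights.
    base_optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()

    params = [p for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([p.grad.norm(2) for p in params]), 2)
    scale = rho / (grad_norm + 1e-12)

    # Move the weights along the normalized gradient (ascent direction).
    perturbations = []
    with torch.no_grad():
        for p in params:
            e = p.grad * scale
            p.add_(e)
            perturbations.append(e)

    # Step 2: gradient at the perturbed weights.
    base_optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # Restore the original weights, then update them with the new gradient.
    with torch.no_grad():
        for p, e in zip(params, perturbations):
            p.sub_(e)
    base_optimizer.step()
    return loss.item()
```

Each step costs two forward and backward passes, which is the main overhead of SAM compared with a plain optimizer step.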
Usage
Use SAMFormer when: (1) a lightweight transformer is desired for multivariate forecasting, (2) the dataset is small or medium-sized and overfitting is a concern, (3) the data exhibits non-stationary distribution shifts that benefit from RevIN normalization. The model currently supports single-target forecasting in the v2 API. Pair it with a SAM-compatible optimizer for best results, though it can also be trained with standard Adam.
Theoretical Basis
Reversible Instance Normalization (RevIN):
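$$\hat{x} = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu$ and $\sigma^2$ are per-instance statistics (mean and variance), $\gamma$ and $\beta$ are learnable affine parameters, and $\epsilon$ is a small constant for numerical stability. During output, the normalization is reversed to restore the original scale.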
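A minimal sketch of the normalize and denormalize pair, assuming input of shape (batch, time, channels) and statistics taken over the time axis. The class name `RevINSketch` and its method names are hypothetical, and the inverse assumes the learned scale stays nonzero.

```python
import torch
import torch.nn as nn


class RevINSketch(nn.Module):
    """Illustrative reversible instance normalization (hypothetical names)."""

    def __init__(self, num_channels: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(num_channels))   # learnable scale
        self.beta = nn.Parameter(torch.zeros(num_channels))   # learnable shift

    def normalize(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels); statistics per instance and channel.
        self.mean = x.mean(dim=1, keepdim=True).detach()
        var = x.var(dim=1, keepdim=True, unbiased=False)
        self.std = torch.sqrt(var + self.eps).detach()
        return (x - self.mean) / self.std * self.gamma + self.beta

    def denormalize(self, y: torch.Tensor) -> torch.Tensor:
        # Undo the affine map, then restore the stored statistics.
        y = (y - self.beta) / self.gamma
        return y * self.std + self.mean


revin = RevINSketch(num_channels=3)
x = torch.randn(2, 24, 3) * 5 + 10
z = revin.normalize(x)            # roughly zero mean per instance and channel
x_back = revin.denormalize(z)     # approximately recovers x
```

In a forecaster, `denormalize` would be applied to the model output rather than to `z` itself, so that predictions return to the scale of the original series.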
Channel-wise scaled dot-product attention:
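Given input $X \in \mathbb{R}^{B \times D \times L}$ (batch, channels, time), with $r$ denoting the hidden size:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V, \qquad \mathrm{Attn}(X) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{r}}\right) V$$
where $W_Q, W_K \in \mathbb{R}^{L \times r}$ and $W_V \in \mathbb{R}^{L \times L}$. The value projection keeps length $L$ so that the attention output can be added back to the input.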
Residual forecasting:
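$$\hat{Y} = \left(X + \mathrm{Attn}(X)\right) W, \quad W \in \mathbb{R}^{L \times H}$$
where $X$ is the normalized input, $\mathrm{Attn}(X)$ is the channel-wise attention output, and $H$ is the prediction horizon. The target channel's output is extracted as the final prediction.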
Sharpness-Aware Minimization (SAM):
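$$\min_{\theta} \; \max_{\|\epsilon\|_2 \le \rho} \mathcal{L}(\theta + \epsilon)$$
where $\theta$ are the model parameters, $\mathcal{L}$ is the training loss, and $\rho$ is the neighborhood radius. SAM finds parameters that minimize not just the loss but also the sharpness of the loss landscape, leading to improved generalization.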
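In practice the inner maximization over the neighborhood of radius $\rho$ is not solved exactly. The standard approach, sketched here with $\theta$ for the parameters and $\mathcal{L}$ for the training loss, uses a first-order Taylor expansion, which gives a closed-form perturbation and a two-pass update:

```latex
\hat{\epsilon}(\theta)
  = \arg\max_{\|\epsilon\|_2 \le \rho} \; \epsilon^\top \nabla_\theta \mathcal{L}(\theta)
  = \rho \, \frac{\nabla_\theta \mathcal{L}(\theta)}{\|\nabla_\theta \mathcal{L}(\theta)\|_2},
\qquad
\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)\Big|_{\theta + \hat{\epsilon}(\theta)}
```

Here $\eta$ is the learning rate of the base optimizer, a symbol introduced only for this illustration.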
Key hyperparameters:
- hidden_size (r) -- attention embedding dimension (typical: 512)
- use_revin -- enable/disable RevIN normalization
- persistence_weight -- blending weight for a naive persistence term (default: 0.0)