Principle:Sktime Pytorch forecasting Samformer Architecture

Knowledge Sources	SAMFormer: Unlocking the Potential of Transformers in Time Series Forecasting with Sharpness-Aware Minimization and Channel-Wise Attention; pytorch-forecasting
Domains	Time_Series, Forecasting, Deep_Learning, Transformer_Models, Optimization
Last Updated	2026-02-08 09:00 GMT

Overview

SAMFormer is a lightweight transformer for multivariate time series forecasting. It combines a single-layer channel-wise attention architecture with Reversible Instance Normalization (RevIN) and is trained with Sharpness-Aware Minimization (SAM) to reach accuracy competitive with much larger models.

Description

SAMFormer addresses the observation that simple linear models often outperform complex transformers on multivariate time series forecasting benchmarks. The paper attributes this to transformers converging to sharp minima that generalize poorly. SAMFormer resolves this by combining a minimal transformer architecture with SAM optimization, which explicitly seeks flat minima in the loss landscape.

Architecture:

The model operates channel-wise: each input variable (channel) is treated as one token. The full encoder context of length L serves directly as that token's representation, so the input is transposed such that channels form the sequence dimension and time steps form the feature dimension. Attention therefore mixes information across channels rather than across time steps.

Reversible Instance Normalization (RevIN): The input tensor is standardized per instance and then passed through learnable affine parameters. This removes per-instance shifts in mean and scale before the attention layer; the stored statistics are used to restore the original scale on the predictions.
Channel-wise attention: Three linear projections (queries, keys, values) map the transposed input from the time dimension into the attention space. Scaled dot-product attention is computed over the channel dimension, capturing inter-variable relationships.
Residual connection: The attention output is added back to the normalized input.
Linear forecaster: A single linear layer projects from the encoder length L to the prediction length H.

The resulting architecture has far fewer parameters than typical deep transformers, making it fast to train and resistant to overfitting on small datasets.
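To make the size claim concrete, the following sketch counts the learnable parameters of the four components listed above. It assumes that every linear layer carries a bias and that RevIN holds one scale and one shift per channel; the function name and the example sizes are illustrative and are not taken from the pytorch-forecasting implementation.

```python
def samformer_param_count(L: int, r: int, H: int, C: int,
                          bias: bool = True, revin_affine: bool = True) -> int:
    """Count learnable parameters under the assumed layer layout.

    L: encoder length, r: hidden size, H: prediction length, C: channels.
    """
    b = 1 if bias else 0
    qk = 2 * (L * r + b * r)               # query and key projections, L -> r
    v = L * L + b * L                      # value projection, L -> L
    fc = L * H + b * H                     # linear forecaster, L -> H
    revin = 2 * C if revin_affine else 0   # assumed: one (gamma, beta) per channel
    return qk + v + fc + revin


# Illustrative sizes only (assumptions, not defaults of any library).
n_params = samformer_param_count(L=96, r=512, H=24, C=7)
print(n_params)  # 110982 under these assumptions
```

Under these assumptions the count grows with L, r, and H but not with depth, because there is only one attention layer.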

SAM optimization: SAMFormer is designed to be trained with Sharpness-Aware Minimization, which performs a two-step gradient update. First, it perturbs the weights in the direction of steepest ascent within a neighborhood of radius $ρ$. Then it computes the gradient at the perturbed point and applies that gradient to the original (unperturbed) weights. This encourages convergence to flat minima that generalize better.

Usage

Use SAMFormer when: (1) a lightweight transformer is desired for multivariate forecasting, (2) the dataset is small or medium-sized and overfitting is a concern, (3) the data exhibits non-stationary distribution shifts that benefit from RevIN normalization. The model currently supports single-target forecasting in the v2 API. Pair it with a SAM-compatible optimizer for best results, though it can also be trained with standard Adam.

Theoretical Basis

Reversible Instance Normalization (RevIN):

$\hat{x} = \frac{x - μ}{σ + ϵ} \cdot γ + β$

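where $μ, σ$ are the per-instance mean and standard deviation, $ϵ$ is a small constant for numerical stability (unrelated to the SAM perturbation $\hat{ϵ}$ used later), and $γ, β$ are learnable affine parameters. On output, the transformation is inverted with the same stored statistics to restore the original scale.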
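The normalize and restore steps can be sketched in NumPy as follows. This is a minimal illustration of the formula above, not the library's RevIN module: the function names, the `(B, C, L)` layout, and the `(C, 1)` shape for $γ$ and $β$ are assumptions made for the example.

```python
import numpy as np


def revin_normalize(x, gamma, beta, eps=1e-5):
    """Standardize x of shape (B, C, L) over time, then apply the affine map."""
    mu = x.mean(axis=-1, keepdims=True)      # (B, C, 1) per-instance mean
    sigma = x.std(axis=-1, keepdims=True)    # (B, C, 1) per-instance std
    x_hat = (x - mu) / (sigma + eps) * gamma + beta
    return x_hat, (mu, sigma)


def revin_denormalize(y_hat, stats, gamma, beta, eps=1e-5):
    """Invert the affine map and the standardization using stored statistics."""
    mu, sigma = stats
    return (y_hat - beta) / gamma * (sigma + eps) + mu


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(2, 3, 16))   # (B, C, L)
gamma = np.ones((3, 1))    # assumed: one scale per channel
beta = np.zeros((3, 1))    # assumed: one shift per channel

x_hat, stats = revin_normalize(x, gamma, beta)
x_back = revin_denormalize(x_hat, stats, gamma, beta)
```

Because the statistics have a trailing axis of size one, the same `revin_denormalize` call also broadcasts over a forecast of length H.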

Channel-wise scaled dot-product attention:

Given input $X \in ℝ^{B \times C \times L}$ (batch, channels, time), with $r$ denoting the hidden size:

$Q = X W_{Q}, K = X W_{K}, V = X W_{V}$

where $W_{Q}, W_{K} \in ℝ^{L \times r}$ and $W_{V} \in ℝ^{L \times L}$.

$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{Q K^{⊤}}{\sqrt{r}}\right) V$
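A NumPy sketch of this attention step, with shapes annotated, is shown below. It follows the dimensions stated above and omits biases and dropout; the helper names and the random weights are placeholders for illustration only.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def channel_attention(x, w_q, w_k, w_v):
    """x: (B, C, L); w_q, w_k: (L, r); w_v: (L, L). Returns (output, weights)."""
    q = x @ w_q                                      # (B, C, r)
    k = x @ w_k                                      # (B, C, r)
    v = x @ w_v                                      # (B, C, L)
    r = w_q.shape[1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(r)   # (B, C, C): channel vs. channel
    weights = softmax(scores, axis=-1)               # each row sums to 1
    return weights @ v, weights                      # output: (B, C, L)


rng = np.random.default_rng(0)
B, C, L, r = 2, 3, 8, 4
x = rng.normal(size=(B, C, L))
w_q = rng.normal(size=(L, r))
w_k = rng.normal(size=(L, r))
w_v = rng.normal(size=(L, L))

attn_out, attn_weights = channel_attention(x, w_q, w_k, w_v)
```

The attention matrix has shape `(B, C, C)`, so its cost scales with the number of channels and not with the sequence length.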

Residual forecasting:

$\hat{X} = X + \operatorname{Attention}(Q, K, V)$

$Y = \hat{X} W_{fc}, \quad W_{fc} \in ℝ^{L \times H}$

where $H$ is the prediction horizon. The target channel's output is extracted as the final prediction.
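The residual and projection steps reduce to one addition and one matrix product. The sketch below assumes the attention output is already available with shape `(B, C, L)`; `residual_forecast` and `target_idx` are hypothetical names used only for this example, and the forecaster bias is omitted.

```python
import numpy as np


def residual_forecast(x, attn_out, w_fc, target_idx):
    """x, attn_out: (B, C, L); w_fc: (L, H). Returns the target forecast (B, H)."""
    x_res = x + attn_out           # residual connection, (B, C, L)
    y = x_res @ w_fc               # project time axis L -> H, (B, C, H)
    return y[:, target_idx, :]     # keep only the target channel


rng = np.random.default_rng(0)
B, C, L, H = 2, 3, 8, 4
x = rng.normal(size=(B, C, L))
attn_out = rng.normal(size=(B, C, L))   # stand-in for the attention output
w_fc = rng.normal(size=(L, H))

y_target = residual_forecast(x, attn_out, w_fc, target_idx=0)
```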

Sharpness-Aware Minimization (SAM):

$\hat{ϵ} = ρ \frac{\nabla_{w} ℒ (w)}{‖ \nabla_{w} ℒ (w) ‖}$

$w_{t + 1} = w_{t} - η \nabla_{w} ℒ (w_{t} + \hat{ϵ})$

Here $η$ is the learning rate and $ρ$ is the radius of the perturbation neighborhood. SAM seeks parameters that lower not just the loss but also the sharpness of the loss landscape around them, which tends to improve generalization.
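The two equations map onto a short update rule. The sketch below applies it with plain gradient descent to a toy quadratic loss; `sam_step`, the loss, and the values of `lr` and `rho` are illustrative assumptions and do not reflect the optimizer used in the paper's experiments.

```python
import numpy as np


def sam_step(w, grad_fn, lr=0.05, rho=0.05, tiny=1e-12):
    """One SAM update: ascend to w + eps_hat, then descend from w with that gradient."""
    g = grad_fn(w)
    eps_hat = rho * g / (np.linalg.norm(g) + tiny)   # step of length rho along the gradient
    g_sam = grad_fn(w + eps_hat)                     # gradient at the perturbed point
    return w - lr * g_sam                            # update applied to the original weights


# Toy loss L(w) = 0.5 * w^T A w with one sharp and one flat direction.
A = np.diag([10.0, 1.0])


def loss(w):
    return 0.5 * w @ A @ w


def grad(w):
    return A @ w


w0 = np.array([1.0, 1.0])
w = w0.copy()
for _ in range(50):
    w = sam_step(w, grad)
```

Setting `rho=0.0` makes the perturbation vanish, so the step reduces to ordinary gradient descent.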

Key hyperparameters:

hidden_size (r) -- attention embedding dimension (typical: 512)
use_revin -- enable/disable RevIN normalization
persistence_weight -- blending weight for a naive persistence term (default: 0.0)

Related Pages

Implemented By

Implementation:Sktime_Pytorch_forecasting_Samformer
