
Heuristic: sktime pytorch-forecasting Gradient Clipping Value

From Leeroopedia



Knowledge Sources
Domains Optimization, Deep_Learning, Time_Series
Last Updated 2026-02-08 08:00 GMT

Overview

Use `gradient_clip_val=0.1` in PyTorch Lightning Trainer to prevent gradient explosions in transformer and RNN-based forecasting models.

Description

Gradient clipping by norm is a critical training hyperparameter for all pytorch-forecasting models. The value 0.1 is used universally across every example, tutorial, test, and documentation page in the repository. It is set on the Lightning Trainer (not on the model itself) and caps the L2 norm of the gradients after backpropagation, immediately before the optimizer step. The hyperparameter tuning module searches within the range (0.01, 1.0) in practice, though its default search range is (0.01, 100.0).

Usage

Always apply this heuristic when configuring a Lightning Trainer for any pytorch-forecasting model. Without gradient clipping, attention-based models (TFT, TimeXer) and RNN-based models (DeepAR, RecurrentNetwork) are prone to gradient explosions and training divergence. The stallion tutorial notes explicitly: "clipping gradients is a hyperparameter and important to prevent divergence."
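As a sketch, a Trainer configured under this heuristic might look like the following. The `max_epochs` value is an illustrative placeholder, and `gradient_clip_algorithm="norm"` is Lightning's default, spelled out here only to make the clip-by-L2-norm behavior explicit:

```python
import lightning.pytorch as pl  # older installs: import pytorch_lightning as pl

# Trainer-level setting: clipping applies to any pytorch-forecasting model
# trained with this Trainer, regardless of architecture.
trainer = pl.Trainer(
    max_epochs=30,                   # illustrative placeholder
    accelerator="auto",
    gradient_clip_val=0.1,           # the canonical value
    gradient_clip_algorithm="norm",  # clip by L2 norm (Lightning's default)
)

# trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)
```

Because the setting lives on the Trainer, no per-model code changes are needed; swapping a TFT for a DeepAR keeps the same protection.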

The Insight (Rule of Thumb)

  • Action: Set `gradient_clip_val=0.1` in `pl.Trainer(...)`.
  • Value: 0.1 (canonical), practical range (0.01, 1.0).
  • Trade-off: Too low (< 0.01) slows convergence by clipping informative gradients. Too high (> 1.0) provides insufficient protection against gradient explosions.
  • Scope: All pytorch-forecasting models. This is a Trainer-level parameter, not a model parameter.
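The arithmetic behind norm clipping is simple enough to verify by hand. This dependency-free sketch mirrors what `clip_grad_norm_`-style clipping computes (the helper name is mine, not pytorch-forecasting's):

```python
import math

def clip_by_global_norm(grads, clip_val):
    """Scale a flat list of gradient values so their L2 norm is <= clip_val."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clip_val:
        return grads  # already within the threshold: leave untouched
    scale = clip_val / norm
    return [g * scale for g in grads]

# A gradient with L2 norm 5.0 is rescaled so its norm becomes exactly 0.1,
# preserving its direction while shrinking its magnitude.
clipped = clip_by_global_norm([3.0, 4.0], clip_val=0.1)
print(clipped)
```

Note that clipping rescales the whole gradient vector uniformly, so the update direction is unchanged; only the step magnitude is bounded.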

Reasoning

Transformer self-attention and RNN hidden state dynamics amplify gradients, especially with long encoder sequences. Without clipping, a single batch with extreme values can cause gradient norms to spike, producing NaN losses or divergent weights. The value 0.1 is aggressive enough to prevent this while still allowing meaningful gradient updates. The hyperparameter tuning function narrows the search from the default (0.01, 100.0) to (0.01, 1.0) in practice because values above 1.0 rarely improve training.

Code evidence from `examples/stallion.py:121`:

trainer = pl.Trainer(
    max_epochs=50,
    accelerator="auto",
    gradient_clip_val=0.1,
)

Consistent across all examples:

  • `examples/stallion.py:121` — `gradient_clip_val=0.1`
  • `examples/ar.py:74` — `gradient_clip_val=0.1`
  • `examples/nbeats.py:63` — `gradient_clip_val=0.1`
  • `README.md:131` — `gradient_clip_val=0.1`
  • `docs/source/getting-started.rst:117` — `gradient_clip_val=0.1`

Tuning range from `models/temporal_fusion_transformer/tuning.py:53`:

gradient_clip_val_range: tuple[float, float] = (0.01, 100.0)

Practical narrowing from `examples/stallion.py:178`:

gradient_clip_val_range=(0.01, 1.0)
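A hedged sketch of passing the narrowed range to the tuning helper; argument names and the exact signature of `optimize_hyperparameters` follow the stallion example and may differ across pytorch-forecasting versions, and the dataloaders are placeholders supplied by the caller:

```python
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import (
    optimize_hyperparameters,
)

def tune_tft(train_loader, val_loader):
    # Narrow the clip-value search from the (0.01, 100.0) default to the
    # practical range, since values above 1.0 rarely improve training.
    return optimize_hyperparameters(
        train_loader,
        val_loader,
        model_path="optuna_tft",  # illustrative checkpoint directory
        n_trials=20,              # illustrative budget
        gradient_clip_val_range=(0.01, 1.0),
    )
```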
