Heuristic:Speechbrain Speechbrain Gradient Clipping Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training_Stability |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Per-optimizer-group gradient clipping with `max_grad_norm=5.0` by default, tightened to 1.0 for TTS models such as Tacotron2.
Description
SpeechBrain applies gradient clipping via `torch.nn.utils.clip_grad_norm_` to prevent gradient explosion during training. A critical implementation detail is that clipping is performed per optimizer parameter group rather than over all `self.modules.parameters()` at once. This avoids a secondary numerical instability where concatenating all parameters into a single vector for norm computation can itself cause overflow/underflow. Separation recipes implement their own clipping with a conditional gate (`clip_grad_norm >= 0`), and TTS recipes require a much tighter clip.
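The per-group pattern described above can be sketched as follows. This is a minimal illustration, not SpeechBrain's actual `Brain` code; `clip_per_group` is a hypothetical helper name.

```python
import torch

def clip_per_group(optimizer, max_grad_norm=5.0):
    """Clip gradients separately for each optimizer parameter group.

    Hypothetical helper mirroring the per-group strategy: each group's
    norm is computed and clipped independently, so no norm is ever
    taken over the concatenation of all model parameters at once.
    """
    norms = []
    for group in optimizer.param_groups:
        # clip_grad_norm_ scales this group's gradients in place and
        # returns the group's total gradient norm before clipping.
        norms.append(
            torch.nn.utils.clip_grad_norm_(group["params"], max_grad_norm)
        )
    return norms

# Toy usage: an optimizer with two parameter groups (weight and bias).
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(
    [{"params": [model.weight]}, {"params": [model.bias]}], lr=0.1
)
model(torch.randn(3, 4)).sum().backward()
group_norms = clip_per_group(opt, max_grad_norm=5.0)
```

After the call, every group's gradient norm is at most `max_grad_norm`, and each clipping decision depends only on that group's own (smaller, numerically safer) norm.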
Usage
This heuristic applies to all training in SpeechBrain. Override `max_grad_norm` via CLI argument or YAML config. Use `max_grad_norm=1.0` for Tacotron2 and TTS models. Separation recipes use `clip_grad_norm=5` in their own YAML configs.
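As a hedged sketch of the YAML side (the key names follow the text above; the surrounding hparams files and recipes are not shown):

```yaml
# Default ASR / speaker hparams: SpeechBrain's standard clip threshold.
max_grad_norm: 5.0

# Tacotron2 / TTS hparams: tighter clip for attention stability.
# max_grad_norm: 1.0

# Separation recipes use their own key (negative value disables clipping):
# clip_grad_norm: 5
```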
The Insight (Rule of Thumb)
- Action: Use the default `max_grad_norm=5.0` for ASR and speaker tasks. Set `max_grad_norm=1.0` for TTS (Tacotron2). Separation recipes set `clip_grad_norm: 5` in YAML.
- Value: 5.0 (default), 1.0 (TTS), 5 (separation)
- Trade-off: Overly aggressive clipping slows convergence; overly loose clipping permits gradient explosion. TTS models are especially sensitive.
- Per-optimizer clipping: Always clip per optimizer param group, never over all parameters at once, to avoid norm computation overflow.
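The separation-recipe gate mentioned above can be sketched like this. It is a hedged illustration of the pattern, not the recipe's exact `fit_batch`; the function name is hypothetical.

```python
import torch

def separation_train_step(model, optimizer, loss, clip_grad_norm=5.0):
    # Hypothetical step mirroring the separation-recipe gate: a negative
    # clip_grad_norm (e.g. -1) disables clipping entirely; >= 0 applies it.
    loss.backward()
    if clip_grad_norm >= 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)
    optimizer.step()
    optimizer.zero_grad()

# Toy usage with clip_grad_norm=5, matching the separation YAML default.
model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
before = model.weight.detach().clone()
loss = model(torch.ones(1, 2)).sum()
separation_train_step(model, opt, loss, clip_grad_norm=5.0)
```

The `>= 0` gate means `clip_grad_norm: 0` still clips (to zero), while any negative value skips clipping without a separate boolean flag.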
Reasoning
Deep speech models (especially Transformers, attention-based TTS, and recurrent networks) are prone to gradient explosion on certain batches. The per-optimizer-group approach was adopted after observing that computing `clip_grad_norm_` over the concatenation of all model parameters (potentially millions) could produce NaN/Inf in the norm itself due to float32 accumulation overflow. Tacotron2 uses a tighter clip of 1.0 because its autoregressive attention mechanism is particularly sensitive to large gradient updates, which can cause attention alignment failures.
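The overflow failure mode can be demonstrated directly with a toy example (not a SpeechBrain code path, and assuming `torch.linalg.vector_norm`'s plain sum-of-squares accumulation in the input dtype): the sum of squared entries can exceed float32's ~3.4e38 maximum even though every entry, and every per-group norm, is representable.

```python
import torch

# 1000 gradient entries of 1e18: each square (1e36) is representable
# in float32, but their sum (1e39) overflows, so the single global
# norm comes out as inf.
flat = torch.full((1000,), 1e18, dtype=torch.float32)
single_norm = torch.linalg.vector_norm(flat)

# Split into 100 groups of 10: each partial sum of squares is 1e37,
# well within float32 range, so every per-group norm is finite and
# each group can be clipped against its own norm.
per_group_norms = [torch.linalg.vector_norm(g) for g in flat.chunk(100)]
```

This is why per-group clipping sidesteps the instability: no step ever accumulates the full parameter set's squared magnitudes into one float32 sum.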
Related Pages
- Implementation:Speechbrain_Speechbrain_Brain_Fit_CTC
- Implementation:Speechbrain_Speechbrain_Separation_Fit_Batch
- Implementation:Speechbrain_Speechbrain_Tacotron2Brain_Compute_Forward
- Principle:Speechbrain_Speechbrain_CTC_Training_Loop
- Principle:Speechbrain_Speechbrain_Custom_Batch_Training_For_Separation