Heuristic:Speechbrain Speechbrain Gradient Clipping Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Training_Stability |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Per-optimizer-group gradient clipping with `max_grad_norm=5.0` by default, tightened to 1.0 for TTS models such as Tacotron2.
Description
SpeechBrain applies gradient clipping via `torch.nn.utils.clip_grad_norm_` to prevent gradient explosion during training. A critical implementation detail is that clipping is performed per optimizer parameter group rather than over all `self.modules.parameters()` at once. This avoids a secondary numerical instability where concatenating all parameters into a single vector for norm computation can itself cause overflow/underflow. Separation recipes implement their own clipping with a conditional gate (`clip_grad_norm >= 0`), and TTS recipes require a much tighter clip.
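The per-group pattern described above can be sketched as follows. This is a minimal illustration, not SpeechBrain's actual `Brain` code; `clip_per_group` is a hypothetical helper name.

```python
import torch

def clip_per_group(optimizer, max_grad_norm=5.0):
    """Clip gradients separately for each optimizer parameter group.

    Hypothetical helper mirroring the per-group strategy: each group's
    norm is computed and clipped independently, so no norm is ever
    taken over the concatenation of all model parameters at once.
    """
    norms = []
    for group in optimizer.param_groups:
        # clip_grad_norm_ scales this group's gradients in place and
        # returns the group's total gradient norm before clipping.
        norms.append(
            torch.nn.utils.clip_grad_norm_(group["params"], max_grad_norm)
        )
    return norms

# Toy usage: an optimizer with two parameter groups (weight and bias).
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(
    [{"params": [model.weight]}, {"params": [model.bias]}], lr=0.1
)
model(torch.randn(3, 4)).sum().backward()
group_norms = clip_per_group(opt, max_grad_norm=5.0)
```

After the call, every group's gradient norm is at most `max_grad_norm`, and each clipping decision depends only on that group's own (smaller, numerically safer) norm.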
Usage
This heuristic applies to all training in SpeechBrain. Override `max_grad_norm` via CLI argument or YAML config. Use `max_grad_norm=1.0` for Tacotron2 and TTS models. Separation recipes use `clip_grad_norm=5` in their own YAML configs.
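As a hedged sketch of the YAML side (the key names follow the text above; the surrounding hparams files and recipes are not shown):

```yaml
# Default ASR / speaker hparams: SpeechBrain's standard clip threshold.
max_grad_norm: 5.0

# Tacotron2 / TTS hparams: tighter clip for attention stability.
# max_grad_norm: 1.0

# Separation recipes use their own key (negative value disables clipping):
# clip_grad_norm: 5
```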
The Insight (Rule of Thumb)
- Action: Use the default `max_grad_norm=5.0` for ASR and speaker tasks. Set `max_grad_norm=1.0` for TTS (Tacotron2). Separation recipes set `clip_grad_norm: 5` in YAML.
- Value: 5.0 (default), 1.0 (TTS), 5 (separation)
- Trade-off: Overly aggressive clipping slows convergence; overly loose clipping permits gradient explosion. TTS models are especially sensitive.
- Per-optimizer clipping: Always clip per optimizer param group, never over all parameters at once, to avoid norm computation overflow.
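The separation-recipe gate mentioned above can be sketched like this. It is a hedged illustration of the pattern, not the recipe's exact `fit_batch`; the function name is hypothetical.

```python
import torch

def separation_train_step(model, optimizer, loss, clip_grad_norm=5.0):
    # Hypothetical step mirroring the separation-recipe gate: a negative
    # clip_grad_norm (e.g. -1) disables clipping entirely; >= 0 applies it.
    loss.backward()
    if clip_grad_norm >= 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_grad_norm)
    optimizer.step()
    optimizer.zero_grad()

# Toy usage with clip_grad_norm=5, matching the separation YAML default.
model = torch.nn.Linear(2, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
before = model.weight.detach().clone()
loss = model(torch.ones(1, 2)).sum()
separation_train_step(model, opt, loss, clip_grad_norm=5.0)
```

The `>= 0` gate means `clip_grad_norm: 0` still clips (to zero), while any negative value skips clipping without a separate boolean flag.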
Reasoning
Deep speech models (especially Transformers, attention-based TTS, and recurrent networks) are prone to gradient explosion on certain batches. The per-optimizer-group approach was adopted after observing that computing `clip_grad_norm_` over the concatenation of all model parameters (potentially millions) could produce NaN/Inf in the norm itself due to float32 accumulation overflow. Tacotron2 uses a tighter clip of 1.0 because its autoregressive attention mechanism is particularly sensitive to large gradient updates, which can cause attention alignment failures.
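The overflow failure mode can be demonstrated directly with a toy example (not a SpeechBrain code path, and assuming `torch.linalg.vector_norm`'s plain sum-of-squares accumulation in the input dtype): the sum of squared entries can exceed float32's ~3.4e38 maximum even though every entry, and every per-group norm, is representable.

```python
import torch

# 1000 gradient entries of 1e18: each square (1e36) is representable
# in float32, but their sum (1e39) overflows, so the single global
# norm comes out as inf.
flat = torch.full((1000,), 1e18, dtype=torch.float32)
single_norm = torch.linalg.vector_norm(flat)

# Split into 100 groups of 10: each partial sum of squares is 1e37,
# well within float32 range, so every per-group norm is finite and
# each group can be clipped against its own norm.
per_group_norms = [torch.linalg.vector_norm(g) for g in flat.chunk(100)]
```

This is why per-group clipping sidesteps the instability: no step ever accumulates the full parameter set's squared magnitudes into one float32 sum.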
Related Pages
- Implementation:Speechbrain_Speechbrain_Brain_Fit_CTC
- Implementation:Speechbrain_Speechbrain_Separation_Fit_Batch
- Implementation:Speechbrain_Speechbrain_Tacotron2Brain_Compute_Forward
- Principle:Speechbrain_Speechbrain_CTC_Training_Loop
- Principle:Speechbrain_Speechbrain_Custom_Batch_Training_For_Separation