Heuristic: SpeechBrain Nonfinite Loss Handling
| Knowledge Sources | |
|---|---|
| Domains | Training_Stability, Debugging |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Nonfinite (NaN/Inf) loss tolerance mechanism that allows up to 3 failures per epoch before halting, along with automatic zeroing of individual nonfinite gradients.
Description
SpeechBrain's core training loop tolerates a configurable number of nonfinite (NaN or Inf) losses per epoch before raising an error. With the default `nonfinite_patience=3`, up to 3 NaN/Inf losses per epoch are skipped before training halts. Parameters whose gradients are nonfinite have those gradients zeroed rather than crashing the entire run. When using fp16 mixed precision with GradScaler, the `skip_nonfinite_grads` flag is redundant because GradScaler already skips nonfinite gradients. The error raised when patience is exhausted helpfully directs users to `torch.autograd.detect_anomaly()` for debugging.
Usage
This heuristic is active in all SpeechBrain training by default. Increase `nonfinite_patience` for models known to produce occasional NaN losses (e.g., attention-based models on very short sequences). Enable `skip_nonfinite_grads=True` for fp32 training where individual gradient NaNs should be masked.
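These knobs are run options on the `Brain` class. A hedged sketch of supplying them (the key names are assumed from SpeechBrain's documentation; verify against your installed version):

```python
# Hypothetical sketch: run option keys assumed from SpeechBrain's docs.
run_opts = {
    "nonfinite_patience": 10,      # tolerate more NaN/Inf losses per epoch
    "skip_nonfinite_grads": True,  # fp32 only; ignored when GradScaler is active
}
# brain = MyBrain(modules, opt_class, hparams, run_opts=run_opts)
```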
The Insight (Rule of Thumb)
- Action: Keep `nonfinite_patience=3` (default). Enable `skip_nonfinite_grads=True` only in fp32 mode (GradScaler handles it in fp16). If patience is exhausted, debug with `torch.autograd.detect_anomaly()`.
- Value: nonfinite_patience = 3 per epoch (resets each epoch)
- Trade-off: Too much patience masks genuine bugs. Too little wastes training on occasional numerical artifacts.
- Important: `skip_nonfinite_grads` is automatically ignored when GradScaler is enabled (fp16 precision), as GradScaler already performs this check.
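The tolerance policy above can be sketched in plain Python (a simplified illustration of the counting logic, not SpeechBrain's actual torch-based implementation):

```python
import math

def check_loss_isfinite(state, loss, nonfinite_patience=3):
    """Simplified sketch of the patience policy: count nonfinite losses,
    skip the batch while patience remains, raise once it is exhausted."""
    if not math.isfinite(loss):
        state["nonfinite_count"] += 1
        if state["nonfinite_count"] > nonfinite_patience:
            raise ValueError("Loss is not finite and patience is exhausted.")
        return False  # batch skipped
    return True  # batch used normally

state = {"nonfinite_count": 0}
[check_loss_isfinite(state, l) for l in [0.5, float("nan"), 0.4]]
# → [True, False, True]; three more NaNs in the same epoch would raise
```

Note that in SpeechBrain the counter resets each epoch, so the patience bounds NaNs per epoch, not over the whole run.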
Reasoning
NaN losses are common in speech model training due to: (1) attention collapse on very short sequences, (2) empty CTC targets, (3) numerical precision issues in log-softmax with very confident predictions, (4) exploding activations in recurrent models. Crashing on the first NaN would waste potentially hours of training. However, tolerating unlimited NaNs would hide fundamental bugs. The patience of 3 per epoch is a pragmatic compromise that handles occasional numerical hiccups while alerting users to systematic problems.
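A toy float64 illustration (not speech-specific) of the failure mode behind cause (3): an intermediate value overflows to infinity, and a subsequent subtraction of infinities, as in an unguarded log-sum-exp normalization, yields NaN:

```python
import math

big = 1e308 * 10        # overflows float64 → inf
loss = big - big        # inf - inf → nan
print(math.isnan(loss))  # True
```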
Code from `speechbrain/core.py:1213-1252`:
```python
def check_loss_isfinite(self, loss):
    if not torch.isfinite(loss):
        self.nonfinite_count += 1
        if self.nonfinite_count > self.nonfinite_patience:
            raise ValueError(
                "Loss is not finite and patience is exhausted. "
                "To debug, wrap `fit()` with "
                "autograd's `detect_anomaly()`, e.g.\n\nwith "
                "torch.autograd.detect_anomaly():\n\tbrain.fit(...)"
            )
```
GradScaler interaction from `speechbrain/core.py:750-755`:
```python
if self.skip_nonfinite_grads and gradscaler_enabled:
    logger.warning(
        "The option `skip_nonfinite_grads` will be ignored "
        "because GradScaler is enabled and will automatically "
        "skip nonfinite gradients."
    )
```
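The gradient-zeroing behaviour mentioned in the Description can be sketched on plain Python lists (SpeechBrain's actual implementation operates on torch parameter tensors):

```python
import math

def zero_nonfinite_grads(grads):
    """Sketch of the masking idea: replace NaN/Inf gradient entries with
    0.0 so one bad parameter does not abort the whole optimizer step."""
    return [g if math.isfinite(g) else 0.0 for g in grads]

zero_nonfinite_grads([0.1, float("nan"), -0.2, float("inf")])
# → [0.1, 0.0, -0.2, 0.0]
```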