Heuristic: SpeechBrain Nonfinite Loss Handling
| Knowledge Sources | |
|---|---|
| Domains | Training_Stability, Debugging |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Nonfinite (NaN/Inf) loss tolerance mechanism that allows up to 3 failures per epoch before halting, along with automatic zeroing of individual nonfinite gradients.
Description
SpeechBrain's core training loop tolerates a configurable number of nonfinite (NaN or Inf) losses per epoch before raising an error. With the default `nonfinite_patience=3`, up to 3 NaN/Inf losses per epoch are skipped before training halts. Parameters whose gradients are nonfinite have those gradients zeroed rather than crashing the entire run. When using fp16 mixed precision with GradScaler, the `skip_nonfinite_grads` flag is redundant because GradScaler already skips nonfinite gradients. The error raised when patience is exhausted helpfully directs users to `torch.autograd.detect_anomaly()` for debugging.
Usage
This heuristic is active in all SpeechBrain training by default. Increase `nonfinite_patience` for models known to produce occasional NaN losses (e.g., attention-based models on very short sequences). Enable `skip_nonfinite_grads=True` for fp32 training where individual gradient NaNs should be masked.
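These knobs are run options on the `Brain` class. A hedged sketch of supplying them (the key names are assumed from SpeechBrain's documentation; verify against your installed version):

```python
# Hypothetical sketch: run option keys assumed from SpeechBrain's docs.
run_opts = {
    "nonfinite_patience": 10,      # tolerate more NaN/Inf losses per epoch
    "skip_nonfinite_grads": True,  # fp32 only; ignored when GradScaler is active
}
# brain = MyBrain(modules, opt_class, hparams, run_opts=run_opts)
```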
The Insight (Rule of Thumb)
- Action: Keep `nonfinite_patience=3` (default). Enable `skip_nonfinite_grads=True` only in fp32 mode (GradScaler handles it in fp16). If patience is exhausted, debug with `torch.autograd.detect_anomaly()`.
- Value: nonfinite_patience = 3 per epoch (resets each epoch)
- Trade-off: Too much patience masks genuine bugs. Too little wastes training on occasional numerical artifacts.
- Important: `skip_nonfinite_grads` is automatically ignored when GradScaler is enabled (fp16 precision), as GradScaler already performs this check.
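The tolerance policy above can be sketched in plain Python (a simplified illustration of the counting logic, not SpeechBrain's actual torch-based implementation):

```python
import math

def check_loss_isfinite(state, loss, nonfinite_patience=3):
    """Simplified sketch of the patience policy: count nonfinite losses,
    skip the batch while patience remains, raise once it is exhausted."""
    if not math.isfinite(loss):
        state["nonfinite_count"] += 1
        if state["nonfinite_count"] > nonfinite_patience:
            raise ValueError("Loss is not finite and patience is exhausted.")
        return False  # batch skipped
    return True  # batch used normally

state = {"nonfinite_count": 0}
[check_loss_isfinite(state, l) for l in [0.5, float("nan"), 0.4]]
# → [True, False, True]; three more NaNs in the same epoch would raise
```

Note that in SpeechBrain the counter resets each epoch, so the patience bounds NaNs per epoch, not over the whole run.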
Reasoning
NaN losses are common in speech model training due to: (1) attention collapse on very short sequences, (2) empty CTC targets, (3) numerical precision issues in log-softmax with very confident predictions, (4) exploding activations in recurrent models. Crashing on the first NaN would waste potentially hours of training. However, tolerating unlimited NaNs would hide fundamental bugs. The patience of 3 per epoch is a pragmatic compromise that handles occasional numerical hiccups while alerting users to systematic problems.
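A toy float64 illustration (not speech-specific) of the failure mode behind cause (3): an intermediate value overflows to infinity, and a subsequent subtraction of infinities, as in an unguarded log-sum-exp normalization, yields NaN:

```python
import math

big = 1e308 * 10        # overflows float64 → inf
loss = big - big        # inf - inf → nan
print(math.isnan(loss))  # True
```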
Code from `speechbrain/core.py:1213-1252`:
```python
def check_loss_isfinite(self, loss):
    if not torch.isfinite(loss):
        self.nonfinite_count += 1
        if self.nonfinite_count > self.nonfinite_patience:
            raise ValueError(
                "Loss is not finite and patience is exhausted. "
                "To debug, wrap `fit()` with "
                "autograd's `detect_anomaly()`, e.g.\n\nwith "
                "torch.autograd.detect_anomaly():\n\tbrain.fit(...)"
            )
```
GradScaler interaction from `speechbrain/core.py:750-755`:
```python
if self.skip_nonfinite_grads and gradscaler_enabled:
    logger.warning(
        "The option `skip_nonfinite_grads` will be ignored "
        "because GradScaler is enabled and will automatically "
        "skip nonfinite gradients."
    )
```
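The gradient-zeroing behaviour mentioned in the Description can be sketched on plain Python lists (SpeechBrain's actual implementation operates on torch parameter tensors):

```python
import math

def zero_nonfinite_grads(grads):
    """Sketch of the masking idea: replace NaN/Inf gradient entries with
    0.0 so one bad parameter does not abort the whole optimizer step."""
    return [g if math.isfinite(g) else 0.0 for g in grads]

zero_nonfinite_grads([0.1, float("nan"), -0.2, float("inf")])
# → [0.1, 0.0, -0.2, 0.0]
```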