Heuristic: SpeechBrain Data Augmentation Defaults
| Knowledge Sources | |
|---|---|
| Domains | Data_Augmentation, Speech_Recognition |
| Last Updated | 2026-02-09 20:00 GMT |
Overview
Default data augmentation parameters for SpeechBrain: speed perturbation at [90, 100, 110]% of the original rate, frequency drops 5% of the bandwidth wide, and time-chunk drops of 100-1000 samples.
Description
SpeechBrain provides a suite of time-domain augmentation transforms (SpeedPerturb, AddNoise, DropFreq, DropChunk) with empirically derived defaults from the Kaldi/ESPnet ASR literature. Speed perturbation of +/- 10% is typically the most impactful single augmentation for ASR. DropFreq and DropChunk implement SpecAugment-style masking in the waveform domain. The Augmenter class orchestrates multiple augmentations with configurable min/max selection.
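The min/max selection idea can be illustrated with a short sketch. This is a pure-Python stand-in for the Augmenter concept, not SpeechBrain's actual class or API: it picks between `min_augmentations` and `max_augmentations` transforms at random and applies them in sequence.

```python
import random

def augment(waveform, transforms, min_augmentations=1, max_augmentations=2, seed=0):
    """Illustrative sketch of Augmenter-style selection (not SpeechBrain's
    implementation): randomly pick k transforms, then apply them in order."""
    rng = random.Random(seed)
    k = rng.randint(min_augmentations, max_augmentations)
    chosen = rng.sample(transforms, k)
    out = waveform
    for transform in chosen:
        out = transform(out)
    return out

# Example with two toy "augmentations" on a list of samples.
half_volume = lambda w: [x * 0.5 for x in w]
reverse = lambda w: list(reversed(w))
print(augment([1.0, 2.0, 3.0], [half_volume, reverse]))
```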
Usage
Apply to ASR training and speaker verification training recipes. Configure augmentations in the YAML hyperparameter files. The defaults work well for most tasks; adjust for domain-specific needs.
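A sketch of how these defaults might appear in a recipe's YAML hyperparameter file, using SpeechBrain's HyperPyYAML `!new:` convention. The exact parameter names (`drop_length_low`, `drop_freq_count_high`, etc.) are assumptions based on recent SpeechBrain releases and may differ across versions:

```yaml
# Hypothetical hyperparameter excerpt; check your SpeechBrain version's
# speechbrain.augment.time_domain docstrings for exact argument names.
speed_perturb: !new:speechbrain.augment.time_domain.SpeedPerturb
    orig_freq: 16000
    speeds: [90, 100, 110]

drop_freq: !new:speechbrain.augment.time_domain.DropFreq
    drop_freq_count_low: 1
    drop_freq_count_high: 3

drop_chunk: !new:speechbrain.augment.time_domain.DropChunk
    drop_length_low: 100
    drop_length_high: 1000
    drop_count_low: 1
    drop_count_high: 3
```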
The Insight (Rule of Thumb)
- Action: Use `speeds=[90, 100, 110]` for SpeedPerturb. Use `drop_freq_width=0.05` and `drop_freq_count=1-3` for DropFreq. Use `drop_length=100-1000` and `drop_count=1-3` for DropChunk.
- Value: Speed: 90/100/110%; FreqDrop: 5% width, 1-3 bands; ChunkDrop: 100-1000 samples, 1-3 chunks.
- Trade-off: More aggressive augmentation improves generalization but can slow convergence and hurt performance on clean test sets.
- Warning: `AddNoise` with `pad_noise=True` can be extremely slow if noise clips are much shorter than clean signals due to a while-loop padding mechanism.
Reasoning
Speed perturbation at +/- 10% is proven in the Kaldi literature (Ko et al., 2015) to be one of the most effective augmentations for ASR, as it simulates natural speaking-rate variation. The 100 in the speeds list means some samples are kept at the original speed, preventing the augmentation from being too aggressive. DropFreq with 5% bandwidth (1-3 bands) forces models not to rely on narrow frequency features. DropChunk at 100-1000 samples (6-62 ms at 16 kHz) is shorter than most phonemes, ensuring partial but not total information loss.
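The chunk-length arithmetic and the masking effect can be checked with a small pure-Python sketch. The function name and the uniform sampling scheme here are illustrative, not SpeechBrain's exact DropChunk implementation:

```python
import random

def drop_chunks(waveform, drop_count_range=(1, 3), drop_length_range=(100, 1000), seed=0):
    """Zero out a few short chunks of a 1-D waveform (list of samples),
    mimicking SpeechBrain's DropChunk defaults. Illustrative sketch only."""
    rng = random.Random(seed)
    out = list(waveform)
    n_chunks = rng.randint(*drop_count_range)
    for _ in range(n_chunks):
        length = rng.randint(*drop_length_range)
        start = rng.randint(0, max(0, len(out) - length))
        for i in range(start, min(len(out), start + length)):
            out[i] = 0.0
    return out

# At 16 kHz, the default 100-1000 sample chunks span 6.25-62.5 ms,
# well below typical phoneme durations (~50-200 ms).
sr = 16000
print(100 / sr * 1000, "ms to", 1000 / sr * 1000, "ms")
```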
Code from `speechbrain/augment/time_domain.py:449-479`:

```python
class SpeedPerturb(torch.nn.Module):
    def __init__(self, orig_freq, speeds=[90, 100, 110], device="cpu"):
        ...
```
AddNoise padding warning from `speechbrain/augment/time_domain.py:232`:

```python
# WARNING: THIS COULD BE SLOW IF THERE ARE VERY SHORT NOISES
if self.pad_noise:
    while torch.any(noise_len < lengths):
        min_len = torch.min(noise_len)
        prepend = noise_batch[:, :min_len]
        noise_batch = torch.cat((prepend, noise_batch), axis=1)
        noise_len += min_len
```