Workflow:Speechbrain Speechbrain Speech Separation Training
| Knowledge Sources | |
|---|---|
| Domains | Speech_Separation, Speech_Processing, Deep_Learning |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for training a SepFormer speech separation model in SpeechBrain to isolate individual speakers from mixed audio signals.
Description
This workflow covers the procedure for training speech separation models that extract individual source signals from multi-speaker mixtures. It uses the SepFormer architecture (a dual-path transformer-based separator) with Scale-Invariant Signal-to-Noise Ratio (SI-SNR) loss. The process spans mixture dataset preparation, data augmentation (speed perturbation, dynamic mixing), custom batch processing with nonfinite loss handling, and evaluation with SI-SNR metrics. The recipe supports 2-speaker and 3-speaker separation on LibriMix, Aishell1Mix, WSJ0Mix, and binaural variants with optional noise conditions (WHAM!, WHAMR!).
Usage
Execute this workflow when you have multi-speaker audio mixtures and need to train a model that separates them into individual clean sources. This is appropriate for cocktail-party scenarios, preprocessing for downstream ASR on overlapping speech, or building source separation components for meeting transcription systems. The workflow requires a mixture dataset (e.g., LibriMix), either pre-generated or built in Step 1, that contains both the mixed signal and the individual source references.
Execution Steps
Step 1: Mixture Dataset Preparation
Prepare the separation dataset by either generating mixtures from single-speaker corpora or loading pre-generated mixture datasets. Preparation scripts create CSV manifests mapping mixture audio files to their corresponding individual source files. For datasets like LibriMix and WSJ0Mix, this involves pairing clean utterances at specified signal-to-noise ratios and optionally adding background noise.
Key considerations:
- CSV manifests must include paths to both the mixture and each individual source
- Duration filtering removes utterances outside acceptable length bounds
- Sorting by duration enables efficient batching (similar-length utterances grouped together)
- For binaural separation, spatial configurations (independent, cross-channel, parallel) require separate setup
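The manifest logic described above can be sketched as follows. This is a minimal illustration, not the recipe's preparation script: the column names (ID, duration, mix_wav, s1_wav, s2_wav), the file paths, and the duration bounds are all hypothetical.

```python
import csv
import io

# Hypothetical column layout: one mixture path plus one path per source.
FIELDS = ["ID", "duration", "mix_wav", "s1_wav", "s2_wav"]

def build_manifest(rows, min_dur=1.0, max_dur=20.0):
    """Drop rows outside the duration bounds, sort by duration, return CSV text."""
    kept = [r for r in rows if min_dur <= r["duration"] <= max_dur]
    kept.sort(key=lambda r: r["duration"])  # similar lengths end up adjacent
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return buf.getvalue()

def make_row(utt_id, duration):
    """Build one manifest row with made-up relative paths."""
    return {
        "ID": utt_id,
        "duration": duration,
        "mix_wav": f"mix/{utt_id}.wav",
        "s1_wav": f"s1/{utt_id}.wav",
        "s2_wav": f"s2/{utt_id}.wav",
    }

rows = [make_row("utt_a", 6.2), make_row("utt_b", 0.4), make_row("utt_c", 3.1)]
manifest = build_manifest(rows)
print(manifest)
```

Here utt_b falls below the assumed minimum duration and is filtered out, and the remaining rows are written shortest first.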
Step 2: Data Augmentation Configuration
Configure augmentation strategies specific to speech separation training. Dynamic mixing creates new mixture combinations on-the-fly during training, dramatically increasing effective dataset size. Speed perturbation randomly adjusts playback speed of source signals before mixing. Amplitude scaling and random time-shifting provide additional variation.
Key considerations:
- Dynamic mixing requires access to original single-speaker source directories
- Speed perturbation factors are typically sampled from a small range (0.95-1.05)
- Augmentation is applied only during training, not validation or testing
- Source files used for dynamic mixing are resampled to the training sample rate in a separate preprocessing step
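The sketch below illustrates dynamic mixing with speed perturbation and amplitude scaling on plain Python lists standing in for waveforms. It is an assumption-laden toy, not SpeechBrain's augmentation code: the linear-interpolation resampler, the gain range in dB, and the function names are hypothetical, and random time-shifting is left out for brevity.

```python
import random

def speed_perturb(wav, factor):
    """Linearly interpolate so the signal plays `factor` times faster."""
    n_in = len(wav)
    n_out = max(1, round(n_in / factor))
    out = []
    for i in range(n_out):
        pos = i * (n_in - 1) / max(n_out - 1, 1)  # fractional read position
        lo = int(pos)
        hi = min(lo + 1, n_in - 1)
        frac = pos - lo
        out.append((1.0 - frac) * wav[lo] + frac * wav[hi])
    return out

def dynamic_mix(sources, rng, speeds=(0.95, 1.0, 1.05), gain_db=(-5.0, 5.0)):
    """Perturb each source, crop to a common length, and sum into a new mixture."""
    perturbed = []
    for wav in sources:
        wav = speed_perturb(wav, rng.choice(speeds))
        gain = 10.0 ** (rng.uniform(*gain_db) / 20.0)  # assumed gain range
        perturbed.append([gain * v for v in wav])
    n = min(len(w) for w in perturbed)
    targets = [w[:n] for w in perturbed]  # training references
    mixture = [sum(frame) for frame in zip(*targets)]
    return mixture, targets

rng = random.Random(0)
s1 = [rng.gauss(0.0, 1.0) for _ in range(8000)]
s2 = [rng.gauss(0.0, 1.0) for _ in range(8000)]
mixture, targets = dynamic_mix([s1, s2], rng)
print(len(mixture), len(targets))
```

Because the mixture is rebuilt from the perturbed sources on every call, the references always match the mixture exactly, which is why dynamic mixing needs the original single-speaker signals rather than a stored mixture.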
Step 3: HyperPyYAML Model Configuration
Define the SepFormer architecture and training configuration via HyperPyYAML. The configuration specifies the encoder (convolutional front-end), the mask network (dual-path transformer blocks with intra-chunk and inter-chunk attention), the decoder, optimizer settings, and loss function. Training parameters include learning rate, batch size, gradient clipping, and number of epochs.
Key considerations:
- The encoder converts raw waveform to a latent representation
- Dual-path processing splits the sequence into chunks for efficient attention
- The number of output masks equals the number of speakers to separate
- Gradient clipping is essential for stable separation training
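To make the chunking idea concrete, here is a simplified sketch of splitting an encoded sequence into 50%-overlapping chunks. It assumes an even chunk size, pads only the tail with zeros, and treats each frame as a single number rather than a feature vector; the function name and these simplifications are illustrative and not taken from the SepFormer implementation.

```python
def segment(frames, chunk_size):
    """Split a list of frames into 50%-overlapping chunks, zero-padding the tail."""
    hop = chunk_size // 2
    n = len(frames)
    if n < chunk_size:
        pad = chunk_size - n
    else:
        pad = (hop - (n - chunk_size) % hop) % hop
    padded = list(frames) + [0.0] * pad
    n_chunks = (len(padded) - chunk_size) // hop + 1
    return [padded[i * hop : i * hop + chunk_size] for i in range(n_chunks)]

frames = [float(t) for t in range(10)]
chunks = segment(frames, chunk_size=4)

# Intra-chunk attention would run within each chunk (sequence length = chunk_size).
intra_sequences = chunks
# Inter-chunk attention would run across chunks at a fixed position
# inside the chunk (sequence length = number of chunks).
inter_sequences = [list(col) for col in zip(*chunks)]

print(len(chunks), len(chunks[0]))
print(inter_sequences[0])
```

The point of the split is that attention is applied to sequences of length chunk_size and of length equal to the number of chunks, both much shorter than the full encoded sequence.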
Step 4: Brain Subclass With Custom Batch Processing
Instantiate the Separation Brain subclass, which overrides fit_batch() and evaluate_batch() in addition to the standard compute_forward() and compute_objectives(). The custom batch processing detects nonfinite losses (common in separation because of numerical instability), applies gradient clipping, and manages the forward pass through the encoder, mask network, and decoder.
Key considerations:
- Custom fit_batch() includes nonfinite loss detection with batch skipping
- Gradient clipping prevents exploding gradients common in separation tasks
- compute_forward() applies augmentations, encodes mixture, estimates masks, and decodes sources
- compute_objectives() computes SI-SNR loss with permutation-invariant training (PIT)
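The guard logic inside a custom fit_batch() can be sketched in a framework-free way as below. This is not the recipe's code: parameters and gradients are plain lists of floats, the update is a bare SGD step, and the names guarded_step, clip_grad_norm, lr, and max_norm are hypothetical. In PyTorch, the analogous building blocks are torch.isfinite and torch.nn.utils.clip_grad_norm_.

```python
import math

def clip_grad_norm(grads, max_norm):
    """Return gradients rescaled so their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]

def guarded_step(params, grads, loss, lr=0.1, max_norm=5.0):
    """Return (new_params, updated). A nonfinite loss skips the batch entirely."""
    if not math.isfinite(loss):
        return list(params), False  # leave parameters untouched
    clipped = clip_grad_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)], True

params = [0.0, 0.0, 0.0]
# Finite loss: the gradient (norm 50) is clipped to norm 5 before the update.
params, first_updated = guarded_step(params, [30.0, 40.0, 0.0], loss=1.23)
# NaN loss: the batch is skipped and the parameters are returned unchanged.
params, second_updated = guarded_step(params, [1.0, 1.0, 1.0], loss=float("nan"))
print(params, first_updated, second_updated)
```

Skipping the batch rather than raising an error lets a long training run survive an occasional numerically unstable example.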
Step 5: Permutation Invariant Training
Execute training with permutation-invariant loss computation. Since the assignment of estimated sources to reference sources is unknown, the loss function evaluates all possible permutation orderings and selects the one with the lowest total loss. This ensures the model is not penalized for outputting sources in a different order than the reference.
Key considerations:
- SI-SNR is computed for each possible source-to-reference assignment
- The optimal permutation is selected per batch element
- The negative SI-SNR is used as the loss (maximizing SI-SNR = minimizing negative SI-SNR)
- Validation tracks SI-SNR improvement (SI-SNRi) over the input mixture
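The permutation search can be illustrated with a small pure-Python sketch for a single batch element. It assumes the common zero-mean SI-SNR definition with a small eps for numerical safety; the function names, the eps value, and the synthetic sine signals are illustrative and not taken from the recipe.

```python
import itertools
import math

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length signals."""
    est_mean = sum(est) / len(est)
    ref_mean = sum(ref) / len(ref)
    est = [v - est_mean for v in est]
    ref = [v - ref_mean for v in ref]
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]  # projection of est onto ref
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))

def pit_loss(estimates, references):
    """Return (loss, perm): mean negative SI-SNR under the best assignment.

    perm[i] is the index of the estimate assigned to reference i.
    """
    n_src = len(references)
    best_loss, best_perm = math.inf, None
    for perm in itertools.permutations(range(n_src)):
        total = sum(si_snr(estimates[p], references[i]) for i, p in enumerate(perm))
        loss = -total / n_src
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm

n = 200
s1 = [math.sin(2 * math.pi * 5 * t / n) for t in range(n)]
s2 = [math.sin(2 * math.pi * 13 * t / n) for t in range(n)]
# Estimates come out in swapped order, rescaled, with some leakage.
est_a = [2.0 * b + 0.1 * a for a, b in zip(s1, s2)]
est_b = [0.5 * a + 0.05 * b for a, b in zip(s1, s2)]
loss, perm = pit_loss([est_a, est_b], [s1, s2])
print(loss, perm)
```

The search enumerates all orderings, so its cost grows factorially with the number of sources; for the 2- and 3-speaker cases covered here that means only 2 or 6 candidates.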
Step 6: Evaluation and Source Quality Assessment
Evaluate the trained model by separating test mixtures and computing SI-SNR improvement over the unseparated mixture. The best checkpoint is loaded based on validation SI-SNR, and test results are reported per utterance. Optionally, separated audio files can be saved for subjective listening evaluation.
Key considerations:
- SI-SNRi measures improvement over the input mixture baseline
- Results are computed per utterance and averaged across the test set
- The separated waveforms can be saved for further analysis or downstream processing
- Evaluation uses the same permutation-invariant assignment as training
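The per-utterance reporting described above can be sketched as follows, taking SI-SNRi as the SI-SNR of the separated output minus the SI-SNR of the unprocessed mixture, both measured against the references. The field names (snt_id, si_snr, si_snr_mix, si_snr_i) and the numbers are placeholders for illustration only, not measured results or the recipe's actual output format.

```python
import csv
import io
import statistics

def summarize(results):
    """Compute SI-SNRi per utterance and append an averaged row; return CSV text and the average."""
    rows = [
        {
            "snt_id": r["snt_id"],
            "si_snr": r["si_snr"],
            "si_snr_i": r["si_snr"] - r["si_snr_mix"],  # gain over the mixture
        }
        for r in results
    ]
    avg_i = statistics.mean(row["si_snr_i"] for row in rows)
    avg = statistics.mean(row["si_snr"] for row in rows)
    rows.append({"snt_id": "avg", "si_snr": avg, "si_snr_i": avg_i})
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["snt_id", "si_snr", "si_snr_i"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue(), avg_i

# Placeholder values in dB, chosen only to make the arithmetic easy to follow.
results = [
    {"snt_id": "utt_a", "si_snr": 12.0, "si_snr_mix": 0.0},
    {"snt_id": "utt_b", "si_snr": 8.0, "si_snr_mix": -2.0},
]
report, mean_si_snr_i = summarize(results)
print(report)
```

Reporting the improvement rather than the raw SI-SNR makes utterances with different input difficulty comparable, since each one is measured against its own mixture baseline.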