Workflow: SpeechBrain Speech Separation Training

Knowledge Sources	SpeechBrain, SpeechBrain Docs
Domains	Speech_Separation, Speech_Processing, Deep_Learning
Last Updated	2026-02-09 19:00 GMT

Overview

End-to-end process for training a SepFormer speech separation model in SpeechBrain that isolates individual speakers from mixed audio signals.

Description

This workflow covers the procedure for training speech separation models that extract individual source signals from multi-speaker mixtures. It uses the SepFormer architecture (a dual-path transformer-based separator) with a Scale-Invariant Signal-to-Noise Ratio (SI-SNR) loss. The process spans mixture dataset preparation, data augmentation (speed perturbation, dynamic mixing), custom batch processing with non-finite loss handling, and evaluation with SI-SNR metrics. The recipes support 2-speaker and 3-speaker separation on LibriMix, Aishell1Mix, WSJ0Mix, and binaural variants, with optional noisy and reverberant conditions (WHAM!, WHAMR!).

Usage

Execute this workflow when you have multi-speaker audio mixtures and need to train a model that separates them into individual clean sources. It is appropriate for cocktail-party scenarios, for preprocessing overlapping speech ahead of downstream ASR, and for building source separation components of meeting transcription systems. The workflow requires pre-generated mixture datasets (e.g., LibriMix) containing both the mixed signal and the individual source references.

Execution Steps

Step 1: Mixture Dataset Preparation

Prepare the separation dataset by either generating mixtures from single-speaker corpora or loading pre-generated mixture datasets. Preparation scripts create CSV manifests mapping mixture audio files to their corresponding individual source files. For datasets like LibriMix and WSJ0Mix, this involves pairing clean utterances at specified signal-to-noise ratios and optionally adding background noise.

Key considerations:

CSV manifests must include paths to both the mixture and each individual source
Duration filtering removes utterances outside acceptable length bounds
Sorting by duration enables efficient batching (similar-length utterances grouped together)
For binaural separation, spatial configurations (independent, cross-channel, parallel) require separate setup
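
The manifest considerations above can be sketched in plain Python. This is a minimal illustration, not the recipe's preparation script: the column names (`ID`, `duration`, `mix_wav`, `s1_wav`, `s2_wav`), the file paths, and the duration bounds are all assumptions chosen for the example.

```python
import csv
import io

# Hypothetical column names; a real recipe may use a different schema.
FIELDS = ["ID", "duration", "mix_wav", "s1_wav", "s2_wav"]


def build_manifest(entries, min_dur=1.0, max_dur=20.0):
    """Drop rows outside the duration bounds (seconds) and sort shortest first."""
    kept = [e for e in entries if min_dur <= e["duration"] <= max_dur]
    return sorted(kept, key=lambda e: e["duration"])


def write_manifest(rows, handle):
    """Write the rows as CSV, one mixture per line with its source paths."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def make_entry(utt_id, duration):
    """Build an illustrative row; the paths are placeholders, not real files."""
    return {
        "ID": utt_id,
        "duration": duration,
        "mix_wav": f"mix/{utt_id}.wav",
        "s1_wav": f"s1/{utt_id}.wav",
        "s2_wav": f"s2/{utt_id}.wav",
    }


entries = [make_entry("utt1", 7.5), make_entry("utt2", 0.4), make_entry("utt3", 3.2)]
rows = build_manifest(entries)

buffer = io.StringIO()  # stands in for an open CSV file
write_manifest(rows, buffer)
manifest_text = buffer.getvalue()
```

Here `utt2` is filtered out for being too short, and the remaining rows are ordered by duration so that neighbouring rows have similar lengths.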

Step 2: Data Augmentation Configuration

Configure augmentation strategies specific to speech separation training. Dynamic mixing creates new mixture combinations on the fly during training, so the model rarely sees the same mixture twice and the effective dataset size grows well beyond the pre-generated set. Speed perturbation randomly adjusts the playback speed of source signals before mixing. Amplitude scaling and random time-shifting provide additional variation.

Key considerations:

Dynamic mixing requires access to original single-speaker source directories
Speed perturbation factors are typically sampled from a small range (0.95-1.05)
Augmentation is applied only during training, not validation or testing
Source audio is resampled in a separate preprocessing step before it enters the augmentation pipeline
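
As a rough illustration of dynamic mixing, the sketch below draws random sources from a pool, applies a random time shift and gain to each, and sums them into a new mixture. It uses plain Python lists in place of audio tensors; the function names, the gain range, and the crop length are assumptions for the example, and speed perturbation is omitted because it requires resampling.

```python
import random


def rescale(signal, gain_db):
    """Apply a gain expressed in decibels to a list of samples."""
    factor = 10 ** (gain_db / 20)
    return [x * factor for x in signal]


def dynamic_mix(pool, num_speakers=2, length=8, gain_range=(-5.0, 0.0), rng=random):
    """Pick random sources, crop at a random offset, rescale, and sum them.

    Every signal in `pool` must contain at least `length` samples.
    """
    picked = rng.sample(pool, num_speakers)
    sources = []
    for src in picked:
        start = rng.randint(0, len(src) - length)  # random time shift
        crop = src[start:start + length]
        sources.append(rescale(crop, rng.uniform(*gain_range)))
    # The mixture is the sample-wise sum of the rescaled sources.
    mixture = [sum(samples) for samples in zip(*sources)]
    return mixture, sources


rng = random.Random(0)  # seeded only to make the example repeatable
pool = [[float(k)] * 16 for k in (1, 2, 3)]  # toy single-speaker signals
mixture, sources = dynamic_mix(pool, rng=rng)
```

Because the sources are returned alongside the mixture, they can serve directly as training targets for the newly created example.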

Step 3: HyperPyYAML Model Configuration

Define the SepFormer architecture and training configuration via HyperPyYAML. The configuration specifies the encoder (convolutional front-end), the mask network (dual-path transformer blocks with intra-chunk and inter-chunk attention), the decoder, optimizer settings, and loss function. Training parameters include learning rate, batch size, gradient clipping, and number of epochs.

Key considerations:

The encoder converts raw waveform to a latent representation
Dual-path processing splits the sequence into chunks for efficient attention
The number of output masks equals the number of speakers to separate
Gradient clipping is essential for stable separation training
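
The chunking behind dual-path processing can be sketched as follows. This is a simplified illustration on a flat list of frames: it uses 50% overlap and zero-pads only the tail, which is an assumption of the example and not necessarily the padding scheme used by the SepFormer implementation.

```python
def segment(frames, chunk_size):
    """Split frames into 50%-overlapping chunks, zero-padding the tail.

    `chunk_size` is assumed to be an even number >= 2.
    """
    hop = chunk_size // 2
    n = len(frames)
    if n < chunk_size:
        pad = chunk_size - n
    else:
        # Pad so the last chunk ends exactly at the end of the sequence.
        pad = (-(n - chunk_size)) % hop
    padded = list(frames) + [0.0] * pad
    return [
        padded[i:i + chunk_size]
        for i in range(0, len(padded) - chunk_size + 1, hop)
    ]


chunks = segment(list(range(9)), chunk_size=4)

# Intra-chunk attention would operate along each row of `chunks`;
# inter-chunk attention along each column (same position across chunks).
inter = [list(col) for col in zip(*chunks)]
```

Attention over a sequence of length N is thereby replaced by attention over rows of length `chunk_size` and columns whose length is the number of chunks, both much shorter than N.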

Step 4: Brain Subclass With Custom Batch Processing

Initialize the Separation Brain subclass, which overrides fit_batch() and evaluate_batch() in addition to the standard compute_forward() and compute_objectives(). The custom batch processing detects non-finite losses (common in separation because of numerical instability), applies gradient clipping, and manages the forward pass through the encoder, mask network, and decoder.

Key considerations:

Custom fit_batch() includes nonfinite loss detection with batch skipping
Gradient clipping prevents exploding gradients common in separation tasks
compute_forward() applies augmentations, encodes mixture, estimates masks, and decodes sources
compute_objectives() computes SI-SNR loss with permutation-invariant training (PIT)
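
The control flow of a guarded update step can be sketched without any framework. The code below is a plain-Python stand-in, not the recipe's fit_batch(): `guarded_step`, the plain SGD update, and the default thresholds are assumptions made for illustration. In PyTorch the clipping would typically be done with `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale gradients so their global L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


def guarded_step(params, loss, grads, lr=0.1, max_norm=5.0):
    """Apply one SGD update, or skip the batch if the loss is NaN or infinite.

    Returns (new_params, applied), where `applied` is False for skipped batches.
    """
    if not math.isfinite(loss):
        return list(params), False  # leave the parameters untouched
    clipped = clip_by_global_norm(grads, max_norm)
    return [p - lr * g for p, g in zip(params, clipped)], True


# A NaN loss leaves the parameters unchanged.
skipped_params, applied_nan = guarded_step([1.0, 2.0], float("nan"), [1.0, 1.0])

# A finite loss with a large gradient is clipped to norm 1 before the update.
new_params, applied_ok = guarded_step([0.0, 0.0], 1.0, [3.0, 4.0], lr=1.0, max_norm=1.0)
```

Skipping the batch instead of raising keeps a long training run alive when an occasional mixture produces a degenerate loss.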

Step 5: Permutation Invariant Training

Execute training with permutation-invariant loss computation. Since the assignment of estimated sources to reference sources is unknown, the loss function evaluates all possible permutation orderings and selects the one with the lowest total loss. This ensures the model is not penalized for outputting sources in a different order than the reference.

Key considerations:

SI-SNR is computed for each possible source-to-reference assignment
The optimal permutation is selected per batch element
The negative SI-SNR is used as the loss (maximizing SI-SNR = minimizing negative SI-SNR)
Validation tracks SI-SNR improvement (SI-SNRi) over the input mixture
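
The permutation search can be sketched for a single batch element as below. The example assumes a pairwise matrix `scores`, where `scores[i][j]` is the SI-SNR in dB of estimate i against reference j, has already been computed; the function name and the toy values are illustrative.

```python
from itertools import permutations


def best_permutation(scores):
    """Find the estimate-to-reference assignment with the highest mean SI-SNR.

    Returns (loss, perm), where perm[i] is the reference index assigned to
    estimate i and loss is the negative mean SI-SNR of that assignment.
    """
    n = len(scores)
    best_perm, best_mean = None, float("-inf")
    for perm in permutations(range(n)):
        mean_score = sum(scores[i][perm[i]] for i in range(n)) / n
        if mean_score > best_mean:
            best_perm, best_mean = perm, mean_score
    return -best_mean, best_perm


# Toy values: estimate 0 matches reference 1, and estimate 1 matches reference 0.
scores = [
    [1.0, 12.0],
    [10.0, 2.0],
]
loss, perm = best_permutation(scores)
```

For a batch, the same search would be run on each element and the resulting losses averaged. The exhaustive search visits n! permutations, which is inexpensive for the 2- and 3-speaker cases covered here.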

Step 6: Evaluation and Source Quality Assessment

Evaluate the trained model by separating test mixtures and computing SI-SNR improvement over the unseparated mixture. The best checkpoint is loaded based on validation SI-SNR, and test results are reported per utterance. Optionally, separated audio files can be saved for subjective listening evaluation.

Key considerations:

SI-SNRi measures improvement over the input mixture baseline
Results are computed per utterance and averaged across the test set
The separated waveforms can be saved for further analysis or downstream processing
Evaluation uses the same permutation-invariant assignment as training
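
A minimal sketch of the metric itself, using plain Python lists as signals. It follows the common definition of SI-SNR (mean removal, projection of the estimate onto the reference, energy ratio in dB); the small `eps` constant and the function names are assumptions for the example, and the toy signals are chosen so the result is easy to verify by hand.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample lists."""
    est_mean = sum(estimate) / len(estimate)
    ref_mean = sum(reference) / len(reference)
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    target = [dot / ref_energy * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise) + eps
    return 10 * math.log10(target_energy / noise_energy + eps)


def si_snri(estimate, reference, mixture):
    """Improvement of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, reference) - si_snr(mixture, reference)


reference = [1.0, -1.0, 1.0, -1.0]
interferer = [1.0, 1.0, -1.0, -1.0]  # orthogonal to the reference
mixture = [r + i for r, i in zip(reference, interferer)]
estimate = [r + 0.1 * i for r, i in zip(reference, interferer)]

improvement = si_snri(estimate, reference, mixture)

# Per-utterance values are averaged to report a test-set figure.
per_utterance = [improvement]
average = sum(per_utterance) / len(per_utterance)
```

In this toy case the mixture sits at roughly 0 dB and the estimate at roughly 20 dB, so the improvement is close to 20 dB. Rescaling the estimate by a constant leaves the value essentially unchanged, which is the scale invariance the metric is named for.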
