Principle:Speechbrain Speechbrain Source Separation Evaluation
| Field | Value |
|---|---|
| Principle Name | Source_Separation_Evaluation |
| Domain(s) | Evaluation_Metrics, Speech_Separation |
| Description | Evaluating speech separation quality using signal-level metrics and improvement ratios |
| Knowledge Sources | Vincent et al. 2006 "Performance Measurement in Blind Audio Source Separation" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Separation_Save_Results |
Overview
Evaluating the quality of speech separation requires objective metrics that quantify how faithfully the separated signals match the ground-truth clean sources. The standard approach uses Signal-to-Distortion Ratio (SDR) and Scale-Invariant Signal-to-Noise Ratio (SI-SNR), along with their improvement variants (SDRi, SI-SNRi) that measure gain over the unprocessed mixture.
Theoretical Foundation
Signal-to-Distortion Ratio (SDR)
SDR is part of the BSS_EVAL framework (Vincent et al. 2006) and decomposes the estimation error into interference, noise, and artifact components:
s_hat = s_target + e_interf + e_noise + e_artif
SDR = 10 * log10(||s_target||^2 / ||e_interf + e_noise + e_artif||^2)
SDR is computed using the mir_eval.separation.bss_eval_sources function, which implements the least-squares projection approach to decompose the estimated signal.
Scale-Invariant Signal-to-Noise Ratio (SI-SNR)
SI-SNR (also known as SI-SDR) provides a simpler, scale-invariant metric:
s_target = (<s_hat, s> / ||s||^2) * s
e_noise = s_hat - s_target
SI-SNR = 10 * log10(||s_target||^2 / ||e_noise||^2)
SI-SNR is preferred in modern speech separation research because:
- It is invariant to the scale (amplitude) of the estimated signal
- It does not require the complex decomposition used in BSS_EVAL
- It is differentiable, making it suitable as both a training loss and an evaluation metric
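The definition above can be sketched in NumPy. The zero-mean step is a common convention in SI-SNR implementations and is an assumption here, as is the `eps` guard against division by zero:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB for 1-D signals (sketch of the equations above)."""
    # Zero-mean both signals (a common convention in SI-SNR implementations).
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: s_target = (<est, ref> / ||ref||^2) * ref
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)

# Scale invariance: rescaling the estimate leaves the metric (nearly) unchanged.
print(si_snr(est, ref), si_snr(5.0 * est, ref))
```

Because every step is a differentiable tensor operation, the same formula works as a training loss when written in a framework such as PyTorch.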
Improvement Metrics
Absolute metrics alone are insufficient because they do not account for the inherent difficulty of separating a particular mixture. Improvement metrics measure how much better the separated signal is compared to the original (unseparated) mixture:
SI-SNRi = SI-SNR(s, s_hat) - SI-SNR(s, mixture)
SDRi = SDR(s, s_hat) - SDR(s, mixture)
where the baseline is computed by treating the mixture itself as the "estimate" for each source. This normalization allows fair comparison across mixtures of different difficulty levels.
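A sketch of this baseline computation, reusing a compact SI-SNR helper on synthetic signals (a clean source plus an interferer):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    est, ref = est - est.mean(), ref - ref.mean()
    s_t = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_n = est - s_t
    return 10.0 * np.log10((np.dot(s_t, s_t) + eps) / (np.dot(e_n, e_n) + eps))

rng = np.random.default_rng(1)
src = rng.standard_normal(8000)
interferer = rng.standard_normal(8000)
mix = src + interferer                       # unprocessed mixture (~0 dB SI-SNR)
est = src + 0.1 * rng.standard_normal(8000)  # a "good" separation estimate

# Baseline: score the mixture itself as if it were the estimate for this source.
si_snri = si_snr(est, src) - si_snr(mix, src)
print(f"SI-SNRi: {si_snri:.1f} dB")
```

A hard mixture (strong interferer) starts from a low baseline, so the same absolute SI-SNR yields a larger improvement, which is exactly the normalization the section describes.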
Interpretation of Metrics
| Metric | Unit | Higher Is Better | Typical Range |
|---|---|---|---|
| SI-SNR | dB | Yes | 5-20 dB for well-trained models |
| SI-SNRi | dB | Yes | 8-15 dB improvement |
| SDR | dB | Yes | 5-20 dB for well-trained models |
| SDRi | dB | Yes | 8-15 dB improvement |
Per-Utterance Evaluation
Results are computed and saved per-utterance to a CSV file, enabling:
- Statistical analysis of performance distribution
- Identification of failure cases (low-metric examples)
- Stratification by speaker, duration, or mixture conditions
The final aggregation reports mean values across the entire test set.
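The per-utterance bookkeeping might look like the following; the file name, column names, and values are illustrative assumptions, not SpeechBrain's exact CSV schema:

```python
import csv
import statistics

# Hypothetical per-utterance results (snt_id plus metrics in dB).
results = [
    {"snt_id": "mix_001", "si-snr": 14.2, "si-snri": 11.8, "sdr": 15.0, "sdri": 12.1},
    {"snt_id": "mix_002", "si-snr": 9.7,  "si-snri": 7.3,  "sdr": 10.4, "sdri": 7.9},
]

# One row per test utterance, enabling later stratification and failure analysis.
with open("test_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(results[0]))
    writer.writeheader()
    writer.writerows(results)

# Final aggregation: mean over the whole test set.
mean_si_snri = statistics.mean(r["si-snri"] for r in results)
print(f"Mean SI-SNRi: {mean_si_snri:.2f} dB")
```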
Audio Saving for Listening Tests
In addition to numeric metrics, the evaluation saves separated audio files for qualitative assessment through listening tests:
- Estimated sources: {snt_id}_source{N}hat.wav -- the model predictions
- Ground truth sources: {snt_id}_source{N}.wav -- the clean references
- Mixture: {snt_id}_mix.wav -- the input mixture
All saved audio is peak-normalized to prevent clipping, and saved at the configured sample rate.
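Peak normalization before saving can be sketched as follows; the 0.99 headroom value is an assumption for illustration, not SpeechBrain's exact constant:

```python
import numpy as np

def peak_normalize(x: np.ndarray, peak: float = 0.99) -> np.ndarray:
    """Rescale so the maximum |sample| equals `peak`, preventing clipping on save."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x  # silent signal: nothing to normalize
    return x * (peak / max_abs)

audio = np.array([0.5, -2.0, 1.0])  # would clip if written as-is
print(peak_normalize(audio))        # largest |sample| is now 0.99
```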
Evaluation Workflow
The complete evaluation workflow proceeds as follows:
- Load test data using the same data pipeline as training
- For each test batch, run the separation model in inference mode (no gradients)
- Compute SI-SNR and SI-SNRi using the model's internal loss function
- Compute SDR and SDRi using mir_eval.separation.bss_eval_sources
- Write per-utterance results to a CSV file
- Optionally save separated audio to disk
- Compute and log aggregate (mean) metrics
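The steps above can be sketched as a loop over the test set; `model` and `make_example` are stubs standing in for the real SpeechBrain separator and data pipeline:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    est, ref = est - est.mean(), ref - ref.mean()
    s_t = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_n = est - s_t
    return 10.0 * np.log10((np.dot(s_t, s_t) + eps) / (np.dot(e_n, e_n) + eps))

rng = np.random.default_rng(2)

def make_example(snt_id):
    # Stub for the data pipeline: one clean source mixed with an interferer.
    src = rng.standard_normal(4000)
    mix = src + rng.standard_normal(4000)
    return snt_id, mix, src

def model(mix, src):
    # Stub "separator": pretends to recover the source with small residual noise.
    return src + 0.05 * rng.standard_normal(len(src))

rows = []
for snt_id, mix, src in (make_example(f"mix_{i:03d}") for i in range(3)):
    est = model(mix, src)  # inference only: no gradients in the real loop
    rows.append({"snt_id": snt_id,
                 "si-snr": si_snr(est, src),
                 "si-snri": si_snr(est, src) - si_snr(mix, src)})

print(f"Mean SI-SNRi: {np.mean([r['si-snri'] for r in rows]):.2f} dB")
```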
Key Considerations
- Permutation alignment: The PIT loss function handles permutation alignment during evaluation, just as during training
- Batch size 1: SDR computation via bss_eval_sources requires numpy arrays and is typically done one example at a time
- Negative SI-SNR convention: In SpeechBrain, the SI-SNR loss is negated for training (since optimizers minimize). During evaluation, results are re-negated to report the conventional positive-is-better SI-SNR
- Computational cost: SDR computation via mir_eval is significantly slower than SI-SNR, which is why SI-SNR is used as the training loss while SDR is only computed during evaluation