Principle:Speechbrain Speechbrain Source Separation Evaluation
| Field | Value |
|---|---|
| Principle Name | Source_Separation_Evaluation |
| Domain(s) | Evaluation_Metrics, Speech_Separation |
| Description | Evaluating speech separation quality using signal-level metrics and improvement ratios |
| Knowledge Sources | Vincent et al. 2006 "Performance Measurement in Blind Audio Source Separation" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Separation_Save_Results |
Overview
Evaluating the quality of speech separation requires objective metrics that quantify how faithfully the separated signals match the ground-truth clean sources. The standard approach uses Signal-to-Distortion Ratio (SDR) and Scale-Invariant Signal-to-Noise Ratio (SI-SNR), along with their improvement variants (SDRi, SI-SNRi) that measure gain over the unprocessed mixture.
Theoretical Foundation
Signal-to-Distortion Ratio (SDR)
SDR is part of the BSS_EVAL framework (Vincent et al. 2006) and decomposes the estimation error into interference, noise, and artifact components:
s_hat = s_target + e_interf + e_noise + e_artif
SDR = 10 * log10(||s_target||^2 / ||e_interf + e_noise + e_artif||^2)
SDR is computed using the mir_eval.separation.bss_eval_sources function, which implements the least-squares projection approach to decompose the estimated signal.
Scale-Invariant Signal-to-Noise Ratio (SI-SNR)
SI-SNR (also known as SI-SDR) provides a simpler, scale-invariant metric:
s_target = (<s_hat, s> / ||s||^2) * s
e_noise = s_hat - s_target
SI-SNR = 10 * log10(||s_target||^2 / ||e_noise||^2)
SI-SNR is preferred in modern speech separation research because:
- It is invariant to the scale (amplitude) of the estimated signal
- It does not require the complex decomposition used in BSS_EVAL
- It is differentiable, making it suitable as both a training loss and an evaluation metric
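The definition above can be sketched in NumPy. The zero-mean step is a common convention in SI-SNR implementations and is an assumption here, as is the `eps` guard against division by zero:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB for 1-D signals (sketch of the equations above)."""
    # Zero-mean both signals (a common convention in SI-SNR implementations).
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference: s_target = (<est, ref> / ||ref||^2) * ref
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps)
                           / (np.dot(e_noise, e_noise) + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
est = ref + 0.1 * rng.standard_normal(16000)

# Scale invariance: rescaling the estimate leaves the metric (nearly) unchanged.
print(si_snr(est, ref), si_snr(5.0 * est, ref))
```

Because every step is a differentiable tensor operation, the same formula works as a training loss when written in a framework such as PyTorch.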
Improvement Metrics
Absolute metrics alone are insufficient because they do not account for the inherent difficulty of separating a particular mixture. Improvement metrics measure how much better the separated signal is compared to the original (unseparated) mixture:
SI-SNRi = SI-SNR(s, s_hat) - SI-SNR(s, mixture)
SDRi = SDR(s, s_hat) - SDR(s, mixture)
where the baseline is computed by treating the mixture itself as the "estimate" for each source. This normalization allows fair comparison across mixtures of different difficulty levels.
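A sketch of this baseline computation, reusing a compact SI-SNR helper on synthetic signals (a clean source plus an interferer):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    est, ref = est - est.mean(), ref - ref.mean()
    s_t = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_n = est - s_t
    return 10.0 * np.log10((np.dot(s_t, s_t) + eps) / (np.dot(e_n, e_n) + eps))

rng = np.random.default_rng(1)
src = rng.standard_normal(8000)
interferer = rng.standard_normal(8000)
mix = src + interferer                       # unprocessed mixture (~0 dB SI-SNR)
est = src + 0.1 * rng.standard_normal(8000)  # a "good" separation estimate

# Baseline: score the mixture itself as if it were the estimate for this source.
si_snri = si_snr(est, src) - si_snr(mix, src)
print(f"SI-SNRi: {si_snri:.1f} dB")
```

A hard mixture (strong interferer) starts from a low baseline, so the same absolute SI-SNR yields a larger improvement, which is exactly the normalization the section describes.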
Interpretation of Metrics
| Metric | Unit | Higher Is Better | Typical Range |
|---|---|---|---|
| SI-SNR | dB | Yes | 5-20 dB for well-trained models |
| SI-SNRi | dB | Yes | 8-15 dB improvement |
| SDR | dB | Yes | 5-20 dB for well-trained models |
| SDRi | dB | Yes | 8-15 dB improvement |
Per-Utterance Evaluation
Results are computed and saved per-utterance to a CSV file, enabling:
- Statistical analysis of performance distribution
- Identification of failure cases (low-metric examples)
- Stratification by speaker, duration, or mixture conditions
The final aggregation reports mean values across the entire test set.
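The per-utterance bookkeeping might look like the following; the file name, column names, and values are illustrative assumptions, not SpeechBrain's exact CSV schema:

```python
import csv
import statistics

# Hypothetical per-utterance results (snt_id plus metrics in dB).
results = [
    {"snt_id": "mix_001", "si-snr": 14.2, "si-snri": 11.8, "sdr": 15.0, "sdri": 12.1},
    {"snt_id": "mix_002", "si-snr": 9.7,  "si-snri": 7.3,  "sdr": 10.4, "sdri": 7.9},
]

# One row per test utterance, enabling later stratification and failure analysis.
with open("test_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(results[0]))
    writer.writeheader()
    writer.writerows(results)

# Final aggregation: mean over the whole test set.
mean_si_snri = statistics.mean(r["si-snri"] for r in results)
print(f"Mean SI-SNRi: {mean_si_snri:.2f} dB")
```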
Audio Saving for Listening Tests
In addition to numeric metrics, the evaluation saves separated audio files for qualitative assessment through listening tests:
- Estimated sources: {snt_id}_source{N}hat.wav -- the model predictions
- Ground truth sources: {snt_id}_source{N}.wav -- the clean references
- Mixture: {snt_id}_mix.wav -- the input mixture
All saved audio is peak-normalized to prevent clipping, and saved at the configured sample rate.
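Peak normalization before saving can be sketched as follows; the 0.99 headroom value is an assumption for illustration, not SpeechBrain's exact constant:

```python
import numpy as np

def peak_normalize(x: np.ndarray, peak: float = 0.99) -> np.ndarray:
    """Rescale so the maximum |sample| equals `peak`, preventing clipping on save."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return x  # silent signal: nothing to normalize
    return x * (peak / max_abs)

audio = np.array([0.5, -2.0, 1.0])  # would clip if written as-is
print(peak_normalize(audio))        # largest |sample| is now 0.99
```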
Evaluation Workflow
The complete evaluation workflow proceeds as follows:
- Load test data using the same data pipeline as training
- For each test batch, run the separation model in inference mode (no gradients)
- Compute SI-SNR and SI-SNRi using the model's internal loss function
- Compute SDR and SDRi using mir_eval.separation.bss_eval_sources
- Write per-utterance results to a CSV file
- Optionally save separated audio to disk
- Compute and log aggregate (mean) metrics
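The steps above can be sketched as a loop over the test set; `model` and `make_example` are stubs standing in for the real SpeechBrain separator and data pipeline:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    est, ref = est - est.mean(), ref - ref.mean()
    s_t = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_n = est - s_t
    return 10.0 * np.log10((np.dot(s_t, s_t) + eps) / (np.dot(e_n, e_n) + eps))

rng = np.random.default_rng(2)

def make_example(snt_id):
    # Stub for the data pipeline: one clean source mixed with an interferer.
    src = rng.standard_normal(4000)
    mix = src + rng.standard_normal(4000)
    return snt_id, mix, src

def model(mix, src):
    # Stub "separator": pretends to recover the source with small residual noise.
    return src + 0.05 * rng.standard_normal(len(src))

rows = []
for snt_id, mix, src in (make_example(f"mix_{i:03d}") for i in range(3)):
    est = model(mix, src)  # inference only: no gradients in the real loop
    rows.append({"snt_id": snt_id,
                 "si-snr": si_snr(est, src),
                 "si-snri": si_snr(est, src) - si_snr(mix, src)})

print(f"Mean SI-SNRi: {np.mean([r['si-snri'] for r in rows]):.2f} dB")
```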
Key Considerations
- Permutation alignment: The PIT loss function handles permutation alignment during evaluation, just as during training
- Batch size 1: SDR computation via bss_eval_sources requires numpy arrays and is typically done one example at a time
- Negative SI-SNR convention: In SpeechBrain, the SI-SNR loss is negated for training (since optimizers minimize). During evaluation, results are re-negated to report the conventional positive-is-better SI-SNR
- Computational cost: SDR computation via mir_eval is significantly slower than SI-SNR, which is why SI-SNR is used as the training loss while SDR is only computed during evaluation