
Principle:Speechbrain Speechbrain Source Separation Evaluation

From Leeroopedia


Field                   Value
Principle Name          Source_Separation_Evaluation
Domain(s)               Evaluation_Metrics, Speech_Separation
Description             Evaluating speech separation quality using signal-level metrics and improvement ratios
Knowledge Sources       Vincent et al. 2006, "Performance Measurement in Blind Audio Source Separation"
Related Implementation  Implementation:Speechbrain_Speechbrain_Separation_Save_Results

Overview

Evaluating the quality of speech separation requires objective metrics that quantify how faithfully the separated signals match the ground-truth clean sources. The standard approach uses Signal-to-Distortion Ratio (SDR) and Scale-Invariant Signal-to-Noise Ratio (SI-SNR), along with their improvement variants (SDRi, SI-SNRi) that measure gain over the unprocessed mixture.

Theoretical Foundation

Signal-to-Distortion Ratio (SDR)

SDR is part of the BSS_EVAL framework (Vincent et al. 2006), which decomposes the estimated signal into a target component plus interference, noise, and artifact error terms:

s_hat = s_target + e_interf + e_noise + e_artif
SDR = 10 * log10(||s_target||^2 / ||e_interf + e_noise + e_artif||^2)

SDR is computed using the mir_eval.separation.bss_eval_sources function, which implements the least-squares projection approach to decompose the estimated signal.

Scale-Invariant Signal-to-Noise Ratio (SI-SNR)

SI-SNR (also known as SI-SDR) provides a simpler, scale-invariant metric:

s_target = (<s_hat, s> / ||s||^2) * s
e_noise  = s_hat - s_target
SI-SNR   = 10 * log10(||s_target||^2 / ||e_noise||^2)

SI-SNR is preferred in modern speech separation research because:

  • It is invariant to the scale (amplitude) of the estimated signal
  • It does not require the complex decomposition used in BSS_EVAL
  • It is differentiable, making it suitable as both a training loss and an evaluation metric
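The projection above can be written directly in NumPy. This is an illustrative sketch, not SpeechBrain's implementation (which operates on batched tensors):

```python
import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """SI-SNR in dB, following the projection formula above."""
    # Zero-mean both signals before projecting (standard practice)
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # s_target = (<s_hat, s> / ||s||^2) * s
    s_target = np.dot(estimate, reference) / (np.dot(reference, reference) + eps) * reference
    e_noise = estimate - s_target
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum(e_noise ** 2) + eps))
```

Rescaling the estimate (e.g. `si_snr(3.0 * est, ref)`) leaves the score essentially unchanged, which is exactly the scale-invariance property listed above.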

Improvement Metrics

Absolute metrics alone are insufficient because they do not account for the inherent difficulty of separating a particular mixture. Improvement metrics measure how much better the separated signal is compared to the original (unseparated) mixture:

SI-SNRi = SI-SNR(s, s_hat) - SI-SNR(s, mixture)
SDRi    = SDR(s, s_hat) - SDR(s, mixture)

where the baseline is computed by treating the mixture itself as the "estimate" for each source. This normalization allows fair comparison across mixtures of different difficulty levels.
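A self-contained sketch of this baseline computation; the signals and the SI-SNR helper here are illustrative:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    # Scale-invariant SNR in dB (projection form)
    est, ref = est - est.mean(), ref - ref.mean()
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum((est - s_target) ** 2) + eps))

rng = np.random.default_rng(0)
s1 = np.sin(np.linspace(0.0, 200.0, 16000))          # clean target source
s2 = 0.5 * rng.standard_normal(16000)                # interfering source
mixture = s1 + s2                                    # unprocessed input
estimate = s1 + 0.05 * rng.standard_normal(16000)    # hypothetical model output

# Baseline: treat the mixture itself as the "estimate" for the target source
si_snr_i = si_snr(estimate, s1) - si_snr(mixture, s1)
```

A hard mixture (strong interference) drags both terms down together, so the improvement score still reflects what the model actually contributed.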

Interpretation of Metrics

Metric   Unit  Higher Is Better  Typical Range
SI-SNR   dB    Yes               5-20 dB for well-trained models
SI-SNRi  dB    Yes               8-15 dB improvement
SDR      dB    Yes               5-20 dB for well-trained models
SDRi     dB    Yes               8-15 dB improvement

Per-Utterance Evaluation

Results are computed and saved per-utterance to a CSV file, enabling:

  • Statistical analysis of performance distribution
  • Identification of failure cases (low-metric examples)
  • Stratification by speaker, duration, or mixture conditions

The final aggregation reports mean values across the entire test set.
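A minimal sketch of this per-utterance bookkeeping using the standard library; the column names are illustrative, not necessarily the exact headers the SpeechBrain recipe writes:

```python
import csv
import statistics

# Hypothetical per-utterance scores (dB)
rows = [
    {"snt_id": "mix_0001", "si_snr": 12.3, "si_snr_i": 13.1, "sdr": 12.9, "sdr_i": 13.4},
    {"snt_id": "mix_0002", "si_snr": 8.7, "si_snr_i": 9.5, "sdr": 9.2, "sdr_i": 9.9},
]

# One CSV row per utterance enables the stratified analyses listed above
with open("test_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# Final aggregation: mean over the whole test set
mean_si_snr_i = statistics.mean(r["si_snr_i"] for r in rows)
```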

Audio Saving for Listening Tests

In addition to numeric metrics, the evaluation saves separated audio files for qualitative assessment through listening tests:

  • Estimated sources: {snt_id}_source{N}hat.wav -- the model predictions
  • Ground truth sources: {snt_id}_source{N}.wav -- the clean references
  • Mixture: {snt_id}_mix.wav -- the input mixture

All saved audio is peak-normalized to prevent clipping and written at the configured sample rate.
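Peak normalization is a one-line scaling; the helper below is an illustrative sketch (the actual file-writing call, e.g. via torchaudio, is omitted):

```python
import numpy as np

def peak_normalize(signal: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale so the largest absolute sample is at most 1.0, preventing
    clipping when the waveform is written to an audio file."""
    return signal / (np.max(np.abs(signal)) + eps)
```

Note that this rescaling changes the absolute level of each file, which is harmless for listening tests but means saved files should not be re-scored against un-normalized references with level-sensitive metrics.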

Evaluation Workflow

The complete evaluation workflow proceeds as follows:

  1. Load test data using the same data pipeline as training
  2. For each test batch, run the separation model in inference mode (no gradients)
  3. Compute SI-SNR and SI-SNRi using the model's internal loss function
  4. Compute SDR and SDRi using mir_eval.separation.bss_eval_sources
  5. Write per-utterance results to a CSV file
  6. Optionally save separated audio to disk
  7. Compute and log aggregate (mean) metrics
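The loop above can be sketched end to end with stand-ins for the model and metric. Everything here is illustrative (SpeechBrain's recipe works on batched torch tensors and handles permutation via the PIT loss, which this sketch omits):

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    # Scale-invariant SNR in dB (projection form)
    est, ref = est - est.mean(), ref - ref.mean()
    s_target = np.dot(est, ref) / (np.dot(ref, ref) + eps) * ref
    return 10.0 * np.log10((np.sum(s_target ** 2) + eps) / (np.sum((est - s_target) ** 2) + eps))

def evaluate(test_set, separate):
    """Steps 2-5 for one metric; `separate` stands in for the trained model."""
    results = []
    for snt_id, mixture, sources in test_set:            # per-utterance loop
        estimates = separate(mixture)                    # inference, no gradients
        scores = [si_snr(e, s) for e, s in zip(estimates, sources)]
        base = [si_snr(mixture, s) for s in sources]     # mixture as baseline
        results.append({"snt_id": snt_id,
                        "si_snr": float(np.mean(scores)),
                        "si_snr_i": float(np.mean(scores) - np.mean(base))})
    return results  # step 7 would average these and write them to CSV

# Toy run with an "oracle" separator that returns the clean sources
rng = np.random.default_rng(0)
srcs = [np.sin(np.linspace(0.0, 100.0, 8000)), 0.5 * rng.standard_normal(8000)]
test_set = [("utt1", srcs[0] + srcs[1], srcs)]
out = evaluate(test_set, lambda mix: srcs)
```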

Key Considerations

  • Permutation alignment: The PIT loss function handles permutation alignment during evaluation, just as during training
  • Batch size 1: SDR computation via bss_eval_sources requires numpy arrays and is typically done one example at a time
  • Negative SI-SNR convention: In SpeechBrain, the SI-SNR loss is negated for training (since optimizers minimize). During evaluation, results are re-negated to report the conventional positive-is-better SI-SNR
  • Computational cost: SDR computation via mir_eval is significantly slower than SI-SNR, which is why SI-SNR is used as the training loss while SDR is only computed during evaluation
