Principle:Speechbrain Speechbrain Perceptual Quality Evaluation
| Property | Value |
|---|---|
| Principle Name | Perceptual_Quality_Evaluation |
| Workflow | Speech_Enhancement_Training |
| Domains | Evaluation_Metrics, Speech_Enhancement |
| Source Repository | speechbrain/speechbrain |
| Knowledge Sources | Hu & Loizou 2008 "Evaluation of Objective Quality Measures for Speech Enhancement" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Composite_Eval_Metrics |
Overview
Perceptual Quality Evaluation encompasses the set of objective metrics used to assess the quality of enhanced speech. These metrics serve dual purposes in the speech enhancement workflow: (1) as monitoring metrics during training to track improvement and select the best model checkpoint, and (2) as final evaluation metrics for reporting and comparing systems on benchmark datasets. Each metric captures a different aspect of perceptual quality, and no single metric fully represents human judgment.
Theoretical Background
Why Multiple Metrics?
Speech quality is a multi-dimensional concept. A single metric cannot capture all aspects of how humans perceive enhanced speech. The key dimensions are:
- Signal distortion: How much has the speech signal itself been distorted by the enhancement process?
- Background noise: How effectively has background noise been removed?
- Overall quality: What is the holistic impression of the enhanced signal?
- Intelligibility: Can a listener understand the words being spoken?
Different metrics focus on different dimensions, making a suite of complementary metrics essential for comprehensive evaluation.
PESQ (Perceptual Evaluation of Speech Quality)
PESQ (ITU-T Recommendation P.862) is the most widely used intrusive speech quality metric. It models the human auditory system to predict the Mean Opinion Score (MOS) that human listeners would assign.
| Property | Value |
|---|---|
| Range | -0.5 to 4.5 (raw); typically 1.0 to 4.5 for enhanced speech |
| Mode | Wideband (wb) for 16 kHz; Narrowband (nb) for 8 kHz |
| Interpretation | Higher is better; 4.5 = indistinguishable from clean |
| Computation | Compares time-aligned reference and degraded signals through perceptual model |
PESQ processing pipeline:
- Level alignment of reference and degraded signals
- Time alignment using cross-correlation
- Auditory transform (Bark scale filterbank)
- Disturbance density computation
- Cognitive model aggregation
PESQ is the primary metric used in SpeechBrain for checkpoint selection (`max_keys=["pesq"]`) and is the training target for MetricGAN+.
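The effect of selecting checkpoints by maximum validation PESQ (rather than minimum training loss) can be sketched as follows; the `validation_stats` structure and `best_checkpoint` helper are illustrative, not SpeechBrain API:

```python
# Sketch of PESQ-based checkpoint selection (mirrors the intent of
# max_keys=["pesq"]; all names here are hypothetical).

def best_checkpoint(validation_stats):
    """Pick the epoch with the highest validation PESQ.

    validation_stats: list of dicts like {"epoch": 1, "pesq": 2.1, "loss": 0.2}
    """
    return max(validation_stats, key=lambda s: s["pesq"])

stats = [
    {"epoch": 1, "pesq": 2.10, "loss": 0.20},
    {"epoch": 2, "pesq": 2.35, "loss": 0.15},
    {"epoch": 3, "pesq": 2.28, "loss": 0.11},  # lowest loss, but worse PESQ
]
best = best_checkpoint(stats)  # epoch 2 is kept despite epoch 3's lower loss
```

Note that the epoch with the lowest loss is not selected: loss and perceptual quality are correlated but not identical, which is exactly why PESQ is monitored separately.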
STOI (Short-Time Objective Intelligibility)
STOI predicts speech intelligibility rather than quality. It measures how well the temporal envelope of speech is preserved across frequency bands.
| Property | Value |
|---|---|
| Range | 0 to 1 |
| Interpretation | Higher is better; 1.0 = perfect intelligibility |
| Computation | Correlation of short-time temporal envelopes in 1/3 octave bands |
| Strength | Better predictor of intelligibility than PESQ |
| Limitation | Less sensitive to non-linear distortions |
STOI is particularly important for hearing aid applications where intelligibility matters more than subjective quality.
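As a toy illustration of the core idea behind STOI, the snippet below computes the Pearson correlation between a clean and a processed temporal envelope in a single band. The real metric does this over short-time segments in 1/3-octave bands with clipping and normalization; the function and data here are purely illustrative:

```python
import math

def envelope_correlation(clean_env, proc_env):
    """Pearson correlation of two temporal envelopes (toy, single band)."""
    n = len(clean_env)
    mc = sum(clean_env) / n
    mp = sum(proc_env) / n
    num = sum((c - mc) * (p - mp) for c, p in zip(clean_env, proc_env))
    den = math.sqrt(sum((c - mc) ** 2 for c in clean_env)
                    * sum((p - mp) ** 2 for p in proc_env))
    return num / den

clean = [0.1, 0.8, 0.5, 0.9, 0.2]
preserved = envelope_correlation(clean, clean)  # identical envelope: correlation 1
degraded = envelope_correlation(clean, [0.3, 0.7, 0.6, 0.8, 0.4])  # slightly lower
```

An enhancement system that flattens or smears these envelopes lowers the correlation, and hence the predicted intelligibility, even if overall energy is preserved.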
Composite Metrics (CSIG, CBAK, COVL)
The composite metrics introduced by Hu & Loizou (2008) are regression-based combinations of simpler signal-level measures. They predict subjective ratings on three specific quality dimensions:
CSIG (Signal Distortion)
Predicts the Mean Opinion Score for signal distortion (MOS-SIG):
CSIG = 3.093 - 1.029 * LLR + 0.603 * PESQ - 0.009 * WSS
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Interpretation | 5 = no signal distortion |
| Focus | How much the speech signal itself has been damaged |
CBAK (Background Noise)
Predicts the Mean Opinion Score for background intrusiveness (MOS-BAK):
CBAK = 1.634 + 0.478 * PESQ - 0.007 * WSS + 0.063 * segSNR
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Interpretation | 5 = no background noise audible |
| Focus | How intrusive the remaining background noise is |
COVL (Overall Quality)
Predicts the overall Mean Opinion Score (MOS-OVL):
COVL = 1.594 + 0.805 * PESQ - 0.512 * LLR - 0.007 * WSS
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Interpretation | 5 = perfect overall quality |
| Focus | Holistic quality judgment |
Underlying Sub-Metrics
The composite metrics are derived from three signal-level measures:
- WSS (Weighted Spectral Slope): Measures spectral distortion using critical-band weighted spectral slope differences. Lower is better.
- LLR (Log Likelihood Ratio): Measures spectral envelope distortion using LPC analysis. Lower is better.
- SSNR (Segmental Signal-to-Noise Ratio; the segSNR term in the CBAK formula): Frame-level SNR averaged across the utterance. Higher is better.
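The three composite formulas above translate directly into Python. The clipping of scores to the reported 1 to 5 range follows common reference implementations of Hu & Loizou's measures; the function names are ours:

```python
def _clip(x, lo=1.0, hi=5.0):
    """Composite scores are reported on a 1-5 MOS scale."""
    return max(lo, min(hi, x))

def csig(llr, pesq, wss):
    """MOS-SIG prediction: signal distortion (5 = no distortion)."""
    return _clip(3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wss)

def cbak(pesq, wss, segsnr):
    """MOS-BAK prediction: background intrusiveness (5 = no audible noise)."""
    return _clip(1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * segsnr)

def covl(pesq, llr, wss):
    """MOS-OVL prediction: overall quality (5 = perfect)."""
    return _clip(1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss)

# Example: sub-metric values for a moderately enhanced utterance
sig = csig(llr=0.5, pesq=2.0, wss=40.0)
bak = cbak(pesq=2.0, wss=40.0, segsnr=5.0)
ovl = covl(pesq=2.0, llr=0.5, wss=40.0)
```

Note how PESQ enters all three regressions with a positive weight, while WSS and LLR (where lower is better) enter with negative weights.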
DNSMOS (Deep Noise Suppression MOS)
DNSMOS is a non-intrusive (reference-free) neural quality estimator developed by Microsoft for the DNS Challenge. Unlike PESQ and STOI, it does not require the clean reference signal.
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Model | ONNX neural network with polynomial calibration |
| Sub-scores | SIG (signal), BAK (background), OVRL (overall) |
| Advantage | No clean reference needed; fast batch evaluation |
| Limitation | Less precise than intrusive metrics for controlled evaluations |
DNSMOS uses an ONNX model that processes 9-second audio segments and outputs raw scores that are calibrated via polynomial fitting:
p_ovr = poly1d([-0.06766283, 1.11546468, 0.04602535])
p_sig = poly1d([-0.08397278, 1.22083953, 0.0052439])
p_bak = poly1d([-0.13166888, 1.60915514, -0.39604546])
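The poly1d coefficients are listed highest degree first, so each calibration is a quadratic in the raw network score. A dependency-free sketch of the mapping (Horner evaluation, same convention as numpy.poly1d):

```python
def calibrate(raw, coeffs):
    """Evaluate a polynomial given highest-degree-first coefficients
    (the numpy.poly1d convention), using Horner's method."""
    result = 0.0
    for c in coeffs:
        result = result * raw + c
    return result

# Calibration coefficients from the DNSMOS polynomial fits above
P_OVR = (-0.06766283, 1.11546468, 0.04602535)
P_SIG = (-0.08397278, 1.22083953, 0.0052439)
P_BAK = (-0.13166888, 1.60915514, -0.39604546)

# A raw OVRL network output of 3.0 maps to a calibrated score around 2.78
mos_ovr = calibrate(3.0, P_OVR)
```

The small negative quadratic terms compress the top of the scale, pulling optimistic raw predictions closer to the subjective MOS distribution the model was fitted against.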
Metric Selection Guidelines
| Use Case | Recommended Metrics | Rationale |
|---|---|---|
| Model checkpoint selection | PESQ | Best single-metric predictor of quality |
| Comprehensive benchmarking | PESQ + STOI + CSIG + CBAK + COVL | Covers quality, intelligibility, and sub-dimensions |
| Intelligibility-focused tasks | STOI | Directly predicts word recognition |
| Reference-free evaluation | DNSMOS | When clean reference is unavailable |
| MetricGAN+ training target | PESQ or STOI | Discriminator learns to predict these |
Relationship to Training
The evaluation metrics connect to the training workflow in several ways:
- Training monitoring: PESQ and STOI are computed on the validation set after each epoch, providing feedback on training progress
- Checkpoint selection: The best model checkpoint is selected based on validation PESQ, not training loss
- MetricGAN+ target: PESQ (or STOI) scores serve as training targets for the discriminator in GAN-based training
- Final reporting: All metrics are computed on the held-out test set using the best checkpoint
See Also
- Implementation:Speechbrain_Speechbrain_Composite_Eval_Metrics -- The concrete implementations of composite metrics, PESQ, STOI, and DNSMOS
- Principle:Speechbrain_Speechbrain_GAN_Based_Enhancement_Training -- How PESQ is used as a training target in MetricGAN+
- Principle:Speechbrain_Speechbrain_Conventional_Enhancement_Training -- How metrics are used for monitoring in conventional training