
Principle:Speechbrain Speechbrain Perceptual Quality Evaluation

From Leeroopedia


Property Value
Principle Name Perceptual_Quality_Evaluation
Workflow Speech_Enhancement_Training
Domains Evaluation_Metrics, Speech_Enhancement
Source Repository speechbrain/speechbrain
Knowledge Sources Hu & Loizou 2008 "Evaluation of Objective Quality Measures for Speech Enhancement"
Related Implementation Implementation:Speechbrain_Speechbrain_Composite_Eval_Metrics

Overview

Perceptual Quality Evaluation encompasses the set of objective metrics used to assess the quality of enhanced speech. These metrics serve dual purposes in the speech enhancement workflow: (1) as monitoring metrics during training to track improvement and select the best model checkpoint, and (2) as final evaluation metrics for reporting and comparing systems on benchmark datasets. Each metric captures a different aspect of perceptual quality, and no single metric fully represents human judgment.

Theoretical Background

Why Multiple Metrics?

Speech quality is a multi-dimensional concept. A single metric cannot capture all aspects of how humans perceive enhanced speech. The key dimensions are:

  • Signal distortion: How much has the speech signal itself been distorted by the enhancement process?
  • Background noise: How effectively has background noise been removed?
  • Overall quality: What is the holistic impression of the enhanced signal?
  • Intelligibility: Can a listener understand the words being spoken?

Different metrics focus on different dimensions, making a suite of complementary metrics essential for comprehensive evaluation.

PESQ (Perceptual Evaluation of Speech Quality)

PESQ (ITU-T Recommendation P.862) is the most widely used intrusive speech quality metric. It models the human auditory system to predict the Mean Opinion Score (MOS) that human listeners would assign.

Property Value
Range -0.5 to 4.5 (raw); typically 1.0 to 4.5 for enhanced speech
Mode Wideband (wb) for 16 kHz; Narrowband (nb) for 8 kHz
Interpretation Higher is better; 4.5 = indistinguishable from clean
Computation Compares time-aligned reference and degraded signals through perceptual model

PESQ processing pipeline:

  1. Level alignment of reference and degraded signals
  2. Time alignment using cross-correlation
  3. Auditory transform (Bark scale filterbank)
  4. Disturbance density computation
  5. Cognitive model aggregation

PESQ is the primary metric used in SpeechBrain for checkpoint selection (max_keys=["pesq"]) and is the training target for MetricGAN+.

STOI (Short-Time Objective Intelligibility)

STOI predicts speech intelligibility rather than quality. It measures how well the temporal envelope of speech is preserved across frequency bands.

Property Value
Range 0 to 1
Interpretation Higher is better; 1.0 = perfect intelligibility
Computation Correlation of short-time temporal envelopes in 1/3 octave bands
Strength Better predictor of intelligibility than PESQ
Limitation Less sensitive to non-linear distortions

STOI is particularly important for hearing aid applications where intelligibility matters more than subjective quality.

Composite Metrics (CSIG, CBAK, COVL)

The composite metrics introduced by Hu & Loizou (2008) are regression-based combinations of simpler signal-level measures. They predict subjective ratings on three specific quality dimensions:

CSIG (Signal Distortion)

Predicts the Mean Opinion Score for signal distortion (MOS-SIG):

CSIG = 3.093 - 1.029 * LLR + 0.603 * PESQ - 0.009 * WSS
Property Value
Range 1 to 5
Interpretation 5 = no signal distortion
Focus How much the speech signal itself has been damaged

CBAK (Background Noise)

Predicts the Mean Opinion Score for background intrusiveness (MOS-BAK):

CBAK = 1.634 + 0.478 * PESQ - 0.007 * WSS + 0.063 * segSNR
Property Value
Range 1 to 5
Interpretation 5 = no background noise audible
Focus How intrusive the remaining background noise is

COVL (Overall Quality)

Predicts the overall Mean Opinion Score (MOS-OVL):

COVL = 1.594 + 0.805 * PESQ - 0.512 * LLR - 0.007 * WSS
Property Value
Range 1 to 5
Interpretation 5 = perfect overall quality
Focus Holistic quality judgment

Underlying Sub-Metrics

The composite metrics are derived from PESQ together with three signal-level measures:

  • WSS (Weighted Spectral Slope): Measures spectral distortion using critical-band weighted spectral slope differences. Lower is better.
  • LLR (Log Likelihood Ratio): Measures spectral envelope distortion using LPC analysis. Lower is better.
  • SSNR (Segmental Signal-to-Noise Ratio; written segSNR in the CBAK formula above): Frame-level SNR averaged across the utterance. Higher is better.
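The three regression equations above can be sketched as a small helper. The input values in the example are hypothetical, and the clipping to the 1-5 MOS scale follows the common practice in composite-measure implementations:

```python
def composite_scores(pesq_score, llr, wss, seg_snr):
    """Hu & Loizou (2008) composite measures from PESQ and signal-level sub-metrics."""
    clip = lambda x: min(5.0, max(1.0, x))  # keep predictions on the 1-5 MOS scale
    csig = clip(3.093 - 1.029 * llr + 0.603 * pesq_score - 0.009 * wss)
    cbak = clip(1.634 + 0.478 * pesq_score - 0.007 * wss + 0.063 * seg_snr)
    covl = clip(1.594 + 0.805 * pesq_score - 0.512 * llr - 0.007 * wss)
    return csig, cbak, covl

# Hypothetical sub-metric values for one enhanced utterance:
csig, cbak, covl = composite_scores(pesq_score=2.5, llr=0.8, wss=40.0, seg_snr=5.0)
```

Note how each composite weights the sub-metrics differently: CBAK is the only one rewarding segmental SNR (residual-noise suppression), while CSIG and COVL penalize LLR (spectral-envelope damage).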

DNSMOS (Deep Noise Suppression MOS)

DNSMOS is a non-intrusive (reference-free) neural quality estimator developed by Microsoft for the DNS Challenge. Unlike PESQ and STOI, it does not require the clean reference signal.

Property Value
Range 1 to 5
Model ONNX neural network with polynomial calibration
Sub-scores SIG (signal), BAK (background), OVRL (overall)
Advantage No clean reference needed; fast batch evaluation
Limitation Less precise than intrusive metrics for controlled evaluations

DNSMOS uses an ONNX model that processes 9-second audio segments and outputs raw scores that are calibrated via polynomial fitting:

import numpy as np

# Second-order calibration polynomials (numpy poly1d; coefficients from the DNSMOS release)
p_ovr = np.poly1d([-0.06766283, 1.11546468, 0.04602535])
p_sig = np.poly1d([-0.08397278, 1.22083953, 0.0052439])
p_bak = np.poly1d([-0.13166888, 1.60915514, -0.39604546])
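The calibration step itself is a direct polynomial evaluation. A self-contained sketch (the polynomials are redefined here so the snippet runs stand-alone, and the raw network outputs are hypothetical values):

```python
import numpy as np

# Calibration polynomials with the DNSMOS coefficients listed above
p_ovr = np.poly1d([-0.06766283, 1.11546468, 0.04602535])
p_sig = np.poly1d([-0.08397278, 1.22083953, 0.0052439])
p_bak = np.poly1d([-0.13166888, 1.60915514, -0.39604546])

# Hypothetical raw network outputs for one 9-second segment:
raw_sig, raw_bak, raw_ovr = 3.1, 3.8, 3.0

# Map raw scores onto the calibrated MOS scale
mos_sig = float(p_sig(raw_sig))
mos_bak = float(p_bak(raw_bak))
mos_ovr = float(p_ovr(raw_ovr))
```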

Metric Selection Guidelines

Use Case Recommended Metrics Rationale
Model checkpoint selection PESQ Best single-metric predictor of quality
Comprehensive benchmarking PESQ + STOI + CSIG + CBAK + COVL Covers quality, intelligibility, and sub-dimensions
Intelligibility-focused tasks STOI Directly predicts word recognition
Reference-free evaluation DNSMOS When clean reference is unavailable
MetricGAN+ training target PESQ or STOI Discriminator learns to predict these

Relationship to Training

The evaluation metrics connect to the training workflow in several ways:

  1. Training monitoring: PESQ and STOI are computed on the validation set after each epoch, providing feedback on training progress
  2. Checkpoint selection: The best model checkpoint is selected based on validation PESQ, not training loss
  3. MetricGAN+ target: PESQ (or STOI) scores serve as training targets for the discriminator in GAN-based training
  4. Final reporting: All metrics are computed on the held-out test set using the best checkpoint
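The checkpoint-selection logic in item 2 reduces to a best-so-far comparison on validation PESQ. A minimal stand-alone sketch (in SpeechBrain itself this is handled by the checkpointer's max_keys=["pesq"] option; the history values below are hypothetical):

```python
def select_best_checkpoint(epoch_metrics):
    """Pick the epoch with the highest validation PESQ, not the lowest training loss."""
    return max(epoch_metrics, key=lambda m: m["valid_pesq"])

# Hypothetical per-epoch validation results:
history = [
    {"epoch": 1, "valid_pesq": 2.43, "train_loss": 0.30},
    {"epoch": 2, "valid_pesq": 2.71, "train_loss": 0.21},
    {"epoch": 3, "valid_pesq": 2.65, "train_loss": 0.15},  # loss keeps falling, PESQ drops
]
best = select_best_checkpoint(history)  # epoch 2, despite epoch 3's lower loss
```

Epoch 3 shows why PESQ rather than training loss drives selection: the loss keeps improving while perceptual quality has already peaked.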
