Principle:Speechbrain Speechbrain Perceptual Quality Evaluation
| Property | Value |
|---|---|
| Principle Name | Perceptual_Quality_Evaluation |
| Workflow | Speech_Enhancement_Training |
| Domains | Evaluation_Metrics, Speech_Enhancement |
| Source Repository | speechbrain/speechbrain |
| Knowledge Sources | Hu & Loizou 2008 "Evaluation of Objective Quality Measures for Speech Enhancement" |
| Related Implementation | Implementation:Speechbrain_Speechbrain_Composite_Eval_Metrics |
Overview
Perceptual Quality Evaluation encompasses the set of objective metrics used to assess the quality of enhanced speech. These metrics serve dual purposes in the speech enhancement workflow: (1) as monitoring metrics during training to track improvement and select the best model checkpoint, and (2) as final evaluation metrics for reporting and comparing systems on benchmark datasets. Each metric captures a different aspect of perceptual quality, and no single metric fully represents human judgment.
Theoretical Background
Why Multiple Metrics?
Speech quality is a multi-dimensional concept. A single metric cannot capture all aspects of how humans perceive enhanced speech. The key dimensions are:
- Signal distortion: How much has the speech signal itself been distorted by the enhancement process?
- Background noise: How effectively has background noise been removed?
- Overall quality: What is the holistic impression of the enhanced signal?
- Intelligibility: Can a listener understand the words being spoken?
Different metrics focus on different dimensions, making a suite of complementary metrics essential for comprehensive evaluation.
PESQ (Perceptual Evaluation of Speech Quality)
PESQ (ITU-T Recommendation P.862) is the most widely used intrusive speech quality metric. It models the human auditory system to predict the Mean Opinion Score (MOS) that human listeners would assign.
| Property | Value |
|---|---|
| Range | -0.5 to 4.5 (raw); typically 1.0 to 4.5 for enhanced speech |
| Mode | Wideband (wb) for 16 kHz; Narrowband (nb) for 8 kHz |
| Interpretation | Higher is better; 4.5 = indistinguishable from clean |
| Computation | Compares time-aligned reference and degraded signals through perceptual model |
PESQ processing pipeline:
- Level alignment of reference and degraded signals
- Time alignment using cross-correlation
- Auditory transform (Bark scale filterbank)
- Disturbance density computation
- Cognitive model aggregation
PESQ is the primary metric used in SpeechBrain for checkpoint selection (`max_keys=["pesq"]`) and is the training target for MetricGAN+.
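The effect of selecting checkpoints by maximum validation PESQ (rather than minimum training loss) can be sketched as follows; the `validation_stats` structure and `best_checkpoint` helper are illustrative, not SpeechBrain API:

```python
# Sketch of PESQ-based checkpoint selection (mirrors the intent of
# max_keys=["pesq"]; all names here are hypothetical).

def best_checkpoint(validation_stats):
    """Pick the epoch with the highest validation PESQ.

    validation_stats: list of dicts like {"epoch": 1, "pesq": 2.1, "loss": 0.2}
    """
    return max(validation_stats, key=lambda s: s["pesq"])

stats = [
    {"epoch": 1, "pesq": 2.10, "loss": 0.20},
    {"epoch": 2, "pesq": 2.35, "loss": 0.15},
    {"epoch": 3, "pesq": 2.28, "loss": 0.11},  # lowest loss, but worse PESQ
]
best = best_checkpoint(stats)  # epoch 2 is kept despite epoch 3's lower loss
```

Note that the epoch with the lowest loss is not selected: loss and perceptual quality are correlated but not identical, which is exactly why PESQ is monitored separately.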
STOI (Short-Time Objective Intelligibility)
STOI predicts speech intelligibility rather than quality. It measures how well the temporal envelope of speech is preserved across frequency bands.
| Property | Value |
|---|---|
| Range | 0 to 1 |
| Interpretation | Higher is better; 1.0 = perfect intelligibility |
| Computation | Correlation of short-time temporal envelopes in 1/3 octave bands |
| Strength | Better predictor of intelligibility than PESQ |
| Limitation | Less sensitive to non-linear distortions |
STOI is particularly important for hearing aid applications where intelligibility matters more than subjective quality.
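As a toy illustration of the core idea behind STOI, the snippet below computes the Pearson correlation between a clean and a processed temporal envelope in a single band. The real metric does this over short-time segments in 1/3-octave bands with clipping and normalization; the function and data here are purely illustrative:

```python
import math

def envelope_correlation(clean_env, proc_env):
    """Pearson correlation of two temporal envelopes (toy, single band)."""
    n = len(clean_env)
    mc = sum(clean_env) / n
    mp = sum(proc_env) / n
    num = sum((c - mc) * (p - mp) for c, p in zip(clean_env, proc_env))
    den = math.sqrt(sum((c - mc) ** 2 for c in clean_env)
                    * sum((p - mp) ** 2 for p in proc_env))
    return num / den

clean = [0.1, 0.8, 0.5, 0.9, 0.2]
preserved = envelope_correlation(clean, clean)  # identical envelope: correlation 1
degraded = envelope_correlation(clean, [0.3, 0.7, 0.6, 0.8, 0.4])  # slightly lower
```

An enhancement system that flattens or smears these envelopes lowers the correlation, and hence the predicted intelligibility, even if overall energy is preserved.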
Composite Metrics (CSIG, CBAK, COVL)
The composite metrics introduced by Hu & Loizou (2008) are regression-based combinations of simpler signal-level measures. They predict subjective ratings on three specific quality dimensions:
CSIG (Signal Distortion)
Predicts the Mean Opinion Score for signal distortion (MOS-SIG):
CSIG = 3.093 - 1.029 * LLR + 0.603 * PESQ - 0.009 * WSS
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Interpretation | 5 = no signal distortion |
| Focus | How much the speech signal itself has been damaged |
CBAK (Background Noise)
Predicts the Mean Opinion Score for background intrusiveness (MOS-BAK):
CBAK = 1.634 + 0.478 * PESQ - 0.007 * WSS + 0.063 * segSNR
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Interpretation | 5 = no background noise audible |
| Focus | How intrusive the remaining background noise is |
COVL (Overall Quality)
Predicts the overall Mean Opinion Score (MOS-OVL):
COVL = 1.594 + 0.805 * PESQ - 0.512 * LLR - 0.007 * WSS
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Interpretation | 5 = perfect overall quality |
| Focus | Holistic quality judgment |
Underlying Sub-Metrics
The composite metrics are derived from three signal-level measures:
- WSS (Weighted Spectral Slope): Measures spectral distortion using critical-band weighted spectral slope differences. Lower is better.
- LLR (Log Likelihood Ratio): Measures spectral envelope distortion using LPC analysis. Lower is better.
- SSNR (Segmental Signal-to-Noise Ratio; the segSNR term in the CBAK formula): Frame-level SNR averaged across the utterance. Higher is better.
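The three composite formulas above translate directly into Python. The clipping of scores to the reported 1 to 5 range follows common reference implementations of Hu & Loizou's measures; the function names are ours:

```python
def _clip(x, lo=1.0, hi=5.0):
    """Composite scores are reported on a 1-5 MOS scale."""
    return max(lo, min(hi, x))

def csig(llr, pesq, wss):
    """MOS-SIG prediction: signal distortion (5 = no distortion)."""
    return _clip(3.093 - 1.029 * llr + 0.603 * pesq - 0.009 * wss)

def cbak(pesq, wss, segsnr):
    """MOS-BAK prediction: background intrusiveness (5 = no audible noise)."""
    return _clip(1.634 + 0.478 * pesq - 0.007 * wss + 0.063 * segsnr)

def covl(pesq, llr, wss):
    """MOS-OVL prediction: overall quality (5 = perfect)."""
    return _clip(1.594 + 0.805 * pesq - 0.512 * llr - 0.007 * wss)

# Example: sub-metric values for a moderately enhanced utterance
sig = csig(llr=0.5, pesq=2.0, wss=40.0)
bak = cbak(pesq=2.0, wss=40.0, segsnr=5.0)
ovl = covl(pesq=2.0, llr=0.5, wss=40.0)
```

Note how PESQ enters all three regressions with a positive weight, while WSS and LLR (where lower is better) enter with negative weights.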
DNSMOS (Deep Noise Suppression MOS)
DNSMOS is a non-intrusive (reference-free) neural quality estimator developed by Microsoft for the DNS Challenge. Unlike PESQ and STOI, it does not require the clean reference signal.
| Property | Value |
|---|---|
| Range | 1 to 5 |
| Model | ONNX neural network with polynomial calibration |
| Sub-scores | SIG (signal), BAK (background), OVRL (overall) |
| Advantage | No clean reference needed; fast batch evaluation |
| Limitation | Less precise than intrusive metrics for controlled evaluations |
DNSMOS uses an ONNX model that processes 9-second audio segments and outputs raw scores that are calibrated via polynomial fitting:
p_ovr = poly1d([-0.06766283, 1.11546468, 0.04602535])
p_sig = poly1d([-0.08397278, 1.22083953, 0.0052439])
p_bak = poly1d([-0.13166888, 1.60915514, -0.39604546])
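The poly1d coefficients are listed highest degree first, so each calibration is a quadratic in the raw network score. A dependency-free sketch of the mapping (Horner evaluation, same convention as numpy.poly1d):

```python
def calibrate(raw, coeffs):
    """Evaluate a polynomial given highest-degree-first coefficients
    (the numpy.poly1d convention), using Horner's method."""
    result = 0.0
    for c in coeffs:
        result = result * raw + c
    return result

# Calibration coefficients from the DNSMOS polynomial fits above
P_OVR = (-0.06766283, 1.11546468, 0.04602535)
P_SIG = (-0.08397278, 1.22083953, 0.0052439)
P_BAK = (-0.13166888, 1.60915514, -0.39604546)

# A raw OVRL network output of 3.0 maps to a calibrated score around 2.78
mos_ovr = calibrate(3.0, P_OVR)
```

The small negative quadratic terms compress the top of the scale, pulling optimistic raw predictions closer to the subjective MOS distribution the model was fitted against.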
Metric Selection Guidelines
| Use Case | Recommended Metrics | Rationale |
|---|---|---|
| Model checkpoint selection | PESQ | Best single-metric predictor of quality |
| Comprehensive benchmarking | PESQ + STOI + CSIG + CBAK + COVL | Covers quality, intelligibility, and sub-dimensions |
| Intelligibility-focused tasks | STOI | Directly predicts word recognition |
| Reference-free evaluation | DNSMOS | When clean reference is unavailable |
| MetricGAN+ training target | PESQ or STOI | Discriminator learns to predict these |
Relationship to Training
The evaluation metrics connect to the training workflow in several ways:
- Training monitoring: PESQ and STOI are computed on the validation set after each epoch, providing feedback on training progress
- Checkpoint selection: The best model checkpoint is selected based on validation PESQ, not training loss
- MetricGAN+ target: PESQ (or STOI) scores serve as training targets for the discriminator in GAN-based training
- Final reporting: All metrics are computed on the held-out test set using the best checkpoint
See Also
- Implementation:Speechbrain_Speechbrain_Composite_Eval_Metrics -- The concrete implementations of composite metrics, PESQ, STOI, and DNSMOS
- Principle:Speechbrain_Speechbrain_GAN_Based_Enhancement_Training -- How PESQ is used as a training target in MetricGAN+
- Principle:Speechbrain_Speechbrain_Conventional_Enhancement_Training -- How metrics are used for monitoring in conventional training