Implementation:Speechbrain Speechbrain Composite Eval Metrics
| Property | Value |
|---|---|
| Implementation Name | Composite_Eval_Metrics |
| API | eval_composite(ref_wav, deg_wav, sample_rate); pesq(fs, ref, deg, mode) (external); stoi(ref, deg, fs, extended=False) (external) |
| Source File | recipes/DNS/enhancement/composite_eval.py (L1-466); DNSMOS: recipes/DNS/enhancement/dnsmos_local.py (L1-191) |
| Import | from composite_eval import eval_composite; from pesq import pesq; from pystoi import stoi |
| Type | External Tool Doc |
| Workflow | Speech_Enhancement_Training |
| Domains | Evaluation_Metrics, Speech_Enhancement |
| Related Principle | Principle:Speechbrain_Speechbrain_Perceptual_Quality_Evaluation |
Purpose
This implementation covers the suite of speech quality evaluation tools used in SpeechBrain's speech enhancement recipes. The primary function eval_composite() computes composite quality metrics (CSIG, CBAK, COVL) from underlying signal-level measures (WSS, LLR, SSNR, PESQ). External packages provide PESQ and STOI evaluation, and a separate DNSMOS module provides reference-free neural quality estimation.
eval_composite Function
Signature
def eval_composite(ref_wav, deg_wav, sample_rate):
"""Evaluate audio quality metrics based on reference
and degraded audio signals.
Arguments
---------
ref_wav : numpy.ndarray
Clean reference audio signal (1-D array).
deg_wav : numpy.ndarray
Degraded/enhanced audio signal (1-D array).
sample_rate : int
Sample rate of the audio signals (e.g., 16000).
Returns
-------
dict
Dictionary with keys: 'pesq', 'csig', 'cbak', 'covl'.
"""
Return Values
| Key | Range | Description |
|---|---|---|
| pesq | -0.5 to 4.5 | Raw PESQ score (wideband at 16 kHz, narrowband otherwise) |
| csig | 1 to 5 | Signal distortion MOS prediction |
| cbak | 1 to 5 | Background intrusiveness MOS prediction |
| covl | 1 to 5 | Overall quality MOS prediction |
Internal Computation
def eval_composite(ref_wav, deg_wav, sample_rate):
ref_wav = ref_wav.reshape(-1)
deg_wav = deg_wav.reshape(-1)
alpha = 0.95
len_ = min(ref_wav.shape[0], deg_wav.shape[0])
ref_wav = ref_wav[:len_]
deg_wav = deg_wav[:len_]
# Compute sub-metrics
wss_dist_vec = wss(ref_wav, deg_wav, sample_rate)
wss_dist_vec = sorted(wss_dist_vec, reverse=False)
wss_dist = np.mean(wss_dist_vec[: int(round(len(wss_dist_vec) * alpha))])
LLR_dist = llr(ref_wav, deg_wav, sample_rate)
LLR_dist = sorted(LLR_dist, reverse=False)
llr_mean = np.mean(LLR_dist[: round(len(LLR_dist) * alpha)])
snr_mean, segsnr_mean = SSNR(ref_wav, deg_wav, sample_rate)
segSNR = np.mean(segsnr_mean)
pesq_raw = PESQ(ref_wav, deg_wav, sample_rate)
# Compute composite metrics via regression formulas
Csig = 3.093 - 1.029 * llr_mean + 0.603 * pesq_raw - 0.009 * wss_dist
Csig = trim_mos(Csig)
Cbak = 1.634 + 0.478 * pesq_raw - 0.007 * wss_dist + 0.063 * segSNR
Cbak = trim_mos(Cbak)
Covl = 1.594 + 0.805 * pesq_raw - 0.512 * llr_mean - 0.007 * wss_dist
Covl = trim_mos(Covl)
return {"pesq": pesq_raw, "csig": Csig, "cbak": Cbak, "covl": Covl}
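The trim_mos() helper called above is not reproduced in this excerpt; its role is simply to clamp each regression output into the valid MOS range. A minimal sketch (the signature and default bounds are assumptions consistent with the [1, 5] ranges documented here):

```python
def trim_mos(val, lower=1.0, upper=5.0):
    """Clamp a raw regression output into the valid MOS range [1, 5]."""
    return min(max(val, lower), upper)

print(trim_mos(0.4), trim_mos(3.2), trim_mos(6.1))  # 1.0 3.2 5.0
```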
Sub-Metric Functions
WSS (Weighted Spectral Slope)
def wss(ref_wav, deg_wav, srate):
"""Calculate Weighted Spectral Slope distortion measure.
Uses 25 critical band filters with Gaussian shapes.
Computes spectral slope differences weighted by proximity
to spectral peaks (Klatt 1982).
Returns
-------
list of float
WSS distortion value for each frame.
"""
Key parameters:
- Window length: 30 ms (480 samples at 16 kHz)
- Skip rate: 25% of window length
- Number of critical bands: 25
- FFT size: next power of 2 above 2 * window length
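The parameters above can be derived directly from the sample rate. A sketch of that arithmetic (variable names here are illustrative, not copied from the recipe):

```python
import numpy as np

# Framing parameters for WSS at 16 kHz, following the bullet list above.
srate = 16000
winlength = int(np.round(30 * srate / 1000))       # 30 ms window -> 480 samples
skiprate = int(np.floor(winlength / 4))            # 25% of window -> 120 samples
n_fft = int(2 ** np.ceil(np.log2(2 * winlength)))  # next power of 2 above 960 -> 1024
num_crit = 25                                      # critical band filters

print(winlength, skiprate, n_fft)  # 480 120 1024
```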
LLR (Log Likelihood Ratio)
def llr(ref_wav, deg_wav, srate):
"""Calculate Log Likelihood Ratio distortion measure.
Uses LPC analysis (order 10 for <10 kHz, 16 for >=10 kHz)
and computes ratio of spectral envelopes.
Returns
-------
numpy.ndarray
LLR distortion value for each frame.
"""
SSNR (Segmental Signal-to-Noise Ratio)
def SSNR(ref_wav, deg_wav, srate=16000, eps=1e-10):
"""Segmental Signal-to-Noise Ratio.
Computes per-frame SNR, clipped to [-10, 35] dB range.
Returns
-------
tuple
(overall_snr, list_of_segmental_snr_values)
"""
External PESQ
Signature
from pesq import pesq
score = pesq(fs=16000, ref=ref_array, deg=deg_array, mode="wb")
Parameters
| Parameter | Type | Description |
|---|---|---|
| fs | int | Sample rate (8000 for narrowband, 16000 for wideband) |
| ref | numpy.ndarray | Clean reference signal |
| deg | numpy.ndarray | Degraded/enhanced signal |
| mode | str | "wb" (wideband) or "nb" (narrowband) |
Usage in SpeechBrain Training
# Normalized PESQ for MetricGAN+ (maps to 0-1 range)
def pesq_eval(pred_wav, target_wav):
"""Normalized PESQ (to 0-1)"""
return (
pesq(fs=16000, ref=target_wav.numpy(), deg=pred_wav.numpy(), mode="wb")
+ 0.5
) / 5
# Raw PESQ for SEBrain validation
def pesq_eval(pred_wav, target_wav):
return pesq(
fs=16000, ref=target_wav.numpy(),
deg=pred_wav.numpy(), mode="wb",
)
External STOI
Signature
from pystoi import stoi
score = stoi(ref, deg, fs, extended=False)
Parameters
| Parameter | Type | Description |
|---|---|---|
| ref | numpy.ndarray | Clean reference signal |
| deg | numpy.ndarray | Degraded/enhanced signal |
| fs | int | Sample rate |
| extended | bool | If True, use extended STOI (better for non-linear distortions) |
SpeechBrain STOI Loss (Differentiable)
SpeechBrain also provides a differentiable STOI implementation for use as a training loss or metric:
from speechbrain.nnet.loss.stoi_loss import stoi_loss
# Used as a metric during validation
stoi_metric = MetricStats(metric=stoi_loss)
stoi_metric.append(
batch_id, predict_wav, clean_wavs, lens, reduction="batch"
)
# Note: stoi_loss returns negative STOI, so negate for reporting
stoi_value = -stoi_metric.summarize("average")
DNSMOS (ComputeScore)
Class Signature
from dnsmos_local import ComputeScore
compute_score = ComputeScore(primary_model_path="DNSMOS/sig_bak_ovr.onnx")
result = compute_score(
fpath="enhanced_audio.wav",
sampling_rate=16000,
is_personalized_MOS=False
)
Return Values
| Key | Range | Description |
|---|---|---|
| SIG | 1 to 5 | Signal quality (calibrated) |
| BAK | 1 to 5 | Background quality (calibrated) |
| OVRL | 1 to 5 | Overall quality (calibrated) |
| SIG_raw | float | Raw model output for signal |
| BAK_raw | float | Raw model output for background |
| OVRL_raw | float | Raw model output for overall |
Processing Details
class ComputeScore:
    def __init__(self, primary_model_path):
        self.onnx_sess = ort.InferenceSession(primary_model_path)

    def __call__(self, fpath, sampling_rate, is_personalized_MOS):
        # Load audio and resample to 16 kHz if needed
        audio, input_fs = sf.read(fpath)
        fs = sampling_rate
        if input_fs != fs:
            audio = librosa.resample(audio, orig_sr=input_fs, target_sr=fs)
        # Pad to minimum length (9.01 seconds) by self-repetition
        len_samples = int(9.01 * fs)
        while len(audio) < len_samples:
            audio = np.append(audio, audio)
        # Process in 1-second hops over 9.01-second segments
        for idx in range(num_hops):
            audio_seg = audio[int(idx * fs) : int((idx + 9.01) * fs)]
            input_features = np.array(audio_seg).astype("float32")[np.newaxis, :]
            mos_sig_raw, mos_bak_raw, mos_ovr_raw = self.onnx_sess.run(
                None, {"input_1": input_features}
            )[0][0]
            # Apply polynomial calibration to the raw model outputs
            mos_sig, mos_bak, mos_ovr = self.get_polyfit_val(
                mos_sig_raw, mos_bak_raw, mos_ovr_raw, is_personalized_MOS
            )
        # Average calibrated scores across segments
        return {"SIG": np.mean(predicted_mos_sig_seg), ...}
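The padding and hop arithmetic above can be sketched in isolation (num_hops is computed as in the loop bound above; a 12-second clip is an assumed example):

```python
import numpy as np

# A 12-second clip at 16 kHz needs no padding and yields 3 hops:
# floor(12) - 9.01 -> 2.99, truncated to 2, plus 1.
fs = 16000
audio = np.zeros(12 * fs)
len_samples = int(9.01 * fs)          # minimum length: 9.01 s of samples
while len(audio) < len_samples:       # pad short clips by self-repetition
    audio = np.append(audio, audio)
num_hops = int(np.floor(len(audio) / fs) - 9.01) + 1
print(num_hops)  # 3
```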
Usage Examples
Evaluating a Single Utterance
import numpy as np
import librosa
from composite_eval import eval_composite
from pesq import pesq
from pystoi import stoi
# Load audio files
clean, sr = librosa.load("clean_speech.wav", sr=16000)
enhanced, sr = librosa.load("enhanced_speech.wav", sr=16000)
# Compute composite metrics
composite_scores = eval_composite(clean, enhanced, sample_rate=16000)
print(f"PESQ: {composite_scores['pesq']:.3f}")
print(f"CSIG: {composite_scores['csig']:.3f}")
print(f"CBAK: {composite_scores['cbak']:.3f}")
print(f"COVL: {composite_scores['covl']:.3f}")
# Compute STOI separately
stoi_score = stoi(clean, enhanced, 16000, extended=False)
print(f"STOI: {stoi_score:.3f}")
Batch Evaluation of Test Set
import os
import numpy as np
import librosa
from composite_eval import eval_composite
from pystoi import stoi
from tqdm import tqdm
clean_dir = "/data/noisy-vctk-16k/clean_testset_wav_16k"
enhanced_dir = "results/enhanced_wavs"
all_pesq, all_csig, all_cbak, all_covl, all_stoi = [], [], [], [], []
for filename in tqdm(os.listdir(clean_dir)):
if not filename.endswith(".wav"):
continue
clean, sr = librosa.load(os.path.join(clean_dir, filename), sr=16000)
enhanced, sr = librosa.load(os.path.join(enhanced_dir, filename), sr=16000)
# Composite metrics (includes PESQ)
scores = eval_composite(clean, enhanced, sample_rate=16000)
all_pesq.append(scores["pesq"])
all_csig.append(scores["csig"])
all_cbak.append(scores["cbak"])
all_covl.append(scores["covl"])
# STOI
all_stoi.append(stoi(clean, enhanced, 16000))
print(f"PESQ: {np.mean(all_pesq):.3f}")
print(f"CSIG: {np.mean(all_csig):.3f}")
print(f"CBAK: {np.mean(all_cbak):.3f}")
print(f"COVL: {np.mean(all_covl):.3f}")
print(f"STOI: {np.mean(all_stoi):.3f}")
DNSMOS Evaluation (Reference-Free)
from dnsmos_local import ComputeScore
# Initialize with ONNX model path
compute_score = ComputeScore("DNSMOS/sig_bak_ovr.onnx")
# Evaluate a single file (no clean reference needed)
result = compute_score(
fpath="enhanced_speech.wav",
sampling_rate=16000,
is_personalized_MOS=False,
)
print(f"DNSMOS SIG: {result['SIG']:.3f}")
print(f"DNSMOS BAK: {result['BAK']:.3f}")
print(f"DNSMOS OVRL: {result['OVRL']:.3f}")
Dependencies
| Package | Version | Used For |
|---|---|---|
| pesq | >= 0.0.3 | PESQ score computation |
| pystoi | >= 0.3 | STOI score computation |
| numpy | >= 1.20 | Array operations for all metrics |
| scipy | >= 1.7 | Toeplitz matrix for LLR computation |
| librosa | >= 0.8 | Audio loading and resampling |
| onnxruntime | >= 1.10 | DNSMOS model inference |
| soundfile | >= 0.10 | Audio I/O for DNSMOS |
Notes and Edge Cases
- Length mismatch handling:
eval_compositeautomatically truncates both signals to the minimum length, preventing errors from slight length differences - Alpha trimming: WSS and LLR values are alpha-trimmed (top 5% removed) before averaging, reducing the impact of outlier frames
- MOS clamping: All composite scores are clamped to [1, 5] via
trim_mos() - DNSMOS padding: Audio shorter than 9.01 seconds is padded by self-repetition
- NaN handling: LLR computation uses
np.nan_to_num()to handle edge cases where the ratio computation produces NaN values - Negative STOI in SpeechBrain: The
stoi_lossfunction returns negative STOI (for use as a minimization loss), so it must be negated when reporting
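The alpha-trimming step can be illustrated with synthetic frame scores (the data here is invented for demonstration; alpha = 0.95 matches eval_composite):

```python
import numpy as np

# Sort per-frame scores ascending, keep the lowest 95%, then average.
alpha = 0.95
frame_scores = np.concatenate([np.ones(19), [50.0]])  # one outlier frame
kept = sorted(frame_scores)[: int(round(len(frame_scores) * alpha))]
print(np.mean(kept))  # 1.0 -- the outlier frame is dropped
```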
See Also
- Principle:Speechbrain_Speechbrain_Perceptual_Quality_Evaluation -- The theoretical foundation for perceptual evaluation
- Implementation:Speechbrain_Speechbrain_MetricGanBrain_Fit_Batch -- How PESQ is used as a discriminator training target
- Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward -- How metrics are computed during conventional training validation