Implementation:Speechbrain Speechbrain Composite Eval Metrics
| Property | Value |
|---|---|
| Implementation Name | Composite_Eval_Metrics |
| API | eval_composite(ref_wav, deg_wav, sample_rate); pesq(fs, ref, deg, mode) (external); stoi(ref, deg, fs, extended=False) (external) |
| Source File | recipes/DNS/enhancement/composite_eval.py (L1-466); DNSMOS: recipes/DNS/enhancement/dnsmos_local.py (L1-191) |
| Import | from composite_eval import eval_composite; from pesq import pesq; from pystoi import stoi |
| Type | External Tool Doc |
| Workflow | Speech_Enhancement_Training |
| Domains | Evaluation_Metrics, Speech_Enhancement |
| Related Principle | Principle:Speechbrain_Speechbrain_Perceptual_Quality_Evaluation |
Purpose
This implementation covers the suite of speech quality evaluation tools used in SpeechBrain's speech enhancement recipes. The primary function eval_composite() computes composite quality metrics (CSIG, CBAK, COVL) from underlying signal-level measures (WSS, LLR, SSNR, PESQ). External packages provide PESQ and STOI evaluation, and a separate DNSMOS module provides reference-free neural quality estimation.
eval_composite Function
Signature
def eval_composite(ref_wav, deg_wav, sample_rate):
"""Evaluate audio quality metrics based on reference
and degraded audio signals.
Arguments
---------
ref_wav : numpy.ndarray
Clean reference audio signal (1-D array).
deg_wav : numpy.ndarray
Degraded/enhanced audio signal (1-D array).
sample_rate : int
Sample rate of the audio signals (e.g., 16000).
Returns
-------
dict
Dictionary with keys: 'pesq', 'csig', 'cbak', 'covl'.
"""
Return Values
| Key | Range | Description |
|---|---|---|
| pesq | -0.5 to 4.5 | Raw PESQ score (wideband at 16 kHz, narrowband otherwise) |
| csig | 1 to 5 | Signal distortion MOS prediction |
| cbak | 1 to 5 | Background intrusiveness MOS prediction |
| covl | 1 to 5 | Overall quality MOS prediction |
Internal Computation
def eval_composite(ref_wav, deg_wav, sample_rate):
ref_wav = ref_wav.reshape(-1)
deg_wav = deg_wav.reshape(-1)
alpha = 0.95
len_ = min(ref_wav.shape[0], deg_wav.shape[0])
ref_wav = ref_wav[:len_]
deg_wav = deg_wav[:len_]
# Compute sub-metrics
wss_dist_vec = wss(ref_wav, deg_wav, sample_rate)
wss_dist_vec = sorted(wss_dist_vec, reverse=False)
wss_dist = np.mean(wss_dist_vec[: int(round(len(wss_dist_vec) * alpha))])
LLR_dist = llr(ref_wav, deg_wav, sample_rate)
LLR_dist = sorted(LLR_dist, reverse=False)
llr_mean = np.mean(LLR_dist[: round(len(LLR_dist) * alpha)])
snr_mean, segsnr_mean = SSNR(ref_wav, deg_wav, sample_rate)
segSNR = np.mean(segsnr_mean)
pesq_raw = PESQ(ref_wav, deg_wav, sample_rate)
# Compute composite metrics via regression formulas
Csig = 3.093 - 1.029 * llr_mean + 0.603 * pesq_raw - 0.009 * wss_dist
Csig = trim_mos(Csig)
Cbak = 1.634 + 0.478 * pesq_raw - 0.007 * wss_dist + 0.063 * segSNR
Cbak = trim_mos(Cbak)
Covl = 1.594 + 0.805 * pesq_raw - 0.512 * llr_mean - 0.007 * wss_dist
Covl = trim_mos(Covl)
return {"pesq": pesq_raw, "csig": Csig, "cbak": Cbak, "covl": Covl}
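The trim_mos() helper called above is not reproduced in this excerpt; its role is simply to clamp each regression output into the valid MOS range. A minimal sketch (the signature and default bounds are assumptions consistent with the [1, 5] ranges documented here):

```python
def trim_mos(val, lower=1.0, upper=5.0):
    """Clamp a raw regression output into the valid MOS range [1, 5]."""
    return min(max(val, lower), upper)

print(trim_mos(0.4), trim_mos(3.2), trim_mos(6.1))  # 1.0 3.2 5.0
```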
Sub-Metric Functions
WSS (Weighted Spectral Slope)
def wss(ref_wav, deg_wav, srate):
"""Calculate Weighted Spectral Slope distortion measure.
Uses 25 critical band filters with Gaussian shapes.
Computes spectral slope differences weighted by proximity
to spectral peaks (Klatt 1982).
Returns
-------
list of float
WSS distortion value for each frame.
"""
Key parameters:
- Window length: 30 ms (480 samples at 16 kHz)
- Skip rate: 25% of window length
- Number of critical bands: 25
- FFT size: next power of 2 above 2 * window length
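The parameters above can be derived directly from the sample rate. A sketch of that arithmetic (variable names here are illustrative, not copied from the recipe):

```python
import numpy as np

# Framing parameters for WSS at 16 kHz, following the bullet list above.
srate = 16000
winlength = int(np.round(30 * srate / 1000))       # 30 ms window -> 480 samples
skiprate = int(np.floor(winlength / 4))            # 25% of window -> 120 samples
n_fft = int(2 ** np.ceil(np.log2(2 * winlength)))  # next power of 2 above 960 -> 1024
num_crit = 25                                      # critical band filters

print(winlength, skiprate, n_fft)  # 480 120 1024
```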
LLR (Log Likelihood Ratio)
def llr(ref_wav, deg_wav, srate):
"""Calculate Log Likelihood Ratio distortion measure.
Uses LPC analysis (order 10 for <10 kHz, 16 for >=10 kHz)
and computes ratio of spectral envelopes.
Returns
-------
numpy.ndarray
LLR distortion value for each frame.
"""
SSNR (Segmental Signal-to-Noise Ratio)
def SSNR(ref_wav, deg_wav, srate=16000, eps=1e-10):
"""Segmental Signal-to-Noise Ratio.
Computes per-frame SNR, clipped to [-10, 35] dB range.
Returns
-------
tuple
(overall_snr, list_of_segmental_snr_values)
"""
External PESQ
Signature
from pesq import pesq
score = pesq(fs=16000, ref=ref_array, deg=deg_array, mode="wb")
Parameters
| Parameter | Type | Description |
|---|---|---|
| fs | int | Sample rate (8000 for narrowband, 16000 for wideband) |
| ref | numpy.ndarray | Clean reference signal |
| deg | numpy.ndarray | Degraded/enhanced signal |
| mode | str | "wb" (wideband) or "nb" (narrowband) |
Usage in SpeechBrain Training
# Normalized PESQ for MetricGAN+ (maps to 0-1 range)
def pesq_eval(pred_wav, target_wav):
"""Normalized PESQ (to 0-1)"""
return (
pesq(fs=16000, ref=target_wav.numpy(), deg=pred_wav.numpy(), mode="wb")
+ 0.5
) / 5
# Raw PESQ for SEBrain validation
def pesq_eval(pred_wav, target_wav):
return pesq(
fs=16000, ref=target_wav.numpy(),
deg=pred_wav.numpy(), mode="wb",
)
External STOI
Signature
from pystoi import stoi
score = stoi(ref, deg, fs, extended=False)
Parameters
| Parameter | Type | Description |
|---|---|---|
| ref | numpy.ndarray | Clean reference signal |
| deg | numpy.ndarray | Degraded/enhanced signal |
| fs | int | Sample rate |
| extended | bool | If True, use extended STOI (better for non-linear distortions) |
SpeechBrain STOI Loss (Differentiable)
SpeechBrain also provides a differentiable STOI implementation for use as a training loss or metric:
from speechbrain.nnet.loss.stoi_loss import stoi_loss
# Used as a metric during validation
stoi_metric = MetricStats(metric=stoi_loss)
stoi_metric.append(
batch_id, predict_wav, clean_wavs, lens, reduction="batch"
)
# Note: stoi_loss returns negative STOI, so negate for reporting
stoi_value = -stoi_metric.summarize("average")
DNSMOS (ComputeScore)
Class Signature
from dnsmos_local import ComputeScore
compute_score = ComputeScore(primary_model_path="DNSMOS/sig_bak_ovr.onnx")
result = compute_score(
fpath="enhanced_audio.wav",
sampling_rate=16000,
is_personalized_MOS=False
)
Return Values
| Key | Range | Description |
|---|---|---|
| SIG | 1 to 5 | Signal quality (calibrated) |
| BAK | 1 to 5 | Background quality (calibrated) |
| OVRL | 1 to 5 | Overall quality (calibrated) |
| SIG_raw | float | Raw model output for signal |
| BAK_raw | float | Raw model output for background |
| OVRL_raw | float | Raw model output for overall |
Processing Details
class ComputeScore:
    def __init__(self, primary_model_path):
        self.onnx_sess = ort.InferenceSession(primary_model_path)

    def __call__(self, fpath, sampling_rate, is_personalized_MOS):
        # Load audio and resample to 16 kHz if needed
        audio, input_fs = sf.read(fpath)
        fs = sampling_rate
        if input_fs != fs:
            audio = librosa.resample(audio, orig_sr=input_fs, target_sr=fs)
        # Pad to minimum length (9.01 seconds) by self-repetition
        len_samples = int(9.01 * fs)
        while len(audio) < len_samples:
            audio = np.append(audio, audio)
        # Process in 1-second hops over 9.01-second segments
        for idx in range(num_hops):
            audio_seg = audio[int(idx * fs) : int((idx + 9.01) * fs)]
            input_features = np.array(audio_seg).astype("float32")[np.newaxis, :]
            mos_sig_raw, mos_bak_raw, mos_ovr_raw = self.onnx_sess.run(
                None, {"input_1": input_features}
            )[0][0]
            # Apply polynomial calibration to the raw model outputs
            mos_sig, mos_bak, mos_ovr = self.get_polyfit_val(
                mos_sig_raw, mos_bak_raw, mos_ovr_raw, is_personalized_MOS
            )
        # Average calibrated scores across segments
        return {"SIG": np.mean(predicted_mos_sig_seg), ...}
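The padding and hop arithmetic above can be sketched in isolation (num_hops is computed as in the loop bound above; a 12-second clip is an assumed example):

```python
import numpy as np

# A 12-second clip at 16 kHz needs no padding and yields 3 hops:
# floor(12) - 9.01 -> 2.99, truncated to 2, plus 1.
fs = 16000
audio = np.zeros(12 * fs)
len_samples = int(9.01 * fs)          # minimum length: 9.01 s of samples
while len(audio) < len_samples:       # pad short clips by self-repetition
    audio = np.append(audio, audio)
num_hops = int(np.floor(len(audio) / fs) - 9.01) + 1
print(num_hops)  # 3
```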
Usage Examples
Evaluating a Single Utterance
import numpy as np
import librosa
from composite_eval import eval_composite
from pesq import pesq
from pystoi import stoi
# Load audio files
clean, sr = librosa.load("clean_speech.wav", sr=16000)
enhanced, sr = librosa.load("enhanced_speech.wav", sr=16000)
# Compute composite metrics
composite_scores = eval_composite(clean, enhanced, sample_rate=16000)
print(f"PESQ: {composite_scores['pesq']:.3f}")
print(f"CSIG: {composite_scores['csig']:.3f}")
print(f"CBAK: {composite_scores['cbak']:.3f}")
print(f"COVL: {composite_scores['covl']:.3f}")
# Compute STOI separately
stoi_score = stoi(clean, enhanced, 16000, extended=False)
print(f"STOI: {stoi_score:.3f}")
Batch Evaluation of Test Set
import os
import numpy as np
import librosa
from composite_eval import eval_composite
from pystoi import stoi
from tqdm import tqdm
clean_dir = "/data/noisy-vctk-16k/clean_testset_wav_16k"
enhanced_dir = "results/enhanced_wavs"
all_pesq, all_csig, all_cbak, all_covl, all_stoi = [], [], [], [], []
for filename in tqdm(os.listdir(clean_dir)):
if not filename.endswith(".wav"):
continue
clean, sr = librosa.load(os.path.join(clean_dir, filename), sr=16000)
enhanced, sr = librosa.load(os.path.join(enhanced_dir, filename), sr=16000)
# Composite metrics (includes PESQ)
scores = eval_composite(clean, enhanced, sample_rate=16000)
all_pesq.append(scores["pesq"])
all_csig.append(scores["csig"])
all_cbak.append(scores["cbak"])
all_covl.append(scores["covl"])
# STOI
all_stoi.append(stoi(clean, enhanced, 16000))
print(f"PESQ: {np.mean(all_pesq):.3f}")
print(f"CSIG: {np.mean(all_csig):.3f}")
print(f"CBAK: {np.mean(all_cbak):.3f}")
print(f"COVL: {np.mean(all_covl):.3f}")
print(f"STOI: {np.mean(all_stoi):.3f}")
DNSMOS Evaluation (Reference-Free)
from dnsmos_local import ComputeScore
# Initialize with ONNX model path
compute_score = ComputeScore("DNSMOS/sig_bak_ovr.onnx")
# Evaluate a single file (no clean reference needed)
result = compute_score(
fpath="enhanced_speech.wav",
sampling_rate=16000,
is_personalized_MOS=False,
)
print(f"DNSMOS SIG: {result['SIG']:.3f}")
print(f"DNSMOS BAK: {result['BAK']:.3f}")
print(f"DNSMOS OVRL: {result['OVRL']:.3f}")
Dependencies
| Package | Version | Used For |
|---|---|---|
| pesq | >= 0.0.3 | PESQ score computation |
| pystoi | >= 0.3 | STOI score computation |
| numpy | >= 1.20 | Array operations for all metrics |
| scipy | >= 1.7 | Toeplitz matrix for LLR computation |
| librosa | >= 0.8 | Audio loading and resampling |
| onnxruntime | >= 1.10 | DNSMOS model inference |
| soundfile | >= 0.10 | Audio I/O for DNSMOS |
Notes and Edge Cases
- Length mismatch handling:
eval_compositeautomatically truncates both signals to the minimum length, preventing errors from slight length differences - Alpha trimming: WSS and LLR values are alpha-trimmed (top 5% removed) before averaging, reducing the impact of outlier frames
- MOS clamping: All composite scores are clamped to [1, 5] via
trim_mos() - DNSMOS padding: Audio shorter than 9.01 seconds is padded by self-repetition
- NaN handling: LLR computation uses
np.nan_to_num()to handle edge cases where the ratio computation produces NaN values - Negative STOI in SpeechBrain: The
stoi_lossfunction returns negative STOI (for use as a minimization loss), so it must be negated when reporting
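The alpha-trimming step can be illustrated with synthetic frame scores (the data here is invented for demonstration; alpha = 0.95 matches eval_composite):

```python
import numpy as np

# Sort per-frame scores ascending, keep the lowest 95%, then average.
alpha = 0.95
frame_scores = np.concatenate([np.ones(19), [50.0]])  # one outlier frame
kept = sorted(frame_scores)[: int(round(len(frame_scores) * alpha))]
print(np.mean(kept))  # 1.0 -- the outlier frame is dropped
```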
See Also
- Principle:Speechbrain_Speechbrain_Perceptual_Quality_Evaluation -- The theoretical foundation for perceptual evaluation
- Implementation:Speechbrain_Speechbrain_MetricGanBrain_Fit_Batch -- How PESQ is used as a discriminator training target
- Implementation:Speechbrain_Speechbrain_SEBrain_Compute_Forward -- How metrics are computed during conventional training validation