
Principle:Facebookresearch Audiocraft Compression Quality Evaluation

From Leeroopedia
Metadata
Last Updated 2026-02-13 00:00 GMT

Overview

This principle covers evaluating audio compression quality with complementary perceptual and signal-level metrics. ViSQOL (Virtual Speech Quality Objective Listener) provides a perceptual quality score that correlates with human Mean Opinion Scores (MOS), while SI-SNR (Scale-Invariant Signal-to-Noise Ratio) measures signal-level reconstruction fidelity. Together, they assess whether a neural audio codec preserves both perceptual quality and waveform accuracy.

Description

After training an EnCodec model, the CompressionSolver.evaluate() method runs audio through the codec and measures reconstruction quality using two complementary metrics:

  • ViSQOL (MOS-LQO) -- a perceptual quality metric developed by Google that compares reference and degraded audio signals. It operates on the spectro-temporal representation of audio and uses a support vector regression model to predict a Mean Opinion Score - Listening Quality Objective (MOS-LQO) on a 1-5 scale. ViSQOL supports two modes:
    • Audio mode -- expects 48 kHz input, uses SVR for quality prediction, scores top out at about 4.75
    • Speech mode -- expects 16 kHz input, includes voice activity detection, scores are scaled to a maximum MOS of 5.0
  • SI-SNR -- a scale-invariant signal-to-noise ratio that measures reconstruction accuracy at the waveform level. It projects the estimated signal onto the reference signal and measures the ratio of signal energy to noise (residual) energy. Unlike standard SNR, SI-SNR is invariant to the overall scale of the signals, focusing purely on waveform shape fidelity.

Usage

These metrics are specific to audio compression evaluation and are invoked during the evaluation stage of the CompressionSolver. They are not used in the MusicGen or AudioGen training workflows, which rely on different evaluation criteria (e.g., FAD, CLAP score).

The evaluation is triggered automatically during training: while CompressionSolver trains, evaluation runs at configured intervals and reports both the visqol and sisnr metrics.

Theoretical Basis

ViSQOL Perceptual Quality Model

ViSQOL (Chinen et al., 2020) is a full-reference audio quality metric: it needs both the original and the degraded signal. It aligns and compares the two using the Neurogram Similarity Index Measure (NSIM) computed on gammatone filterbank representations, then maps the similarity to a MOS-LQO score via a trained support vector regression (SVR) model.

ViSQOL Pipeline:
    1. Resample to target sample rate (48kHz for audio, 16kHz for speech)
    2. Compute spectro-temporal representation (gammatone filterbank)
    3. Align reference and degraded representations using NSIM patches
    4. Aggregate patch similarities across time
    5. Map aggregated similarity to MOS-LQO via SVR model

Output: MOS-LQO score on a nominal [1.0, 5.0] scale (audio mode tops out near 4.75)
    1.0 = very poor quality
    5.0 = excellent (transparent) quality

ViSQOL is an extrinsic metric -- it compares two complete audio signals and requires no access to the compression model internals. Higher scores indicate better perceptual quality.
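
To make the patch-similarity step of the pipeline more concrete, here is a rough sketch of an NSIM-style score (NSIM being the Neurogram Similarity Index Measure) in dependency-free Python. It is an illustration only, not ViSQOL's code: the function name `nsim_global`, the `dynamic_range` parameter, and the constants `c1` and `c3` are assumptions borrowed from the common SSIM convention, and the statistics are taken over the whole patch rather than over the local windows ViSQOL applies to gammatone spectrogram patches.

```python
import math


def nsim_global(ref_patch, deg_patch, dynamic_range=1.0):
    """Simplified NSIM-style similarity using whole-patch statistics.

    Patches are equal-length flat sequences of spectrogram magnitudes.
    Hypothetical helper: the constants follow the usual SSIM convention
    and are NOT values taken from ViSQOL.
    """
    if len(ref_patch) != len(deg_patch) or not ref_patch:
        raise ValueError("patches must be non-empty and of equal length")

    n = len(ref_patch)
    c1 = (0.01 * dynamic_range) ** 2          # stabilises the intensity term
    c3 = ((0.03 * dynamic_range) ** 2) / 2.0  # stabilises the structure term

    mu_r = sum(ref_patch) / n
    mu_d = sum(deg_patch) / n
    var_r = sum((x - mu_r) ** 2 for x in ref_patch) / n
    var_d = sum((y - mu_d) ** 2 for y in deg_patch) / n
    cov = sum((x - mu_r) * (y - mu_d)
              for x, y in zip(ref_patch, deg_patch)) / n

    # Intensity term compares mean levels; structure term compares
    # how the two patches co-vary around their means.
    intensity = (2.0 * mu_r * mu_d + c1) / (mu_r ** 2 + mu_d ** 2 + c1)
    structure = (cov + c3) / (math.sqrt(var_r * var_d) + c3)
    return intensity * structure
```

Identical patches score 1.0, and any mismatch in level or structure pulls the score below that. In the real pipeline, per-patch scores like this are aggregated over time before the SVR mapping to MOS-LQO.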

Scale-Invariant Signal-to-Noise Ratio

SI-SNR (Le Roux et al., 2019) measures waveform reconstruction fidelity in a scale-invariant manner. Given reference signal s and estimate s_hat:

SI-SNR Computation:
    1. Center both signals:    s' = s - mean(s),  s_hat' = s_hat - mean(s_hat)
    2. Project estimate onto reference:
        s_target = (dot(s_hat', s') / ||s'||^2) * s'
    3. Compute noise:
        e_noise = s_hat' - s_target
    4. Compute ratio:
        SI-SNR = 10 * log10(||s_target||^2 / ||e_noise||^2)    (dB)
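
The four steps above can be sketched in plain Python as follows. This is a minimal, dependency-free illustration rather than Audiocraft's implementation: the function name `si_snr_db` and the small stabilising constant `eps` are assumptions added here, and a practical version would operate on batched tensors.

```python
import math


def si_snr_db(reference, estimate, eps=1e-8):
    """Scale-invariant SNR in dB for two equal-length sample sequences.

    `eps` is an assumed stabiliser that avoids division by zero and
    log(0); it is not part of the formula itself.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("signals must be non-empty and of equal length")

    n = len(reference)

    # Step 1: center both signals.
    mean_ref = sum(reference) / n
    mean_est = sum(estimate) / n
    s = [x - mean_ref for x in reference]
    s_hat = [x - mean_est for x in estimate]

    # Step 2: project the estimate onto the reference.
    dot = sum(a * b for a, b in zip(s_hat, s))
    ref_energy = sum(a * a for a in s)
    scale = dot / (ref_energy + eps)
    s_target = [scale * a for a in s]

    # Step 3: the residual is whatever the projection does not explain.
    e_noise = [a - b for a, b in zip(s_hat, s_target)]

    # Step 4: energy ratio in decibels.
    target_energy = sum(a * a for a in s_target)
    noise_energy = sum(a * a for a in e_noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

As a sanity check, a unit-amplitude sinusoid corrupted by an orthogonal sinusoid at one tenth of its amplitude has a power ratio of 100, so the function returns about 20 dB. Multiplying the estimate by any nonzero constant leaves the result essentially unchanged, which is the scale invariance the metric is named for.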

In Audiocraft's implementation, SI-SNR is negated (multiplied by -1) so that it can serve as a loss function during training -- lower values indicate better reconstruction. When reported as an evaluation metric, more negative values indicate higher reconstruction quality.

The implementation also supports segmented evaluation: long audio is split into overlapping frames (configurable segment length and overlap), and SI-SNR is computed per frame and averaged. Averaging per-frame values weights every segment equally, so no single passage dominates the metric.
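
A hedged sketch of the segmented, negated variant described above, in dependency-free Python. Everything named here is hypothetical: `segmented_si_snr_loss`, the `_si_snr_db` helper, and `eps` are illustrative, the segment length is given in samples rather than seconds, and trailing samples that do not fill a whole frame are simply dropped. Audiocraft's own framing and padding may differ.

```python
import math


def _si_snr_db(ref, est, eps=1e-8):
    """Compact SI-SNR in dB for one frame (hypothetical helper)."""
    n = len(ref)
    mean_ref, mean_est = sum(ref) / n, sum(est) / n
    s = [x - mean_ref for x in ref]
    e = [x - mean_est for x in est]
    scale = sum(a * b for a, b in zip(e, s)) / (sum(a * a for a in s) + eps)
    target = [scale * a for a in s]
    target_energy = sum(a * a for a in target)
    noise_energy = sum((a - b) ** 2 for a, b in zip(e, target))
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


def segmented_si_snr_loss(reference, estimate, segment_len, overlap=0.5):
    """Negated mean of per-frame SI-SNR, so lower means better.

    Assumptions: `segment_len` is in samples, `overlap` is a fraction in
    [0, 1), and any tail shorter than one frame is ignored.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have equal length")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if segment_len <= 0 or segment_len > len(reference):
        raise ValueError("segment_len must be in (0, len(signal)]")

    # Hop between frame starts; 50% overlap means a hop of half a frame.
    hop = max(1, int(segment_len * (1.0 - overlap)))
    starts = range(0, len(reference) - segment_len + 1, hop)

    scores = [
        _si_snr_db(reference[i:i + segment_len], estimate[i:i + segment_len])
        for i in starts
    ]
    # Negate so the value can be minimised as a loss.
    return -sum(scores) / len(scores)
```

With this sign convention, a reconstruction at 40 dB per frame yields a loss of about -40 and one at 20 dB yields about -20, so the better reconstruction has the lower (more negative) value.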
