Principle: facebookresearch/audiocraft - Audio Generation Evaluation
Overview
Audio Generation Evaluation concerns the objective measurement of generated audio quality in the MusicGen training pipeline. Because audio generation quality is inherently subjective and multi-dimensional, evaluation relies on a suite of complementary metrics that capture different aspects: distributional similarity (how similar the statistics of generated audio are to real audio), classifier-based acoustic similarity (whether a classifier produces similar predictions on generated and reference audio), and text-audio alignment (how well the generated audio matches its text description). These metrics are computed periodically during training to monitor progress and select the best model checkpoint.
Theoretical Foundations
Frechet Audio Distance (FAD)
The Frechet Audio Distance is the audio analog of the Frechet Inception Distance (FID) used in image generation. It measures the distributional similarity between generated and reference audio:
- Both generated and reference audio samples are passed through a pretrained audio classifier (VGGish, trained on AudioSet) to extract embedding vectors.
- The embeddings for each set are modeled as multivariate Gaussian distributions (characterized by mean and covariance).
- The Frechet distance between the two Gaussians is computed:
FAD = ||mu_gen - mu_ref||^2 + Tr(Sigma_gen + Sigma_ref - 2 * (Sigma_gen Sigma_ref)^(1/2))
where (Sigma_gen Sigma_ref)^(1/2) denotes the matrix square root of the product of the two covariance matrices.
A lower FAD indicates that the generated audio distribution is closer to the reference distribution. FAD captures overall acoustic quality and diversity but does not measure text alignment.
References: D.C. Dowson & B.V. Landau, "The Frechet distance between multivariate normal distributions" (1982); Kilgour et al., "Frechet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms" (2019).
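The Gaussian-fitting and distance computation above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration that assumes embedding vectors (e.g. from VGGish) have already been extracted; it is not audiocraft's actual implementation:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_gen: np.ndarray, emb_ref: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two embedding sets.

    emb_gen, emb_ref: arrays of shape (n_samples, dim), e.g. classifier
    embeddings of generated and reference audio.
    """
    mu_gen, mu_ref = emb_gen.mean(axis=0), emb_ref.mean(axis=0)
    sigma_gen = np.cov(emb_gen, rowvar=False)
    sigma_ref = np.cov(emb_ref, rowvar=False)

    diff = mu_gen - mu_ref
    # Matrix square root of the covariance product.
    covmean, _ = linalg.sqrtm(sigma_gen @ sigma_ref, disp=False)
    # Numerical error can introduce tiny imaginary components; keep the real part.
    covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_gen + sigma_ref - 2.0 * covmean))
```

Taking the real part of the matrix square root is a common numerical guard: the covariance product can acquire small complex perturbations in floating point even though the true result is real.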
KL Divergence on Audio Classifiers (KLD)
KL Divergence measures how well a generated audio sample matches its corresponding reference in terms of classifier predictions:
- Both generated and reference audio are passed through a pretrained audio classifier (PaSST, trained on AudioSet) to obtain class probability distributions.
- The KL divergence between the prediction distributions is computed:
KLD = sum_c p_ref(c) * log(p_ref(c) / p_gen(c)), summed over the classifier's classes c
A lower KLD indicates that the classifier produces similar predictions for generated and reference audio. This metric is computed per-sample and averaged, capturing acoustic similarity at the individual level rather than the distributional level.
Reference: Koutini et al., "Efficient Training of Audio Transformers with Patchout" (2021).
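The per-sample KL computation can be sketched as follows. This is a minimal NumPy sketch assuming class-probability arrays (e.g. PaSST predictions) are already available; the clipping epsilon is an illustrative guard against log(0), not a value taken from audiocraft:

```python
import numpy as np

def kld_per_sample(p_ref: np.ndarray, p_gen: np.ndarray,
                   eps: float = 1e-8) -> np.ndarray:
    """KL(p_ref || p_gen) for each sample.

    p_ref, p_gen: arrays of shape (n_samples, n_classes) whose rows are
    classifier probability distributions over reference and generated audio.
    """
    p_ref = np.clip(p_ref, eps, 1.0)
    p_gen = np.clip(p_gen, eps, 1.0)
    return np.sum(p_ref * np.log(p_ref / p_gen), axis=-1)

# The reported metric is the mean over the evaluation set:
# kld = kld_per_sample(p_ref, p_gen).mean()
```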
CLAP Text Consistency
CLAP (Contrastive Language-Audio Pretraining) measures the alignment between generated audio and its text description:
- Text descriptions are encoded using CLAP's text encoder to produce text embeddings.
- Generated audio is encoded using CLAP's audio encoder to produce audio embeddings.
- Cosine similarity between paired text and audio embeddings is computed.
A higher CLAP score indicates better text-audio alignment. This metric directly evaluates whether the model generates audio that matches its conditioning text.
Reference: Similar to MuLan Cycle Consistency in MusicLM (Agostinelli et al., 2023) and CLAP score in Make-An-Audio (Huang et al., 2023).
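The pairwise cosine-similarity step can be sketched as below, assuming text and audio embeddings have already been produced by CLAP's two encoders (the function name is illustrative, not audiocraft's API):

```python
import numpy as np

def clap_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Mean cosine similarity between paired text and audio embeddings.

    text_emb, audio_emb: arrays of shape (n_samples, dim) where row i of
    each array forms a matched text/audio pair.
    """
    # Normalize each embedding to unit length, then take row-wise dot products.
    t = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=-1, keepdims=True)
    return float(np.sum(t * a, axis=-1).mean())
```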
Key Principles
- Complementary metrics -- No single metric captures all aspects of audio quality. FAD measures distributional quality, KLD measures acoustic similarity to references, and CLAP measures text alignment. All three should be monitored.
- Periodic evaluation -- Evaluation is expensive (requires generating audio samples and running multiple pretrained classifiers). It is therefore run periodically (e.g., every 25 epochs) rather than every epoch.
- Best state selection -- The best model checkpoint is selected based on a primary metric (typically cross-entropy on validation). Evaluation metrics provide complementary signals for understanding model behavior.
- Reference-based and reference-free -- FAD and KLD require reference audio (the ground truth). CLAP requires text descriptions but evaluates generated audio directly against text, not against reference audio.
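The periodic-evaluation principle above amounts to a simple epoch gate. A minimal sketch, where the function name and default interval are illustrative rather than taken from audiocraft's configuration:

```python
def should_evaluate(epoch: int, every: int = 25) -> bool:
    """Return True on epochs where the (expensive) evaluation stage should run."""
    return epoch > 0 and epoch % every == 0
```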
Evaluation Workflow
During the evaluate stage of MusicGen training:
- The solver generates audio samples from the evaluation dataset's metadata using the current best model state.
- Generated audio and reference audio are passed to each enabled metric.
- Metrics accumulate statistics across the evaluation set.
- Final scores are computed and logged.
The evaluation metrics can optionally be run on compressed reference audio (reference audio encoded and then decoded through the codec) to measure the quality ceiling imposed by the tokenizer's reconstruction capability.
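The workflow above follows a generic accumulate-then-compute pattern. A hedged sketch of that loop, in which the model, data loader, and metric interfaces are hypothetical illustrations rather than audiocraft's actual classes:

```python
def run_evaluation(model, eval_loader, metrics):
    """Generate audio for each batch, feed generated/reference pairs (plus
    text descriptions) to every enabled metric, then finalize the scores."""
    for batch in eval_loader:
        # Generate audio from the evaluation set's text metadata.
        generated = model.generate(batch["descriptions"])
        for metric in metrics.values():
            # Each metric accumulates statistics across the evaluation set.
            metric.update(generated, batch["reference_audio"], batch["descriptions"])
    # Compute final scores once the whole set has been seen.
    return {name: metric.compute() for name, metric in metrics.items()}
```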
Role in the MusicGen Training Pipeline
Evaluation is an outer loop stage that runs periodically during training. It depends on:
- A trained model (from the training execution stage)
- The frozen tokenizer (for generating audio from tokens)
- External pretrained models (VGGish for FAD, PaSST for KLD, CLAP for text consistency)
Evaluation results are logged to TensorBoard/wandb and stored in the experiment history for comparison across runs.