
Principle:Facebookresearch Audiocraft Compression Training Execution

From Leeroopedia
Last Updated 2026-02-13 00:00 GMT

Overview

A training procedure for neural audio codecs that combines multiple loss signals -- reconstruction losses (multi-resolution STFT, multi-scale mel spectrogram), adversarial losses (from a Multi-Scale STFT discriminator), commitment losses (from the residual vector quantizer, RVQ), and feature matching losses -- through a gradient balancer that normalizes competing gradient magnitudes. This approach enables stable end-to-end training of the encoder-decoder-quantizer system.

Description

Training an EnCodec model involves optimizing two sets of parameters simultaneously:

  • Generator (encoder + quantizer + decoder) -- trained to minimize reconstruction error while fooling the discriminator.
  • Discriminator (MS-STFT discriminator) -- trained to distinguish real audio from reconstructed audio.

The training step proceeds as follows:

  1. Forward pass -- raw audio is passed through the full EnCodec model (encode, quantize, decode) to produce reconstructed audio and a commitment loss from the RVQ.
  2. Discriminator update -- with a configurable frequency (controlled by adversarial.every), the discriminator is trained on real vs. reconstructed audio pairs.
  3. Loss computation -- multiple losses are computed on the reconstructed output:
    • Auxiliary losses (e.g., multi-resolution STFT loss, multi-scale mel spectrogram loss) measure spectral reconstruction quality.
    • Adversarial losses measure how well the reconstruction fools the discriminator.
    • Feature matching losses measure similarity of intermediate discriminator features between real and fake.
    • Commitment loss (penalty) from the RVQ encourages the encoder output to stay close to codebook vectors.
  4. Gradient balancing -- the Balancer computes partial gradients of each loss with respect to the model output, rescales them according to desired weight ratios, and sums them. This prevents any single loss from dominating the gradient signal.
  5. Optimizer step -- after gradient clipping, the generator optimizer updates the model parameters.
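The loop above can be sketched with toy stand-in modules. This is an illustration only: the Linear layers stand in for the real encoder-RVQ-decoder and MS-STFT discriminator, a plain weighted sum replaces the Balancer, and the loss weights are made up rather than taken from any Audiocraft config.

```python
import torch
import torch.nn as nn

# Toy stand-ins: the real encoder->RVQ->decoder and MS-STFT
# discriminator are far larger; Linear layers keep the sketch runnable.
generator = nn.Linear(16, 16)
discriminator = nn.Linear(16, 1)
g_opt = torch.optim.Adam(generator.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=3e-4)

x = torch.randn(8, 16)                     # "real audio" batch

# 1. Forward pass: reconstruct; the RVQ would also return a commitment loss.
y = generator(x)
commit_loss = torch.zeros(())              # placeholder for the RVQ penalty

# 2. Discriminator update (every step here; gated by adversarial.every in practice).
d_loss = (torch.relu(1 - discriminator(x)).mean()
          + torch.relu(1 + discriminator(y.detach())).mean())  # hinge loss
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# 3. Generator losses on the reconstruction.
rec_loss = (y - x).abs().mean()            # stands in for STFT / mel losses
adv_loss = -discriminator(y).mean()        # hinge generator loss

# 4.-5. A plain weighted sum stands in for the Balancer; then clip and step.
total = 1.0 * rec_loss + 4.0 * adv_loss + commit_loss
g_opt.zero_grad(); total.backward()
torch.nn.utils.clip_grad_norm_(generator.parameters(), max_norm=1.0)
g_opt.step()
```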

Usage

This training procedure is specific to the compression task and is invoked by the Audiocraft training framework (via Dora) using a Hydra config that specifies the compression solver. It is not used for training language models like MusicGen, which use the MusicGenSolver instead.

The compression solver is launched via:

dora run solver=compression/encodec_base_24khz

Theoretical Basis

GAN-Based Audio Codec Training

The training strategy follows the GAN framework for audio generation introduced by MelGAN and adopted by SoundStream and EnCodec. The generator (codec) and discriminator are trained alternately, with the discriminator providing a learned perceptual loss that captures audio quality aspects beyond what hand-crafted spectral losses can measure.

Gradient Balancing

A key challenge in multi-loss training is that different losses can have vastly different gradient magnitudes, causing one loss to dominate the optimization. The Balancer addresses this by normalizing gradients:

Given losses L_1, L_2, ..., L_n with desired weight ratios w_1, w_2, ..., w_n:

1. Compute partial gradients:  g_i = d L_i / d y   (y = model output)
2. Compute EMA of gradient norms:  avg_i = EMA(||g_i||)
3. Compute balanced gradient:

   G = sum_i [ total_norm * g_i / avg_i * w_i / sum(w_j) ]

4. Backpropagate G through the model.

This ensures that each loss contributes a fraction of the total gradient proportional to its weight, regardless of its intrinsic scale. The EMA smoothing prevents oscillations from batch-to-batch gradient norm variation.
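The scheme above can be sketched directly from the formula. This is a minimal illustration of the idea, not Audiocraft's actual Balancer class; the class name, defaults, and toy model are assumptions.

```python
import torch

class SimpleBalancer:
    """Illustrative gradient balancer (not Audiocraft's Balancer class)."""

    def __init__(self, weights, total_norm=1.0, ema_decay=0.999, eps=1e-12):
        self.weights = weights            # desired ratios w_i per loss name
        self.total_norm = total_norm
        self.decay = ema_decay
        self.eps = eps
        self.avg = {}                     # EMA of ||g_i|| per loss

    def backward(self, losses, y):
        """Backpropagate the balanced gradient G through y (the model output)."""
        w_sum = sum(self.weights.values())
        grad = torch.zeros_like(y)
        for name, loss in losses.items():
            # g_i = dL_i/dy, computed without touching model parameters yet
            (g,) = torch.autograd.grad(loss, y, retain_graph=True)
            norm = g.norm().item()
            self.avg[name] = (self.decay * self.avg.get(name, norm)
                              + (1 - self.decay) * norm)
            scale = (self.total_norm * self.weights[name]
                     / (w_sum * (self.avg[name] + self.eps)))
            grad += scale * g
        y.backward(grad)                  # one backward pass into the model

# Usage on a toy model: two losses of very different intrinsic scales.
model = torch.nn.Linear(4, 4)
y = model(torch.randn(8, 4))
losses = {"l1": y.abs().mean(), "big": 1e4 * y.pow(2).mean()}
SimpleBalancer({"l1": 1.0, "big": 1.0}).backward(losses, y)
```

Because each g_i is divided by the EMA of its own norm before weighting, the "big" loss cannot swamp the "l1" loss despite being four orders of magnitude larger.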

Adversarial Training with MS-STFT Discriminator

The Multi-Scale STFT Discriminator operates on the complex STFT representation of audio at multiple resolutions (different FFT sizes, hop lengths, and window lengths). Each sub-discriminator produces logits and intermediate feature maps. The adversarial loss encourages realistic spectral structure, while the feature matching loss enforces similarity at intermediate representation levels.
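The multi-resolution front end such a discriminator consumes can be sketched as follows. The (n_fft, hop_length) pairs are illustrative, not taken from any Audiocraft config, and the final L1 comparison only illustrates the shape of a feature matching computation: the real loss compares discriminator hidden features, not raw STFTs.

```python
import torch

# Illustrative scales; real configs choose their own (n_fft, hop_length) pairs.
scales = [(512, 128), (1024, 256), (2048, 512)]

def multi_scale_stfts(wav):
    """Complex STFTs at several resolutions, real/imag stacked as a last dim."""
    specs = []
    for n_fft, hop in scales:
        spec = torch.stft(wav, n_fft=n_fft, hop_length=hop,
                          window=torch.hann_window(n_fft),
                          return_complex=True)
        specs.append(torch.view_as_real(spec))  # (batch, freq, frames, 2)
    return specs

wav = torch.randn(2, 24000)        # batch of 1-second audio at 24 kHz
specs = multi_scale_stfts(wav)

# Feature-matching-shaped L1 across scales (on raw STFTs, for illustration).
fake = torch.randn(2, 24000)
fm = sum((a - b).abs().mean()
         for a, b in zip(specs, multi_scale_stfts(fake)))
```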
