
Workflow:Facebookresearch Audiocraft EnCodec Compression Training

From Leeroopedia
Revision as of 11:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Facebookresearch_Audiocraft_EnCodec_Compression_Training.md)
Knowledge Sources
Domains: Audio_Compression, Neural_Codec, Model_Training
Last Updated: 2026-02-13 23:00 GMT

Overview

End-to-end process for training an EnCodec neural audio compression model using the AudioCraft CompressionSolver with adversarial and reconstruction losses.

Description

This workflow covers training an encoder-decoder audio compression model with a Residual Vector Quantization (RVQ) bottleneck. The model uses a SEANet encoder-decoder architecture and is trained with reconstruction losses (multi-scale STFT, mel spectrogram) and perceptual adversarial losses (MS-STFT discriminator), combined through a gradient-based loss balancer. The trained EnCodec model serves as the audio tokenizer for downstream generation models (MusicGen, AudioGen, MAGNeT, JASCO).

Usage

Execute this workflow when you need to train a custom audio tokenizer, whether to support a new sample rate, to target a different audio domain, or to improve compression quality for your specific use case. The resulting model converts audio waveforms to discrete token sequences and back, which enables language model training over discrete audio tokens.

Execution Steps

Step 1: Environment Setup

Configure the AudioCraft training environment with appropriate Dora output directories, team settings, and cluster configuration. Ensure persistent storage is configured for checkpoints since the default /tmp/ path is unsuitable for real training.

Key considerations:

  • Same environment setup as any AudioCraft training task
  • Set AUDIOCRAFT_TEAM and AUDIOCRAFT_DORA_DIR environment variables
  • Ensure sufficient GPU memory for training with discriminators
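
As a minimal sketch of the variables above, the following Python helper sets them for the current process before training is launched. The team name `default` and the directory `/checkpoints/audiocraft` are placeholder assumptions, and the `/tmp` check is a hypothetical guard added for illustration, not something AudioCraft enforces.

```python
import os
from pathlib import Path


def configure_audiocraft_env(team: str, dora_dir: str) -> dict:
    """Set the two AudioCraft variables named above for the current process."""
    resolved = Path(dora_dir).expanduser()
    # Hypothetical guard: refuse obviously ephemeral locations such as /tmp.
    if resolved.parts[:2] == (os.sep, "tmp"):
        raise ValueError(f"{resolved} is not persistent storage")
    os.environ["AUDIOCRAFT_TEAM"] = team
    os.environ["AUDIOCRAFT_DORA_DIR"] = str(resolved)
    return {k: os.environ[k] for k in ("AUDIOCRAFT_TEAM", "AUDIOCRAFT_DORA_DIR")}


# Placeholder values; substitute your own team name and storage path.
print(configure_audiocraft_env("default", "/checkpoints/audiocraft"))
```

Child processes started afterwards from the same Python process inherit these values; exporting the variables in the shell achieves the same result.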

Step 2: Prepare Audio Dataset

Organize audio data with manifest files in AudioCraft's expected format. The compression task uses AudioDataset without music-specific metadata requirements, making dataset preparation simpler than for MusicGen.

Dataset format:

  • Audio files in WAV or another supported format
  • Manifest files (JSONL) listing audio file paths and basic metadata
  • Configure dataset YAML with paths for train, valid, evaluate, and generate splits
  • Set target sample rate (24 kHz for base EnCodec, 32 kHz for MusicGen's EnCodec)
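
A manifest of the kind described above can be sketched with the standard library alone. The field names `path`, `duration`, and `sample_rate` are assumptions about what "basic metadata" means here and should be checked against the installed AudioCraft version; the helper names are hypothetical, and the sketch only handles PCM WAV files.

```python
import json
import wave
from pathlib import Path


def manifest_entry(wav_path: Path) -> dict:
    """Read basic metadata from a PCM WAV file using only the standard library."""
    with wave.open(str(wav_path), "rb") as wav:
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    # Assumed field names; verify them against your AudioCraft version.
    return {"path": str(wav_path), "duration": duration, "sample_rate": sample_rate}


def write_manifest(wav_paths, out_path: Path) -> int:
    """Write one JSON object per line (JSONL) and return the number of entries."""
    entries = [manifest_entry(Path(p)) for p in wav_paths]
    with open(out_path, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")
    return len(entries)
```

One manifest would typically be written per split (train, valid, evaluate, generate) and referenced from the dataset YAML.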

Step 3: Configure Model Architecture

Select the EnCodec architecture variant and configure the SEANet encoder-decoder parameters, quantizer settings, and training objectives. Key architecture choices include the number of codebooks, codebook size, and whether to use causal convolutions.

Architecture options:

  • Base causal: 32 quantizers with dropout (encodec_base_causal)
  • Large nq4 s320: 4 codebooks, 2048 bins, stride 320 (used by the AudioGen 16 kHz recipe)
  • Large nq4 s640: 4 codebooks, 2048 bins, stride 640 (used by the MusicGen 32 kHz recipe; higher compression at a given sample rate)
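
The effect of these options on the token stream follows from simple arithmetic: the frame rate is the sample rate divided by the stride, and each codebook index costs log2(bins) bits. The sketch below works this out; the function name is hypothetical and the sample-rate and stride pairings are illustrative assumptions.

```python
import math


def codec_rates(sample_rate: int, stride: int, n_codebooks: int, bins: int) -> dict:
    """Derive frame rate, token rate, and bitrate from the architecture parameters."""
    frame_rate = sample_rate / stride  # latent frames per second
    bits_per_frame = n_codebooks * math.log2(bins)  # log2(bins) bits per codebook index
    return {
        "frame_rate_hz": frame_rate,
        "tokens_per_second": frame_rate * n_codebooks,
        "bitrate_kbps": frame_rate * bits_per_frame / 1000,
    }


# Illustrative pairings (assumed): 16 kHz with stride 320, 32 kHz with stride 640.
print(codec_rates(16_000, 320, 4, 2048))
print(codec_rates(32_000, 640, 4, 2048))
```

Both pairings yield 50 frames per second and 2.2 kbps, so a larger stride raises the compression ratio only when the sample rate is held fixed.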

Loss configuration:

  • Reconstruction: multi-scale STFT loss, mel spectrogram L1 loss
  • Adversarial: MS-STFT discriminator with hinge or least-squares loss
  • Balancer: gradient-based loss weighting from the EnCodec paper
  • Optional: SI-SNR, loudness loss
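
The idea behind the balancer can be sketched in plain Python: each loss's gradient with respect to the model output is rescaled so that its contribution depends on its configured weight rather than on its raw magnitude. This is a simplified illustration, not AudioCraft's implementation: it uses the current gradient norm where the EnCodec paper describes a moving average of norms, and the loss names and weights are made up.

```python
import math


def balance_gradients(grads: dict, weights: dict,
                      total_norm: float = 1.0, eps: float = 1e-12) -> list:
    """Combine per-loss gradients so each contributes its weight's share of total_norm."""
    total_weight = sum(weights[name] for name in grads)
    size = len(next(iter(grads.values())))
    combined = [0.0] * size
    for name, grad in grads.items():
        norm = math.sqrt(sum(g * g for g in grad))
        # After rescaling, this loss contributes a gradient of norm
        # total_norm * weight / total_weight, whatever its original scale.
        scale = total_norm * weights[name] / total_weight / (eps + norm)
        for i, g in enumerate(grad):
            combined[i] += scale * g
    return combined


# Hypothetical example: a large reconstruction gradient and a tiny adversarial one.
grads = {"msstft": [3.0, 4.0], "adv": [0.0, 0.002]}
weights = {"msstft": 1.0, "adv": 3.0}
print(balance_gradients(grads, weights))
```

With these inputs the adversarial term ends up with three times the influence of the reconstruction term, even though its raw gradient is far smaller.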

Step 4: Launch Training

Start the compression training using Dora with the selected solver configuration. The CompressionSolver manages the training loop with alternating generator and discriminator updates, loss balancing, and periodic evaluation.

Launch commands:

  • Base 24 kHz: dora grid compression.encodec_base_24khz
  • MusicGen 32 kHz: dora grid compression.encodec_musicgen_32khz
  • AudioGen 16 kHz: dora grid compression.encodec_audiogen_16khz
  • Debug: dora run solver=compression/debug
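
When launches are scripted, the commands above can be kept as argument lists and handed to `subprocess`. The sketch below only builds and prints a command; the dictionary keys are labels chosen for this example, and the commands themselves are copied from the list above.

```python
import shlex

# Commands exactly as listed above; the keys are labels invented for this sketch.
LAUNCH_COMMANDS = {
    "base_24khz": ["dora", "grid", "compression.encodec_base_24khz"],
    "musicgen_32khz": ["dora", "grid", "compression.encodec_musicgen_32khz"],
    "audiogen_16khz": ["dora", "grid", "compression.encodec_audiogen_16khz"],
    "debug": ["dora", "run", "solver=compression/debug"],
}


def launch_command(name: str) -> list:
    """Return a copy of the argument list for one of the recipes above."""
    return list(LAUNCH_COMMANDS[name])


print(shlex.join(launch_command("debug")))
# To actually launch, pass the list to subprocess.run(..., check=True).
```

Passing an argument list rather than a single string avoids shell quoting issues if overrides are appended later.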

Training loop:

  • Forward pass through encoder, quantizer, and decoder
  • Compute reconstruction losses on output vs. input
  • Discriminator forward and backward pass
  • Generator adversarial loss computation
  • Gradient balancing across all loss components
  • EMA weight updates
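
The discriminator and generator terms in this loop can be illustrated with one common textbook form of the hinge adversarial loss. This is a sketch on plain lists of discriminator outputs, assuming the hinge variant is selected; AudioCraft's own loss functions may differ in detail, and the function names are hypothetical.

```python
def hinge_discriminator_loss(real_logits, fake_logits) -> float:
    """Push scores for real audio above +1 and scores for reconstructions below -1."""
    real_term = sum(max(0.0, 1.0 - r) for r in real_logits) / len(real_logits)
    fake_term = sum(max(0.0, 1.0 + f) for f in fake_logits) / len(fake_logits)
    return real_term + fake_term


def hinge_generator_loss(fake_logits) -> float:
    """Reward the generator when the discriminator scores its output highly."""
    return -sum(fake_logits) / len(fake_logits)


# Made-up discriminator outputs for two real and two reconstructed clips.
real, fake = [2.0, 0.5], [-2.0, 0.0]
print(hinge_discriminator_loss(real, fake), hinge_generator_loss(fake))
```

In an alternating scheme the discriminator loss is minimized with the generator output held fixed, and the generator loss is then evaluated with the updated discriminator.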

Step 5: Evaluate Compression Quality

Assess the trained model using objective audio quality metrics. The evaluation stage computes SI-SNR (scale-invariant signal-to-noise ratio) and optionally ViSQOL (perceptual quality) on held-out data.

Evaluation metrics:

  • SI-SNR: measures waveform reconstruction fidelity in the time domain, independent of overall gain
  • ViSQOL: perceptual quality score (requires external binary)
  • Generation stage: produces reconstructed audio samples for listening
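
For reference, SI-SNR can be computed from its standard definition: project the zero-mean estimate onto the zero-mean reference, then compare the energy of that projection with the energy of the residual. The pure-Python sketch below follows that definition; it is an illustration and not the metric implementation used by the solver.

```python
import math


def si_snr(reference, estimate, eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between two equal-length sample sequences."""
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    n = len(reference)
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [x - ref_mean for x in reference]
    est = [x - est_mean for x in estimate]
    # Project the estimate onto the reference to find the matching component.
    ref_energy = sum(x * x for x in ref) + eps
    scale = sum(r * e for r, e in zip(ref, est)) / ref_energy
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


print(si_snr([1.0, -1.0, 1.0, -1.0], [1.0, -1.0, 1.0, 1.0]))
```

Multiplying the estimate by a constant gain leaves the result unchanged, which is why the metric is described as scale-invariant.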

Key considerations:

  • ViSQOL requires a separately compiled binary
  • Evaluation runs periodically during training (configurable frequency)
  • Generated samples can be compared using the MOS listening tool

Step 6: Export Trained Model

Export the trained EnCodec checkpoint to a lightweight format suitable for use as a tokenizer in downstream generation models. The export strips optimizer state and training metadata, keeping only the model weights and configuration.

Export process:

  • Use audiocraft.utils.export.export_encodec() to create a lightweight checkpoint
  • The exported file contains best_state model weights and xp.cfg configuration
  • Load exported model via CompressionModel.get_pretrained() or CompressionSolver.model_from_checkpoint()
  • Reference the exported model in MusicGen training via compression_model_checkpoint parameter
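
Conceptually, the export keeps two entries and drops everything else. The sketch below shows that reduction on a plain dictionary. It is not the implementation of export_encodec(): the input layout (a `best_state` entry holding `model` alongside other state, plus `xp.cfg`) is an assumption made for illustration, and `strip_checkpoint` is a hypothetical name.

```python
def strip_checkpoint(checkpoint: dict) -> dict:
    """Keep only model weights and configuration, as in the exported layout above."""
    return {
        "best_state": checkpoint["best_state"]["model"],
        "xp.cfg": checkpoint["xp.cfg"],
    }


# Stand-in for a training checkpoint; real checkpoints hold tensors, not lists.
training_checkpoint = {
    "best_state": {"model": {"encoder.weight": [0.1, 0.2]}},
    "optimizer": {"step": 1000},
    "xp.cfg": "sample_rate: 32000",
}
print(sorted(strip_checkpoint(training_checkpoint)))
```

Dropping optimizer state and other training-only entries is what makes the exported file smaller than the original checkpoint.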
