Workflow:Facebookresearch Audiocraft EnCodec Compression Training
| Knowledge Sources | |
|---|---|
| Domains | Audio_Compression, Neural_Codec, Model_Training |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
End-to-end process for training an EnCodec neural audio compression model with the AudioCraft CompressionSolver, using reconstruction and adversarial losses.
Description
This workflow covers training an encoder-decoder audio compression model with a Residual Vector Quantization (RVQ) bottleneck. The model uses a SEANet encoder-decoder architecture trained with reconstruction losses (multi-scale STFT, mel spectrogram) and perceptual adversarial losses (MS-STFT discriminator); a gradient-based loss balancer weights the contribution of each term. The trained EnCodec model serves as the foundational audio tokenizer for downstream generation models (MusicGen, AudioGen, MAGNeT, JASCO).
Usage
Execute this workflow when you need to train a custom audio tokenizer, whether to support a new sample rate, to cover a different audio domain, or to improve compression quality for your specific use case. The resulting model converts audio waveforms to discrete token sequences and back, which makes it possible to train language models over audio tokens.
Execution Steps
Step 1: Environment Setup
Configure the AudioCraft training environment with appropriate Dora output directories, team settings, and cluster configuration. Ensure persistent storage is configured for checkpoints since the default /tmp/ path is unsuitable for real training.
Key considerations:
- Same environment setup as any AudioCraft training task
- Set AUDIOCRAFT_TEAM and AUDIOCRAFT_DORA_DIR environment variables
- Ensure sufficient GPU memory for training with discriminators
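The two variables above can be set from Python before launching any Dora command. This is a minimal sketch, not an AudioCraft API: the helper name `configure_audiocraft_env`, the example path, and the team name `default` are assumptions for illustration.

```python
import os
from pathlib import Path


def configure_audiocraft_env(dora_dir: str, team: str = "default") -> dict:
    """Set AUDIOCRAFT_TEAM and AUDIOCRAFT_DORA_DIR for this process and its children.

    Hypothetical helper: it refuses paths under /tmp because checkpoints
    written there are unlikely to survive a long training run.
    """
    path = Path(dora_dir).expanduser()
    if path.parts[:2] == (os.sep, "tmp"):
        raise ValueError(f"{path} is under /tmp; choose persistent storage")
    env = {"AUDIOCRAFT_TEAM": team, "AUDIOCRAFT_DORA_DIR": str(path)}
    os.environ.update(env)
    return env


# Example path is an assumption; point it at your own persistent volume.
env = configure_audiocraft_env("/checkpoints/me/audiocraft")
```

Processes started afterwards from the same interpreter (for example through `subprocess`) inherit these variables.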
Step 2: Prepare Audio Dataset
Organize audio data with manifest files in AudioCraft's expected format. The compression task uses AudioDataset without music-specific metadata requirements, making dataset preparation simpler than for MusicGen.
Dataset format:
- Audio files in WAV or another supported format
- Manifest files (JSONL) listing audio file paths and basic metadata
- Configure the dataset YAML with paths for the train, valid, evaluate, and generate splits
- Set the target sample rate (24 kHz for base EnCodec, 32 kHz for MusicGen's EnCodec, 16 kHz for AudioGen's EnCodec)
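To make the manifest format concrete, here is a minimal sketch that scans a folder of WAV files and writes one JSON object per line. The field names (`path`, `duration`, `sample_rate`) are an assumption about what the loader expects, and `build_manifest` is a hypothetical helper; AudioCraft's own manifest tool (`python -m audiocraft.data.audio_dataset`) is the safer choice for real datasets.

```python
import json
import wave
from pathlib import Path


def build_manifest(audio_dir, manifest_path) -> int:
    """Write a JSONL manifest for every .wav file under `audio_dir`.

    Returns the number of entries written. Only WAV files readable by the
    standard-library `wave` module are handled in this sketch.
    """
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path in sorted(Path(audio_dir).rglob("*.wav")):
            with wave.open(str(wav_path), "rb") as wav:
                sample_rate = wav.getframerate()
                duration = wav.getnframes() / sample_rate
            entry = {
                "path": str(wav_path.resolve()),
                "duration": duration,
                "sample_rate": sample_rate,
            }
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

Each split (train, valid, evaluate, generate) would get its own manifest, and the dataset YAML points at the folders that contain them.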
Step 3: Configure Model Architecture
Select the EnCodec architecture variant and configure the SEANet encoder-decoder parameters, quantizer settings, and training objectives. Key architecture choices include the number of codebooks, codebook size, and whether to use causal convolutions.
Architecture options:
- Base causal: 32 quantizers with quantizer dropout (encodec_base_causal)
- Large nq4 s320: 4 codebooks, 2048 bins, stride 320 (50 Hz frame rate at 16 kHz, matching the AudioGen recipe)
- Large nq4 s640: 4 codebooks, 2048 bins, stride 640 (higher compression; 50 Hz frame rate at 32 kHz, matching the MusicGen recipe)
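The trade-off between these variants is easier to see as arithmetic: the stride fixes the token frame rate, and the number of codebooks times the bits per codebook fixes the bitrate. The sketch below works this out; `codec_rates` is a hypothetical helper, and pairing 32 kHz audio with stride 640 is an assumption made for the example.

```python
import math


def codec_rates(sample_rate: int, stride: int, n_codebooks: int, codebook_size: int):
    """Return (frame rate in Hz, bitrate in bits per second) of an RVQ codec."""
    frame_rate = sample_rate / stride
    bits_per_frame = n_codebooks * math.log2(codebook_size)
    return frame_rate, frame_rate * bits_per_frame


# 4 codebooks of 2048 entries (11 bits each) on 32 kHz audio with stride 640
frame_rate, bitrate = codec_rates(32_000, 640, 4, 2048)
print(frame_rate, bitrate)  # 50.0 2200.0
```

At a fixed sample rate, doubling the stride halves both the frame rate and the bitrate, which is why the s640 variant compresses more than s320 for the same input.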
Loss configuration:
- Reconstruction: multi-scale STFT loss, mel spectrogram L1 loss
- Adversarial: MS-STFT discriminator with hinge or least-squares loss
- Balancer: gradient-based loss weighting from the EnCodec paper
- Optional: SI-SNR, loudness loss
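The balancer idea can be sketched in a few lines: each loss's gradient with respect to the decoder output is rescaled so that its norm depends only on its configured weight, not on the raw magnitude of the loss. This is a simplified illustration in plain Python, not the AudioCraft implementation: it uses the instantaneous gradient norm where the EnCodec paper describes a moving average, and the loss names and weights are made up.

```python
import math
from typing import Dict, List


def balance_gradients(grads: Dict[str, List[float]],
                      weights: Dict[str, float],
                      total_norm: float = 1.0,
                      eps: float = 1e-12) -> List[float]:
    """Combine per-loss gradients so each contributes a weight-proportional share.

    After rescaling, the gradient of loss i has norm
    total_norm * weights[i] / sum(weights).
    """
    weight_sum = sum(weights[name] for name in grads)
    size = len(next(iter(grads.values())))
    combined = [0.0] * size
    for name, grad in grads.items():
        norm = math.sqrt(sum(g * g for g in grad))
        scale = total_norm * weights[name] / weight_sum / (norm + eps)
        for i, g in enumerate(grad):
            combined[i] += scale * g
    return combined


# Illustrative values: a large spectral gradient and a tiny adversarial one.
grads = {"msspec": [3.0, 4.0], "adv": [0.0, 0.002]}
weights = {"msspec": 2.0, "adv": 4.0}
balanced = balance_gradients(grads, weights)
```

In this example the adversarial gradient ends up with twice the norm of the spectral one, even though its raw magnitude is far smaller, because only the weights decide the split.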
Step 4: Launch Training
Start the compression training using Dora with the selected solver configuration. The CompressionSolver manages the training loop with alternating generator and discriminator updates, loss balancing, and periodic evaluation.
Launch commands:
- Base 24 kHz: dora grid compression.encodec_base_24khz
- MusicGen 32 kHz: dora grid compression.encodec_musicgen_32khz
- AudioGen 16 kHz: dora grid compression.encodec_audiogen_16khz
- Debug: dora run solver=compression/debug
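When launches are scripted, the debug command above can be assembled programmatically with Hydra-style `key=value` overrides appended. This is a small sketch: `dora_run_command` is a hypothetical helper and the dataset name `audio/my_dataset` is a placeholder, not a config that ships with AudioCraft.

```python
import shlex
from typing import List


def dora_run_command(solver: str, **overrides: str) -> List[str]:
    """Build the argv for `dora run` with optional key=value overrides."""
    args = ["dora", "run", f"solver={solver}"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


# `audio/my_dataset` is a hypothetical dataset config name.
cmd = dora_run_command("compression/debug", dset="audio/my_dataset")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the AudioCraft repository root once the environment is configured.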
Training loop:
- Forward pass through encoder, quantizer, and decoder
- Compute reconstruction losses on output vs. input
- Discriminator forward and backward pass
- Generator adversarial loss computation
- Gradient balancing across all loss components
- EMA weight updates
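The adversarial steps in the loop above can be illustrated with the standard hinge GAN formulation, applied here to plain lists of discriminator logits. This is a sketch of the general technique under that assumption; the exact loss variants and reductions used by the CompressionSolver may differ.

```python
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def discriminator_hinge_loss(real_logits: Sequence[float],
                             fake_logits: Sequence[float]) -> float:
    """Push logits for real audio above +1 and for reconstructions below -1."""
    real_term = _mean([max(0.0, 1.0 - r) for r in real_logits])
    fake_term = _mean([max(0.0, 1.0 + f) for f in fake_logits])
    return real_term + fake_term


def generator_hinge_loss(fake_logits: Sequence[float]) -> float:
    """The generator (codec) is rewarded when its output scores high."""
    return -_mean(fake_logits)


d_loss = discriminator_hinge_loss([2.0, 0.5], [-2.0, 0.0])  # 0.25 + 0.5
g_loss = generator_hinge_loss([-2.0, 0.0])
print(d_loss, g_loss)  # 0.75 1.0
```

The discriminator loss is minimized in its own optimizer step, while the generator loss is one of the terms handed to the balancer together with the reconstruction losses.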
Step 5: Evaluate Compression Quality
Assess the trained model using objective audio quality metrics. The evaluation stage computes SI-SNR (scale-invariant signal-to-noise ratio) and optionally ViSQOL (perceptual quality) on held-out data.
Evaluation metrics:
- SI-SNR: measures reconstruction fidelity in the waveform (time) domain, independent of overall gain
- ViSQOL: perceptual quality score (requires external binary)
- Generation stage: produces reconstructed audio samples for listening
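For reference, SI-SNR can be computed directly from two waveforms. The sketch below follows the usual definition (remove the mean, project the estimate onto the reference, compare the projected energy to the residual energy) in plain Python; it illustrates the metric and is not the evaluation code AudioCraft runs.

```python
import math
from typing import Sequence


def si_snr(reference: Sequence[float], estimate: Sequence[float],
           eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB (higher is better)."""
    n = len(reference)
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]
    # Project the estimate onto the reference to get the target component.
    dot = sum(r * e for r, e in zip(ref, est))
    scale = dot / (sum(r * r for r in ref) + eps)
    target = [scale * r for r in ref]
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# A sine wave, and the same sine plus an orthogonal distortion at 10% amplitude.
ref = [math.sin(2 * math.pi * 5 * i / 100) for i in range(100)]
noisy = [r + 0.1 * math.cos(2 * math.pi * 13 * i / 100) for i, r in enumerate(ref)]
print(round(si_snr(ref, noisy), 2))  # 20.0
```

Because of the projection step, rescaling the estimate by a constant gain does not change the score, which is what distinguishes SI-SNR from plain SNR.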
Key considerations:
- ViSQOL requires a separately compiled binary
- Evaluation runs periodically during training (configurable frequency)
- Generated samples can be compared using the MOS listening tool
Step 6: Export Trained Model
Export the trained EnCodec checkpoint to a lightweight format suitable for use as a tokenizer in downstream generation models. The export strips optimizer state and training metadata, keeping only the model weights and configuration.
Export process:
- Use audiocraft.utils.export.export_encodec() to create a compact checkpoint without optimizer state
- The exported file contains best_state model weights and xp.cfg configuration
- Load exported model via CompressionModel.get_pretrained() or CompressionSolver.model_from_checkpoint()
- Reference the exported model in MusicGen training via the compression_model_checkpoint parameter
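The export step can be wrapped in a small function. This is a sketch that assumes AudioCraft is installed and that the experiment signature refers to a finished compression run; the wrapper name `export_compression_checkpoint` and the checkpoint file name `checkpoint.th` inside the experiment folder are assumptions to verify against your own run.

```python
from pathlib import Path


def export_compression_checkpoint(sig: str, out_file: str) -> Path:
    """Export the checkpoint of Dora experiment `sig` to `out_file`.

    Hypothetical wrapper around audiocraft.utils.export.export_encodec().
    """
    # Imported lazily so this module can be loaded without AudioCraft installed.
    from audiocraft import train
    from audiocraft.utils import export

    xp = train.main.get_xp_from_sig(sig)
    # Assumed layout: the solver checkpoint sits in the experiment folder.
    export.export_encodec(xp.folder / "checkpoint.th", out_file)
    return Path(out_file)
```

A call such as `export_compression_checkpoint("SIG", "/checkpoints/my_audio_lm/compression_state_dict.bin")`, with `SIG` replaced by a real signature and the output path by your own, writes the stripped checkpoint to persistent storage.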