Workflow:Facebookresearch Audiocraft MusicGen Training Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Audio_Generation, Model_Training, Distributed_Training |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
End-to-end process for training a MusicGen autoregressive language model over discrete audio tokens using the AudioCraft Dora/Hydra training infrastructure.
Description
This workflow covers the complete training pipeline for MusicGen models, from environment and dataset preparation through model training, evaluation, and checkpoint management. It uses MusicGenSolver, which implements an autoregressive language modeling task over multiple streams of discrete tokens extracted from a pre-trained EnCodec model. The pipeline is built on Dora (experiment manager) and Hydra (configuration), with support for distributed training via FSDP, conditioner embedding caching, and objective evaluation metrics (FAD, KLD, CLAP).
Usage
Execute this workflow when you need to train a new MusicGen model from scratch or fine-tune an existing pretrained model on a custom music dataset. Full-scale training requires a multi-GPU setup (32+ GPUs for the small model, 64+ for medium), though debugging can be done on a single GPU. The dataset must be prepared as audio files with corresponding JSON metadata files.
Execution Steps
Step 1: Environment and Cluster Setup
Configure the AudioCraft environment, including the team configuration, the Dora experiment output directory, and the SLURM cluster settings. Set the AUDIOCRAFT_TEAM environment variable and make sure the dora_dir path points to persistent storage (the default under /tmp/ is only suitable for quick tests).
Key considerations:
- Set AUDIOCRAFT_TEAM to match your cluster (default, labs, or custom)
- Override AUDIOCRAFT_DORA_DIR for persistent checkpoint storage
- Set AUDIOCRAFT_REFERENCE_DIR to the location of shared pretrained model references
- Cluster type is auto-detected but can be overridden via AUDIOCRAFT_CLUSTER
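The variables above can be exported from a small launcher script before any AudioCraft code runs. The following is a minimal sketch, not AudioCraft's own tooling: the storage paths and the helper name `configure_environment` are hypothetical, and it assumes the variables are read from the process environment at startup.

```python
import os

# Hypothetical locations: replace with paths on your own persistent storage.
AUDIOCRAFT_ENV = {
    "AUDIOCRAFT_TEAM": "default",
    "AUDIOCRAFT_DORA_DIR": "/checkpoint/me/audiocraft/dora",
    "AUDIOCRAFT_REFERENCE_DIR": "/checkpoint/me/audiocraft/reference",
}


def configure_environment(overrides=None):
    """Export AudioCraft settings and return warnings about risky values."""
    settings = {**AUDIOCRAFT_ENV, **(overrides or {})}
    warnings = []
    # Outputs under /tmp are fine for quick tests but may not survive a reboot.
    if settings["AUDIOCRAFT_DORA_DIR"].startswith("/tmp"):
        warnings.append("AUDIOCRAFT_DORA_DIR is under /tmp; checkpoints may not persist.")
    os.environ.update(settings)
    return warnings
```

Calling `configure_environment({"AUDIOCRAFT_CLUSTER": ...})` with a value valid for your installation would cover the cluster override as well; the accepted values are not listed here and should be checked against your AudioCraft version.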
Step 2: Prepare Audio Dataset
Organize the training data as audio files with accompanying per-file JSON metadata. Create manifest files (one JSON line per audio file) that list all tracks with their metadata. Configure the dataset YAML to point to the manifest files for train, valid, evaluate, and generate splits.
Dataset structure:
- Audio files in WAV or other supported formats
- JSON metadata files alongside audio files (same name, .json extension)
- Metadata includes: title, artist, key, BPM, genre, description, etc.
- Manifest files in JSONL format listing audio paths and metadata
- Dataset config YAML defining paths and sample rate
Key considerations:
- MusicGen uses MusicDataset which extends AudioDataset with music metadata
- Sample rate should match the EnCodec model (32 kHz for MusicGen)
- Segments are randomly sampled from tracks during training
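The dataset layout described above can be sketched with the standard library alone. This is an illustrative sketch rather than AudioCraft's manifest tool: the helper names are hypothetical, the manifest field names (`path`, `duration`, `sample_rate`) are assumptions to verify against a manifest produced by AudioCraft itself, and only WAV files are handled.

```python
import gzip
import json
import wave
from pathlib import Path


def write_track_metadata(wav_path, **fields):
    """Write the sidecar JSON (same name, .json extension) for one track."""
    json_path = Path(wav_path).with_suffix(".json")
    json_path.write_text(json.dumps(fields, indent=2), encoding="utf-8")
    return json_path


def build_manifest(audio_dir, manifest_path):
    """Write one JSON line per WAV file found under audio_dir.

    Field names are assumptions; check them against AudioCraft's own output.
    """
    entries = []
    for wav_path in sorted(Path(audio_dir).rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            sample_rate = wav.getframerate()
            duration = wav.getnframes() / sample_rate
        entries.append({
            "path": str(wav_path.resolve()),
            "duration": duration,
            "sample_rate": sample_rate,
        })
    manifest_path = Path(manifest_path)
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    # Compress only when the target name ends in .gz.
    opener = gzip.open if manifest_path.suffix == ".gz" else open
    with opener(manifest_path, "wt", encoding="utf-8") as out:
        for entry in entries:
            out.write(json.dumps(entry) + "\n")
    return entries
```

A track whose reported `sample_rate` differs from 32000 is a signal to check how resampling is handled before training.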
Step 3: Select Audio Tokenizer
Choose and configure the EnCodec compression model that will tokenize audio into discrete representations. This can be a pretrained model from HuggingFace, a custom-trained EnCodec, or an alternative tokenizer like DAC.
Options:
- Pretrained: facebook/encodec_32khz (default for MusicGen)
- Custom: provide a Dora signature or path to a trained EnCodec checkpoint
- Alternative: DAC (dac_44khz) with appropriate n_q and card settings
- Whichever tokenizer is chosen, set transformer_lm.n_q and transformer_lm.card to match its number of codebooks and codebook size
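Keeping the tokenizer and the language model settings in one table makes a mismatch harder to introduce. The sketch below rests on several assumptions: the override key `compression_model_checkpoint`, the `//pretrained/` prefix for tokenizers, and the codebook values shown are not stated in this page and should be confirmed against the chosen checkpoint. The helper name is hypothetical.

```python
# Codebook count (n_q) and codebook size (card) per tokenizer.
# These values are assumptions; confirm them against the chosen checkpoint.
TOKENIZERS = {
    "encodec_32khz": {
        "checkpoint": "//pretrained/facebook/encodec_32khz",
        "n_q": 4,
        "card": 2048,
    },
    "dac_44khz": {
        "checkpoint": "//pretrained/dac_44khz",
        "n_q": 9,
        "card": 1024,
    },
}


def tokenizer_overrides(name):
    """Return Hydra-style overrides that keep the LM in sync with the tokenizer."""
    spec = TOKENIZERS[name]
    return [
        f"compression_model_checkpoint={spec['checkpoint']}",  # assumed key name
        f"transformer_lm.n_q={spec['n_q']}",
        f"transformer_lm.card={spec['card']}",
    ]
```

The returned strings are meant to be appended to a `dora run` command line.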
Step 4: Configure Training Run
Select the solver configuration, model scale, conditioner type, and training hyperparameters. Compose the configuration from YAML config groups using Hydra overrides. For fine-tuning, set the continue_from parameter to initialize from a pretrained checkpoint.
Configuration hierarchy:
- Solver: musicgen/musicgen_base_32khz (text-to-music) or musicgen/musicgen_melody_32khz (melody)
- Model scale: small (300M), medium (1.5B), or large (3.3B)
- Conditioner: text2music, chroma2music, style2music, clapemb2music
- Optim: updates_per_epoch, learning rate, scheduler
- For fine-tuning: continue_from=//pretrained/facebook/musicgen-medium
Key considerations:
- Never modify default YAML config files directly; use Hydra overrides
- Configuration changes affect experiment signatures (hash-based tracking)
- FSDP and autocast are mutually exclusive
- Use autocast for models up to 1.5B, FSDP for larger models
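Composing the run from overrides, as opposed to editing YAML files, can be sketched as a small command builder. This is a hypothetical helper, not part of Dora or AudioCraft. The override keys `model/lm/model_scale`, `conditioner`, `fsdp.use`, and `autocast` are assumptions about the config group names; `solver` and `continue_from` come from the text above.

```python
import shlex


def build_dora_command(solver, model_scale=None, conditioner=None,
                       continue_from=None, use_fsdp=False, use_autocast=True,
                       extra_overrides=()):
    """Build a `dora run` argument list from Hydra-style overrides."""
    if use_fsdp and use_autocast:
        raise ValueError("FSDP and autocast are mutually exclusive; enable only one.")
    args = ["dora", "run", f"solver={solver}"]
    if model_scale:
        args.append(f"model/lm/model_scale={model_scale}")  # assumed key name
    if conditioner:
        args.append(f"conditioner={conditioner}")  # assumed key name
    if continue_from:
        args.append(f"continue_from={continue_from}")
    args.append(f"fsdp.use={str(use_fsdp).lower()}")  # assumed key name
    args.append(f"autocast={str(use_autocast).lower()}")  # assumed key name
    args.extend(extra_overrides)
    return args


if __name__ == "__main__":
    command = build_dora_command(
        "musicgen/musicgen_base_32khz",
        model_scale="medium",
        continue_from="//pretrained/facebook/musicgen-medium",
    )
    print(shlex.join(command))
```

Because every override changes the experiment signature, generating the command from one function also makes it easier to reproduce the exact signature later.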
Step 5: Launch Training
Launch the training job using Dora, either as a single run or through a grid for sweeping hyperparameters. The MusicGenSolver training loop iterates over epochs, each consisting of a train stage (fixed number of update steps), a validation stage, and periodic evaluation and generation stages.
Launch methods:
- Single run: dora run solver=musicgen/musicgen_base_32khz [overrides]
- Grid: dora grid musicgen.musicgen_base_32khz
- Distributed: add -d flag for multi-GPU local training
- SLURM: grids automatically schedule on SLURM clusters
Training loop per epoch:
- Train stage: runs for optim.updates_per_epoch steps (default 2000)
- Valid stage: computes cross-entropy and perplexity on validation set
- Evaluate stage: runs every N epochs with objective metrics (FAD, KLD, CLAP)
- Generate stage: produces audio samples every N epochs for listening
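The per-epoch stage order can be illustrated with a schematic scheduler. This is a simplified sketch of the loop described above, not the MusicGenSolver implementation: the function names and the `evaluate_every` / `generate_every` parameters are hypothetical, epochs are assumed to be 1-indexed, and any special handling the real solver applies (for example at the final epoch) is ignored.

```python
def stages_for_epoch(epoch, evaluate_every=None, generate_every=None):
    """Return the stages that run in a given 1-indexed epoch."""
    stages = ["train", "valid"]
    if evaluate_every and epoch % evaluate_every == 0:
        stages.append("evaluate")
    if generate_every and epoch % generate_every == 0:
        stages.append("generate")
    return stages


def training_schedule(num_epochs, updates_per_epoch=2000, **every):
    """Yield (epoch, cumulative update steps, stages) for each epoch."""
    for epoch in range(1, num_epochs + 1):
        yield epoch, epoch * updates_per_epoch, stages_for_epoch(epoch, **every)


if __name__ == "__main__":
    for epoch, steps, stages in training_schedule(4, evaluate_every=2, generate_every=4):
        print(epoch, steps, stages)
```

The cumulative step count makes explicit that an epoch is defined by a fixed number of updates, independent of dataset size.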
Step 6: Monitor and Evaluate
Monitor training progress through Dora experiment tracking, validation metrics, and periodic sample generation. Use the MOS listening tool to compare generated samples across model versions.
Monitoring tools:
- dora info -f SIG -t to tail training logs
- Validation loss (cross-entropy, perplexity) tracked each epoch
- Objective metrics: FAD (distributional quality), KLD (audio similarity), CLAP (text consistency)
- MOS tool: Flask web app for side-by-side audio comparison
Key considerations:
- An epoch does not necessarily mean one pass over the entire dataset
- Checkpointing happens at the end of every epoch, so aim for epochs of roughly 30 minutes
- EMA (Exponential Moving Average) weights are tracked for the best model state
- Training can be safely interrupted and resumed from the latest checkpoint
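Two of the points above lend themselves to quick arithmetic: perplexity is the exponential of the mean cross-entropy (when the loss is measured in nats), and the fraction of the dataset seen per epoch depends on the update count and batch size, not on the dataset itself. A minimal sketch, with hypothetical helper names and a notional `num_segments` count, since segments are sampled randomly:

```python
import math


def perplexity(cross_entropy):
    """Perplexity as exp of the mean cross-entropy, assuming natural-log units."""
    return math.exp(cross_entropy)


def dataset_passes_per_epoch(updates_per_epoch, batch_size, num_segments):
    """Approximate fraction of a dataset of num_segments consumed in one epoch."""
    return updates_per_epoch * batch_size / num_segments


if __name__ == "__main__":
    print(perplexity(3.5))
    # Illustrative numbers only: 2000 updates at a batch size of 192.
    print(dataset_passes_per_epoch(2000, 192, 768000))
```

A value below 1 from `dataset_passes_per_epoch` means an epoch covers only part of the data, which is why epoch counts alone say little about how much audio the model has seen.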
Step 7: Manage Checkpoints
Handle checkpoint saving, loading, and experiment management. Dora maintains checkpoints in signature-based folders, supports resuming interrupted training, and tracks the best model state based on validation metrics.
Checkpoint operations:
- Automatic saving every epoch with best state tracking
- Resume: re-run the same dora command; an identical configuration yields the same signature and reuses the same folder
- Clear and restart: dora run --clear to discard previous checkpoints
- Access trained model: MusicGenSolver.get_eval_solver_from_sig('SIG')
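Loading a trained model back for inspection might look like the sketch below. The wrapper name `load_trained_solver` is hypothetical, and the `device` and `batch_size` keyword arguments are assumptions about the signature of `get_eval_solver_from_sig`; check them against your AudioCraft version. `'SIG'` stands for the experiment signature, as in the text above.

```python
def load_trained_solver(sig, device="cpu", batch_size=8):
    """Load an evaluation-ready solver for the experiment with signature `sig`."""
    # Imported lazily so this module can be loaded without AudioCraft installed.
    from audiocraft.solvers import MusicGenSolver

    # The device and batch_size keywords are assumed; verify against your version.
    return MusicGenSolver.get_eval_solver_from_sig(
        sig, device=device, batch_size=batch_size
    )
```

With AudioCraft installed and a finished run available, `solver = load_trained_solver('SIG')` should return a solver from which the trained model can be accessed.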