Workflow:Facebookresearch Audiocraft MusicGen Training Pipeline

Knowledge Sources	AudioCraft Training Docs, MusicGen Paper
Domains	Audio_Generation, Model_Training, Distributed_Training
Last Updated	2026-02-13 23:00 GMT

Overview

End-to-end process for training a MusicGen autoregressive language model over discrete audio tokens using the AudioCraft Dora/Hydra training infrastructure.

Description

This workflow covers the complete training pipeline for MusicGen models, from environment and dataset preparation through model training, evaluation, and checkpoint management. It uses MusicGenSolver, which implements an autoregressive language modeling task over multiple streams of discrete tokens extracted from a pre-trained EnCodec model. The pipeline is built on Dora (experiment manager) and Hydra (configuration), with support for distributed training via FSDP, conditioner embedding caching, and objective evaluation metrics (FAD, KLD, CLAP).

Usage

Execute this workflow when you need to train a new MusicGen model from scratch or fine-tune an existing pretrained model on a custom music dataset. Full-scale training requires a multi-GPU setup (32+ GPUs for the small model, 64+ for the medium model), though debugging can be done on a single GPU. The dataset must be prepared as audio files with corresponding JSON metadata files.

Execution Steps

Step 1: Environment and Cluster Setup

Configure the AudioCraft environment including the team configuration, Dora experiment output directory, and SLURM cluster settings. Set the AUDIOCRAFT_TEAM environment variable and ensure the dora_dir path points to persistent storage (the default /tmp/ is only suitable for quick tests).

Key considerations:

Set AUDIOCRAFT_TEAM to match your cluster (default, labs, or custom)
Override AUDIOCRAFT_DORA_DIR for persistent checkpoint storage
Set AUDIOCRAFT_REFERENCE_DIR to the directory that holds shared pretrained model references
Cluster type is auto-detected but can be overridden via AUDIOCRAFT_CLUSTER
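
A minimal sketch of a pre-flight check for these variables. The helper name check_audiocraft_env and the example paths are hypothetical and not part of AudioCraft; only the variable names come from the list above.

```python
import os
from pathlib import Path


def check_audiocraft_env(env=None):
    """Return a list of warnings about the AudioCraft-related variables.

    Hypothetical helper: it only inspects the environment, it does not
    configure AudioCraft or Dora.
    """
    env = os.environ if env is None else env
    warnings = []
    if "AUDIOCRAFT_TEAM" not in env:
        warnings.append("AUDIOCRAFT_TEAM is unset; the default team config applies")
    dora_dir = env.get("AUDIOCRAFT_DORA_DIR")
    if dora_dir is None:
        warnings.append("AUDIOCRAFT_DORA_DIR is unset; outputs go to the /tmp default")
    elif Path(dora_dir).parts[:2] == ("/", "tmp"):
        warnings.append("AUDIOCRAFT_DORA_DIR points under /tmp; checkpoints may not persist")
    return warnings


if __name__ == "__main__":
    for message in check_audiocraft_env():
        print("warning:", message)
```

Running the check before launching a long job catches the common mistake of leaving checkpoints on temporary storage.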

Step 2: Prepare Audio Dataset

Organize the training data as audio files with accompanying per-file JSON metadata. Create manifest files (one JSON line per audio file) that list all tracks with their metadata. Configure the dataset YAML to point to the manifest files for train, valid, evaluate, and generate splits.

Dataset structure:

Audio files in WAV or other supported formats
JSON metadata files alongside audio files (same name, .json extension)
Metadata includes: title, artist, key, BPM, genre, description, etc.
Manifest files in JSONL format listing audio paths and metadata
Dataset config YAML defining paths and sample rate
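
The manifest layout described above can be sketched with the standard library alone. This is an illustration under stated assumptions: the field names (path, duration, sample_rate, info_path), the gzip compression, and the helper build_manifest are all hypothetical, and AudioCraft's own tooling should be preferred for real manifests.

```python
import gzip
import json
import wave
from pathlib import Path


def build_manifest(audio_dir, manifest_path):
    """Write one JSON line per WAV file and return the number of entries."""
    entries = []
    for wav_path in sorted(Path(audio_dir).rglob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            sample_rate = wav.getframerate()
            duration = wav.getnframes() / sample_rate
        entries.append({
            "path": str(wav_path),
            "duration": duration,
            "sample_rate": sample_rate,
            # Sidecar metadata path: same name, .json extension (assumed field).
            "info_path": str(wav_path.with_suffix(".json")),
        })
    Path(manifest_path).parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(manifest_path, "wt", encoding="utf-8") as out:
        for entry in entries:
            out.write(json.dumps(entry) + "\n")
    return len(entries)
```

One manifest would be built per split (train, valid, evaluate, generate), and the dataset YAML would then point at each of them.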

Key considerations:

MusicGen uses MusicDataset which extends AudioDataset with music metadata
Sample rate should match the EnCodec model (32 kHz for MusicGen)
Segments are randomly sampled from tracks during training
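
Random segment sampling can be illustrated as follows. This is a simplified sketch, not the AudioDataset implementation: the function name, the 30-second segment length, and the handling of short tracks are assumptions made for the example.

```python
import random


def sample_segment(track_duration, segment_duration, rng):
    """Pick a random (start, end) window in seconds within one track."""
    if track_duration < segment_duration:
        # Shorter tracks are returned whole here; a real loader might pad them.
        return 0.0, track_duration
    start = rng.uniform(0.0, track_duration - segment_duration)
    return start, start + segment_duration


rng = random.Random(0)  # fixed seed so the example is reproducible
start, end = sample_segment(track_duration=180.0, segment_duration=30.0, rng=rng)
print(f"segment: {start:.2f}s to {end:.2f}s")
```

Because a new window is drawn each time a track is visited, the model sees different excerpts of the same track across epochs.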

Step 3: Select Audio Tokenizer

Choose and configure the EnCodec compression model that will tokenize audio into discrete representations. This can be a pretrained model from HuggingFace, a custom-trained EnCodec, or an alternative tokenizer like DAC.

Options:

Pretrained: facebook/encodec_32khz (default for MusicGen)
Custom: provide a Dora signature or path to a trained EnCodec checkpoint
Alternative: DAC (dac_44khz) with appropriate n_q and card settings
Set transformer_lm.n_q and transformer_lm.card to match the tokenizer
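
A small sketch of keeping the language model settings in sync with the chosen tokenizer. The codebook counts and sizes below (4 x 2048 for EnCodec 32 kHz, 9 x 1024 for DAC 44 kHz) and the compression_model_checkpoint key are assumptions to verify against the tokenizer actually loaded; TokenizerSpec and tokenizer_overrides are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenizerSpec:
    checkpoint: str  # value for the (assumed) compression_model_checkpoint key
    n_q: int         # number of codebooks, i.e. parallel token streams
    card: int        # codebook size, i.e. number of distinct tokens per stream


# Assumed values; check them against the tokenizer you actually use.
TOKENIZERS = {
    "encodec_32khz": TokenizerSpec("//pretrained/facebook/encodec_32khz", n_q=4, card=2048),
    "dac_44khz": TokenizerSpec("//pretrained/dac_44khz", n_q=9, card=1024),
}


def tokenizer_overrides(name):
    """Return Hydra-style overrides that keep the LM consistent with the tokenizer."""
    spec = TOKENIZERS[name]
    return [
        f"compression_model_checkpoint={spec.checkpoint}",
        f"transformer_lm.n_q={spec.n_q}",
        f"transformer_lm.card={spec.card}",
    ]


print(tokenizer_overrides("dac_44khz"))
```

Deriving all three overrides from one table avoids the mismatch where the tokenizer changes but n_q or card is left at its previous value.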

Step 4: Configure Training Run

Select the solver configuration, model scale, conditioner type, and training hyperparameters. Compose the configuration from YAML config groups using Hydra overrides. For fine-tuning, set the continue_from parameter to initialize from a pretrained checkpoint.

Configuration hierarchy:

Solver: musicgen/musicgen_base_32khz (text-to-music) or musicgen/musicgen_melody_32khz (melody)
Model scale: small (300M), medium (1.5B), or large (3.3B)
Conditioner: text2music, chroma2music, style2music, clapemb2music
Optim: updates_per_epoch, learning rate, scheduler
For fine-tuning: continue_from=//pretrained/facebook/musicgen-medium

Key considerations:

Never modify default YAML config files directly; use Hydra overrides
Configuration overrides change the experiment signature, the hash Dora uses to track the run and name its output folder
FSDP and autocast are mutually exclusive
Use autocast for models up to 1.5B parameters and FSDP for larger models
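
The autocast/FSDP rule above can be encoded as a small helper that never enables both. This is a sketch: the override keys (model/lm/model_scale, fsdp.use, autocast) are assumed names to check against the configuration files, and precision_overrides is hypothetical.

```python
def precision_overrides(model_scale, use_fsdp=None):
    """Return Hydra-style overrides selecting either autocast or FSDP, never both."""
    params_in_billions = {"small": 0.3, "medium": 1.5, "large": 3.3}
    if model_scale not in params_in_billions:
        raise ValueError(f"unknown model scale: {model_scale!r}")
    if use_fsdp is None:
        # Rule of thumb from the text: autocast up to 1.5B, FSDP above that.
        use_fsdp = params_in_billions[model_scale] > 1.5
    return [
        f"model/lm/model_scale={model_scale}",
        f"fsdp.use={'true' if use_fsdp else 'false'}",
        f"autocast={'false' if use_fsdp else 'true'}",
    ]


print(precision_overrides("medium"))
print(precision_overrides("large"))
```

Passing use_fsdp explicitly overrides the size-based default while still keeping the two options mutually exclusive.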

Step 5: Launch Training

Launch the training job using Dora, either as a single run or through a grid for sweeping hyperparameters. The MusicGenSolver training loop iterates over epochs, each consisting of a train stage (fixed number of update steps), a validation stage, and periodic evaluation and generation stages.

Launch methods:

Single run: dora run solver=musicgen/musicgen_base_32khz [overrides]
Grid: dora grid musicgen.musicgen_base_32khz
Distributed: add the -d flag to dora run for multi-GPU training on the local machine
SLURM: grids automatically schedule on SLURM clusters
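
A minimal sketch of assembling the single-run command line programmatically, for example from a launcher script. The helper dora_run_command is hypothetical; it only builds the argument list and does not invoke Dora. The -d and --clear flags are the ones named in this workflow.

```python
import shlex


def dora_run_command(solver, overrides=(), distributed=False, clear=False):
    """Build the argv for a `dora run` invocation without executing it."""
    argv = ["dora", "run"]
    if distributed:
        argv.append("-d")       # multi-GPU training on the local machine
    if clear:
        argv.append("--clear")  # discard previous checkpoints for this signature
    argv.append(f"solver={solver}")
    argv.extend(overrides)
    return argv


command = dora_run_command(
    "musicgen/musicgen_base_32khz",
    overrides=["conditioner=text2music"],
    distributed=True,
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run as-is, which avoids shell quoting problems with overrides that contain brackets or spaces.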

Training loop per epoch:

Train stage: runs for optim.updates_per_epoch steps (default 2000)
Valid stage: computes cross-entropy and perplexity on validation set
Evaluate stage: runs every N epochs with objective metrics (FAD, KLD, CLAP)
Generate stage: produces audio samples every N epochs for listening
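
The per-epoch schedule above can be sketched as a function that lists which stages run in a given epoch. This is an illustration only: the parameter names evaluate_every and generate_every and the modulo rule are assumptions, and the solver's actual scheduling logic may differ.

```python
def stages_for_epoch(epoch, evaluate_every, generate_every):
    """List the stages run in a 1-indexed epoch under a simple 'every N epochs' rule."""
    stages = ["train", "valid"]
    # A value of 0 or None disables the corresponding periodic stage.
    if evaluate_every and epoch % evaluate_every == 0:
        stages.append("evaluate")
    if generate_every and epoch % generate_every == 0:
        stages.append("generate")
    return stages


for epoch in range(1, 5):
    print(epoch, stages_for_epoch(epoch, evaluate_every=2, generate_every=4))
```

Keeping the expensive evaluate and generate stages on a sparser cadence than train and valid keeps most epochs close to pure training time.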

Step 6: Monitor and Evaluate

Monitor training progress through Dora experiment tracking, validation metrics, and periodic sample generation. Use the MOS listening tool to compare generated samples across model versions.

Monitoring tools:

Logs: dora info -f SIG -t tails the training logs of the experiment with signature SIG
Validation loss (cross-entropy, perplexity) tracked each epoch
Objective metrics: FAD (distributional quality), KLD (audio similarity), CLAP (text consistency)
MOS tool: Flask web app for side-by-side audio comparison
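
The two validation metrics are directly related: perplexity is the exponential of the cross-entropy when the latter is measured in nats. A minimal worked sketch follows; the function names and the example probabilities are hypothetical, and how the solver averages across codebooks is not modeled here.

```python
import math


def cross_entropy(target_probs):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def perplexity(ce_nats):
    """Perplexity is exp(cross-entropy) when cross-entropy is in nats."""
    return math.exp(ce_nats)


# Probabilities the model assigned to the correct token at three positions.
ce = cross_entropy([0.5, 0.25, 0.125])
print(f"cross-entropy = {ce:.4f} nats, perplexity = {perplexity(ce):.4f}")
```

For these three probabilities the perplexity is 4: on average the model is as uncertain as if it were choosing uniformly among four tokens.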

Key considerations:

An epoch is a fixed number of update steps (optim.updates_per_epoch), not necessarily one pass over the entire dataset
Checkpointing happens every epoch (target ~30 min per epoch)
EMA (Exponential Moving Average) weights are tracked for the best model state
Training can be safely interrupted and resumed from the latest checkpoint
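
To see why an epoch need not cover the dataset, the amount of audio consumed per epoch can be compared with the dataset size. The numbers below (batch size 192, 30-second segments, a 20,000-hour dataset) are illustrative assumptions; only the 2000 updates per epoch comes from this workflow.

```python
def dataset_fraction_per_epoch(updates_per_epoch, batch_size, segment_seconds, dataset_hours):
    """Fraction of the dataset's total audio duration consumed in one epoch."""
    seconds_seen = updates_per_epoch * batch_size * segment_seconds
    return seconds_seen / (dataset_hours * 3600.0)


fraction = dataset_fraction_per_epoch(
    updates_per_epoch=2000,   # default named in this workflow
    batch_size=192,           # assumed
    segment_seconds=30.0,     # assumed
    dataset_hours=20000.0,    # assumed
)
print(f"{fraction:.0%} of the dataset is seen per epoch")
```

Under these assumptions one epoch consumes 3,200 hours of audio, or 16% of the dataset, so roughly six epochs correspond to one nominal pass.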

Step 7: Manage Checkpoints

Handle checkpoint saving, loading, and experiment management. Dora maintains checkpoints in signature-based folders, supports resuming interrupted training, and tracks the best model state based on validation metrics.

Checkpoint operations:

Automatic saving every epoch with best state tracking
Resume: re-run the same dora command (an identical configuration yields the same signature, so the same folder is reused)
Clear and restart: dora run --clear to discard previous checkpoints
Access trained model: MusicGenSolver.get_eval_solver_from_sig('SIG')
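
The idea behind signature-based folders can be illustrated with a toy hash over the configuration overrides. This is explicitly not Dora's algorithm: toy_signature, the canonicalization, and the 8-character truncation are assumptions chosen only to show why re-running an identical command lands in the same folder.

```python
import hashlib
import json


def toy_signature(overrides):
    """Hash the set of config overrides into a short hex ID (toy scheme, not Dora's)."""
    # Sorting makes the ID independent of the order in which overrides are given.
    canonical = json.dumps(sorted(overrides), separators=(",", ":"))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:8]


first = toy_signature(["solver=musicgen/musicgen_base_32khz", "conditioner=text2music"])
same = toy_signature(["conditioner=text2music", "solver=musicgen/musicgen_base_32khz"])
other = toy_signature(["solver=musicgen/musicgen_base_32khz", "conditioner=chroma2music"])
print(first, same, other)
```

The first two IDs match because the configuration is the same, while changing the conditioner produces a different ID and therefore a separate experiment folder.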
