Principle:Facebookresearch Audiocraft Language Model Export
Overview
Language Model Export converts a full training checkpoint -- containing optimizer states, EMA weights, training metadata, and other training-only artifacts -- into a lightweight, deployment-ready format containing only what inference needs: the model weights and the configuration. This transformation is essential for distributing trained MusicGen or AudioGen language models, because training checkpoints can be several times larger than the model weights alone.
Theoretical Background
During training, AudioCraft's solver framework (built on Dora and Flashy) saves comprehensive checkpoints that include:
- Model state dict (`best_state`) -- the best-performing model weights, nested under a `model` key
- FSDP best state (`fsdp_best_state`) -- an alternative best state when using Fully Sharded Data Parallel training, also nested under a `model` key
- Optimizer state -- full optimizer parameter groups and momentum buffers
- EMA state -- exponential moving average of model parameters
- Training configuration (`xp.cfg`) -- the Hydra/OmegaConf configuration used for the experiment
- Epoch and metrics history -- training progress tracking information
For deployment, only the model weights and the configuration (to reconstruct the model architecture) are needed. The export process strips away all training-specific data and serializes the result in a standardized format that includes an AudioCraft version tag for compatibility tracking.
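The stripping step can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions, not AudioCraft's actual export code: the function name `strip_checkpoint`, the `optimizer`, `ema`, and `epoch` key names, the flat layout of the exported weights, and the version string are all hypothetical. The configuration is passed as an already-serialized YAML string so the example needs no third-party packages.

```python
from typing import Any, Dict


def strip_checkpoint(pkg: Dict[str, Any], cfg_yaml: str, version: str) -> Dict[str, Any]:
    """Keep only what inference needs from a training checkpoint dict.

    Hypothetical helper: in a real pipeline `pkg` would typically come from
    torch.load and the result would be written with torch.save.
    """
    best_state = pkg["best_state"]["model"]  # weights are nested under 'model'
    return {
        "best_state": best_state,  # model weights only (flat layout is an assumption)
        "xp.cfg": cfg_yaml,        # configuration as a YAML string
        "version": version,        # library version tag
        "exported": True,          # marks this as an exported checkpoint
    }


# Toy checkpoint; lists stand in for tensors.
training_pkg = {
    "best_state": {"model": {"linear.weight": [0.1, 0.2]}},
    "fsdp_best_state": {},
    "optimizer": {"state": {"exp_avg": [0.0, 0.0]}},  # assumed key name
    "ema": {"linear.weight": [0.1, 0.2]},             # assumed key name
    "epoch": 42,                                      # assumed key name
    "xp.cfg": {"lm": {"dim": 1024}},
}

exported = strip_checkpoint(training_pkg, cfg_yaml="lm:\n  dim: 1024\n", version="1.0.0")
print(sorted(exported))  # ['best_state', 'exported', 'version', 'xp.cfg']
```

Everything not listed in the returned dictionary (optimizer state, EMA state, epoch counters) is dropped by omission, which is where the size reduction comes from.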
Key Concepts
| Concept | Description |
|---|---|
| `best_state` | The model state dictionary corresponding to the best validation performance during training, stored under `pkg['best_state']['model']` |
| `fsdp_best_state` | An alternative best state key used when the model was trained with FSDP; takes priority over `best_state` when present and non-empty |
| `xp.cfg` | The experiment configuration serialized as a YAML string via `OmegaConf.to_yaml()`, preserving all architecture and dataset parameters |
| exported flag | A boolean `exported: True` marker that downstream loaders use to distinguish exported checkpoints from training checkpoints |
| version tag | The AudioCraft library version string embedded in the export for compatibility verification |
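To illustrate how a loader might use the exported flag and version tag, the sketch below classifies an already-loaded package. The function name `describe_checkpoint` and the returned strings are hypothetical, and the logic is an assumption about what such a check could look like, not AudioCraft's loader.

```python
from typing import Any, Dict


def describe_checkpoint(pkg: Dict[str, Any]) -> str:
    """Classify a loaded checkpoint dict as exported or training (hypothetical helper)."""
    if pkg.get("exported", False):
        # Exported packages carry a version tag for compatibility checks.
        return f"exported (version {pkg.get('version', 'unknown')})"
    return "training"


exported_pkg = {"exported": True, "version": "1.0.0", "best_state": {}, "xp.cfg": ""}
training_pkg = {"best_state": {"model": {}}, "fsdp_best_state": {}}

print(describe_checkpoint(exported_pkg))  # exported (version 1.0.0)
print(describe_checkpoint(training_pkg))  # training
```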
FSDP Handling
When a model is trained with PyTorch's Fully Sharded Data Parallel (FSDP), its parameters are sharded across ranks, so the best model state is stored under a separate key, `fsdp_best_state`. The export function checks this key first and, if it is present and non-empty, uses it as the source of truth for the model weights; otherwise it falls back to `best_state`. Models trained with or without FSDP can therefore be exported through the same pipeline.
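The key-selection rule described above can be sketched as follows. This is a minimal illustration on plain dictionaries; the function name `select_best_state` and the error message are assumptions, and lists stand in for tensors.

```python
from typing import Any, Dict


def select_best_state(pkg: Dict[str, Any]) -> Dict[str, Any]:
    """Return the model weights, preferring the FSDP state when it has data."""
    fsdp_state = pkg.get("fsdp_best_state")
    if fsdp_state:  # present and non-empty
        return fsdp_state["model"]
    best_state = pkg.get("best_state")
    if not best_state:
        raise ValueError("checkpoint has neither 'fsdp_best_state' nor 'best_state'")
    return best_state["model"]


fsdp_pkg = {
    "fsdp_best_state": {"model": {"w": [1.0]}},
    "best_state": {"model": {"w": [0.5]}},
}
plain_pkg = {"fsdp_best_state": {}, "best_state": {"model": {"w": [0.5]}}}

print(select_best_state(fsdp_pkg))   # {'w': [1.0]}  (FSDP state takes priority)
print(select_best_state(plain_pkg))  # {'w': [0.5]}  (falls back to best_state)
```

Relying on truthiness means an empty dictionary and a missing key are treated the same way, which matches the "present and non-empty" condition.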
Design Rationale
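- Size reduction: Training checkpoints can be roughly 3-4x larger than the model weights alone. Adam, for example, keeps two state tensors per parameter, so weights plus optimizer state come to about 3x the weights when all tensors share a dtype, and an EMA copy of the parameters adds about one more.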
- Portability: Exported models are self-contained with their configuration, enabling reconstruction on any machine without access to the original training setup.
- Version tracking: The embedded version string helps diagnose compatibility issues when loading models across AudioCraft versions.
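The size-reduction estimate can be reproduced with simple arithmetic. The sketch below is a rough model only: it assumes every stored copy has the same dtype as the weights and ignores metadata, and the function name and the parameter count in the usage lines are hypothetical.

```python
def copies_per_parameter(optimizer_states_per_param: int = 2, has_ema: bool = True) -> int:
    """Tensor copies stored per parameter in a training checkpoint.

    Rough model: assumes all copies share the weights' dtype and ignores
    configuration, metrics history, and other small metadata.
    """
    copies = 1                            # the model weights themselves
    copies += optimizer_states_per_param  # e.g. Adam keeps 2 state tensors
    if has_ema:
        copies += 1                       # EMA copy of the parameters
    return copies


n_params = 1_000_000_000   # hypothetical model size
bytes_per_value = 4        # float32
weights_bytes = n_params * bytes_per_value

print(copies_per_parameter(has_ema=False))  # 3
print(copies_per_parameter(has_ema=True))   # 4
print(copies_per_parameter() * weights_bytes // 10**9, "GB vs", weights_bytes // 10**9, "GB")
```

With these assumptions, the hypothetical one-billion-parameter model goes from about 16 GB as a training checkpoint to about 4 GB once exported.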