Workflow: facebookresearch/audiocraft Model Export and Deployment
| Knowledge Sources | |
|---|---|
| Domains | Model_Deployment, Checkpoint_Management, Audio_Generation |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
End-to-end process for exporting trained AudioCraft models (MusicGen, AudioGen, EnCodec) from training checkpoints to lightweight release-ready formats and loading them for inference.
Description
This workflow covers the complete model export pipeline for AudioCraft. After training a model using Dora, the training checkpoint contains optimizer state, EMA weights, FSDP sharding information, and other training artifacts that are unnecessary for inference. This workflow exports only the essential model weights and configuration, bundles the language model with its companion EnCodec tokenizer, and demonstrates how to load the exported model using the high-level generation API. It also covers fine-tuning checkpoint preparation, including the special case of converting mono models for stereo fine-tuning.
Usage
Execute this workflow after completing model training when you need to create a distributable, inference-ready model package. Also use this workflow when preparing a pretrained model for fine-tuning with modified architecture (e.g., mono to stereo conversion).
Execution Steps
Step 1: Locate Training Checkpoint
Identify the training checkpoint to export using the Dora experiment signature. Each experiment has a unique signature (hash) that maps to a specific folder containing checkpoints. Use the Dora API to resolve the signature to a filesystem path.
Key considerations:
- Use train.main.get_xp_from_sig('SIG') to get the experiment object
- The checkpoint file is at xp.folder / 'checkpoint.th'
- For FSDP-trained models, the best state may be in fsdp_best_state instead of best_state
- Verify the training completed successfully before exporting
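The step above can be sketched as follows. `checkpoint_path` is a small hypothetical helper that only encodes the folder layout; the commented lines use the `train.main.get_xp_from_sig` API named above and assume AudioCraft is installed.

```python
from pathlib import Path

def checkpoint_path(xp_folder) -> Path:
    """The training checkpoint lives at <experiment folder>/checkpoint.th."""
    return Path(xp_folder) / 'checkpoint.th'

# With audiocraft installed and a finished run:
#   from audiocraft import train
#   xp = train.main.get_xp_from_sig('SIG')   # 'SIG' is the Dora experiment hash
#   ckpt = checkpoint_path(xp.folder)
#   assert ckpt.exists(), 'run training to completion before exporting'
```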
Step 2: Export Language Model
Export the language model (MusicGen, AudioGen, MAGNeT, or JASCO) checkpoint to a lightweight format. The export function extracts only the best model state dictionary and the Hydra configuration, discarding optimizer state, training history, and FSDP metadata.
What gets exported:
- best_state: the model weights achieving the best validation metric
- xp.cfg: the Hydra configuration serialized as YAML
- version: AudioCraft version for compatibility tracking
- exported: flag marking this as an export checkpoint
Key considerations:
- FSDP models store best state differently (fsdp_best_state.model)
- The exported file is significantly smaller than the training checkpoint
- Output is a standard torch.save dictionary
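A minimal sketch of this step. `pick_best_state` is a hypothetical helper that mirrors the FSDP note above; the actual export is the single `export.export_lm` call shown in the comments, with an illustrative output path.

```python
def pick_best_state(pkg: dict) -> dict:
    """Select model weights from a loaded training checkpoint package."""
    if pkg.get('fsdp_best_state'):            # FSDP runs store weights here
        return pkg['fsdp_best_state']['model']
    return pkg['best_state']['model']

# With audiocraft installed, the export itself is one call:
#   from audiocraft.utils import export
#   from audiocraft import train
#   xp = train.main.get_xp_from_sig('SIG')
#   export.export_lm(xp.folder / 'checkpoint.th',
#                    '/checkpoints/my_audio_lm/state_dict.bin')
```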
Step 3: Export Compression Model
Bundle the EnCodec compression model that was used during training. This step depends on whether you trained your own EnCodec or used a pretrained one.
Two cases:
- Custom EnCodec: export the trained EnCodec checkpoint using export_encodec()
- Pretrained EnCodec: create a reference pointer using export_pretrained_compression_model()
Key considerations:
- When using a pretrained model, only a reference string is stored (not the actual weights)
- The reference will trigger automatic download from HuggingFace at load time
- Both files must be in the same directory for the loader to find them
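Both cases can be sketched as below. The commented calls are the `export_encodec` and `export_pretrained_compression_model` functions named in this step; the paths and the `facebook/encodec_32khz` name are illustrative, and `is_pretrained_reference` is a hypothetical heuristic for this sketch only.

```python
def is_pretrained_reference(source: str) -> bool:
    """Heuristic for this sketch: pretrained references are model names
    (e.g. 'facebook/encodec_32khz'), not checkpoint files on disk."""
    return not source.endswith('.th')

# Case 1: you trained your own EnCodec (a separate Dora signature):
#   from audiocraft.utils import export
#   from audiocraft import train
#   xp = train.main.get_xp_from_sig('SIG_OF_ENCODEC')
#   export.export_encodec(xp.folder / 'checkpoint.th',
#                         '/checkpoints/my_audio_lm/compression_state_dict.bin')

# Case 2: you used a pretrained EnCodec; only a reference is written, and the
# weights are downloaded from HuggingFace when the model is loaded:
#   export.export_pretrained_compression_model(
#       'facebook/encodec_32khz',
#       '/checkpoints/my_audio_lm/compression_state_dict.bin')
```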
Step 4: Organize Export Directory
Structure the exported files in a directory that the AudioCraft loader expects. The directory must contain both the language model state dict and the compression model state dict with specific filenames.
Required directory structure:
- state_dict.bin - the exported language model weights
- compression_state_dict.bin - the exported or referenced EnCodec weights
- Both files must be in the same parent directory
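The layout above can be checked with a few lines of standard-library Python; the filenames are the required ones from this step, while `check_export_dir` itself is a hypothetical convenience.

```python
from pathlib import Path

REQUIRED_FILES = ('state_dict.bin', 'compression_state_dict.bin')

def check_export_dir(folder) -> list:
    """Return the required export files missing from `folder` (empty list == OK)."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```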
Step 5: Validate Exported Model
Load the exported model using the high-level generation API and verify it produces valid output. This confirms that the export process preserved the model correctly and that inference works end-to-end.
Validation approach:
- Load via MusicGen.get_pretrained('/path/to/export/dir/')
- Run a short generation with a test description
- Compare output quality to pre-export generation
- Verify sample rate and duration match expectations
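A sketch of the validation loop. `expected_samples` is a hypothetical helper for the duration check; the commented lines use the `MusicGen.get_pretrained` API named above, with an illustrative prompt and duration.

```python
def expected_samples(sample_rate: int, duration_sec: float) -> int:
    """Number of audio samples a generation of `duration_sec` should contain."""
    return int(sample_rate * duration_sec)

# With the export directory from Step 4:
#   from audiocraft.models import MusicGen
#   model = MusicGen.get_pretrained('/checkpoints/my_audio_lm/')
#   model.set_generation_params(duration=4)
#   wav = model.generate(['90s rock song with electric guitar'])  # [B, C, T]
#   assert wav.shape[-1] == expected_samples(model.sample_rate, 4)
```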
Step 6: Prepare Fine-Tuning Checkpoints (Optional)
For special fine-tuning scenarios such as converting a mono model to stereo, manually modify the checkpoint structure. This involves duplicating embedding and linear layer weights to accommodate the doubled codebook count (left + right channels interleaved).
Mono to stereo conversion:
- Load the exported state dict
- Duplicate embedding and linear weights for paired codebooks
- Save with the training checkpoint format: dict with best_state.model key
- Use as continue_from target without the //pretrained/ prefix
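The duplication above can be sketched as a key remapping, under two assumptions to verify against your checkpoint: that the LM state dict names per-codebook parameters `emb.{k}.…` and `linears.{k}.…`, and that the stereo model interleaves left/right codebooks (left at index 2k, right at 2k+1). Per-codebook shapes are unchanged, so each mono weight can simply be reused for both channels.

```python
import re

# Assumed key pattern for per-codebook parameters; verify on your model.
_CODEBOOK_KEY = re.compile(r'^(emb|linears)\.(\d+)\.(.+)$')

def mono_to_stereo_state(mono_state: dict) -> dict:
    """Reuse each mono codebook's embedding/output-linear weights for both
    stereo channels, leaving all other (shared) weights untouched."""
    out = {}
    for key, value in mono_state.items():
        m = _CODEBOOK_KEY.match(key)
        if m is None:
            out[key] = value                         # shared transformer weights
            continue
        prefix, k, rest = m.group(1), int(m.group(2)), m.group(3)
        out[f'{prefix}.{2 * k}.{rest}'] = value      # left channel
        out[f'{prefix}.{2 * k + 1}.{rest}'] = value  # right channel
    return out

# Save in the training-checkpoint format this step describes, then point
# continue_from at the file (without the //pretrained/ prefix):
#   torch.save({'best_state': {'model': mono_to_stereo_state(sd)}},
#              '/checkpoints/stereo_finetune.th')
```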