Workflow: facebookresearch/audiocraft Model Export and Deployment
| Knowledge Sources | |
|---|---|
| Domains | Model_Deployment, Checkpoint_Management, Audio_Generation |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
End-to-end process for exporting trained AudioCraft models (MusicGen, AudioGen, EnCodec) from training checkpoints to lightweight release-ready formats and loading them for inference.
Description
This workflow covers the complete model export pipeline for AudioCraft. After training a model using Dora, the training checkpoint contains optimizer state, EMA weights, FSDP sharding information, and other training artifacts that are unnecessary for inference. This workflow exports only the essential model weights and configuration, bundles the language model with its companion EnCodec tokenizer, and demonstrates how to load the exported model using the high-level generation API. It also covers fine-tuning checkpoint preparation, including the special case of converting mono models for stereo fine-tuning.
Usage
Execute this workflow after completing model training when you need to create a distributable, inference-ready model package. Also use this workflow when preparing a pretrained model for fine-tuning with modified architecture (e.g., mono to stereo conversion).
Execution Steps
Step 1: Locate Training Checkpoint
Identify the training checkpoint to export using the Dora experiment signature. Each experiment has a unique signature (hash) that maps to a specific folder containing checkpoints. Use the Dora API to resolve the signature to a filesystem path.
Key considerations:
- Use train.main.get_xp_from_sig('SIG') to get the experiment object
- The checkpoint file is at xp.folder / 'checkpoint.th'
- For FSDP-trained models, the best state may be in fsdp_best_state instead of best_state
- Verify the training completed successfully before exporting
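The step above can be sketched as follows. `checkpoint_path` is a small hypothetical helper that only encodes the folder layout; the commented lines use the `train.main.get_xp_from_sig` API named above and assume AudioCraft is installed.

```python
from pathlib import Path

def checkpoint_path(xp_folder) -> Path:
    """The training checkpoint lives at <experiment folder>/checkpoint.th."""
    return Path(xp_folder) / 'checkpoint.th'

# With audiocraft installed and a finished run:
#   from audiocraft import train
#   xp = train.main.get_xp_from_sig('SIG')   # 'SIG' is the Dora experiment hash
#   ckpt = checkpoint_path(xp.folder)
#   assert ckpt.exists(), 'run training to completion before exporting'
```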
Step 2: Export Language Model
Export the language model (MusicGen, AudioGen, MAGNeT, or JASCO) checkpoint to a lightweight format. The export function extracts only the best model state dictionary and the Hydra configuration, discarding optimizer state, training history, and FSDP metadata.
What gets exported:
- best_state: the model weights achieving the best validation metric
- xp.cfg: the Hydra configuration serialized as YAML
- version: AudioCraft version for compatibility tracking
- exported: flag marking this as an export checkpoint
Key considerations:
- FSDP models store best state differently (fsdp_best_state.model)
- The exported file is significantly smaller than the training checkpoint
- Output is a standard torch.save dictionary
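A minimal sketch of this step. `pick_best_state` is a hypothetical helper that mirrors the FSDP note above; the actual export is the single `export.export_lm` call shown in the comments, with an illustrative output path.

```python
def pick_best_state(pkg: dict) -> dict:
    """Select model weights from a loaded training checkpoint package."""
    if pkg.get('fsdp_best_state'):            # FSDP runs store weights here
        return pkg['fsdp_best_state']['model']
    return pkg['best_state']['model']

# With audiocraft installed, the export itself is one call:
#   from audiocraft.utils import export
#   from audiocraft import train
#   xp = train.main.get_xp_from_sig('SIG')
#   export.export_lm(xp.folder / 'checkpoint.th',
#                    '/checkpoints/my_audio_lm/state_dict.bin')
```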
Step 3: Export Compression Model
Bundle the EnCodec compression model that was used during training. This step depends on whether you trained your own EnCodec or used a pretrained one.
Two cases:
- Custom EnCodec: export the trained EnCodec checkpoint using export_encodec()
- Pretrained EnCodec: create a reference pointer using export_pretrained_compression_model()
Key considerations:
- When using a pretrained model, only a reference string is stored (not the actual weights)
- The reference will trigger automatic download from HuggingFace at load time
- Both files must be in the same directory for the loader to find them
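Both cases can be sketched as below. The commented calls are the `export_encodec` and `export_pretrained_compression_model` functions named in this step; the paths and the `facebook/encodec_32khz` name are illustrative, and `is_pretrained_reference` is a hypothetical heuristic for this sketch only.

```python
def is_pretrained_reference(source: str) -> bool:
    """Heuristic for this sketch: pretrained references are model names
    (e.g. 'facebook/encodec_32khz'), not checkpoint files on disk."""
    return not source.endswith('.th')

# Case 1: you trained your own EnCodec (a separate Dora signature):
#   from audiocraft.utils import export
#   from audiocraft import train
#   xp = train.main.get_xp_from_sig('SIG_OF_ENCODEC')
#   export.export_encodec(xp.folder / 'checkpoint.th',
#                         '/checkpoints/my_audio_lm/compression_state_dict.bin')

# Case 2: you used a pretrained EnCodec; only a reference is written, and the
# weights are downloaded from HuggingFace when the model is loaded:
#   export.export_pretrained_compression_model(
#       'facebook/encodec_32khz',
#       '/checkpoints/my_audio_lm/compression_state_dict.bin')
```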
Step 4: Organize Export Directory
Structure the exported files in a directory that the AudioCraft loader expects. The directory must contain both the language model state dict and the compression model state dict with specific filenames.
Required directory structure:
- state_dict.bin - the exported language model weights
- compression_state_dict.bin - the exported or referenced EnCodec weights
- Both files must be in the same parent directory
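The layout above can be checked with a few lines of standard-library Python; the filenames are the required ones from this step, while `check_export_dir` itself is a hypothetical convenience.

```python
from pathlib import Path

REQUIRED_FILES = ('state_dict.bin', 'compression_state_dict.bin')

def check_export_dir(folder) -> list:
    """Return the required export files missing from `folder` (empty list == OK)."""
    folder = Path(folder)
    return [name for name in REQUIRED_FILES if not (folder / name).is_file()]
```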
Step 5: Validate Exported Model
Load the exported model using the high-level generation API and verify it produces valid output. This confirms that the export process preserved the model correctly and that inference works end-to-end.
Validation approach:
- Load via MusicGen.get_pretrained('/path/to/export/dir/')
- Run a short generation with a test description
- Compare output quality to pre-export generation
- Verify sample rate and duration match expectations
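A sketch of the validation loop. `expected_samples` is a hypothetical helper for the duration check; the commented lines use the `MusicGen.get_pretrained` API named above, with an illustrative prompt and duration.

```python
def expected_samples(sample_rate: int, duration_sec: float) -> int:
    """Number of audio samples a generation of `duration_sec` should contain."""
    return int(sample_rate * duration_sec)

# With the export directory from Step 4:
#   from audiocraft.models import MusicGen
#   model = MusicGen.get_pretrained('/checkpoints/my_audio_lm/')
#   model.set_generation_params(duration=4)
#   wav = model.generate(['90s rock song with electric guitar'])  # [B, C, T]
#   assert wav.shape[-1] == expected_samples(model.sample_rate, 4)
```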
Step 6: Prepare Fine-Tuning Checkpoints (Optional)
For special fine-tuning scenarios such as converting a mono model to stereo, manually modify the checkpoint structure. This involves duplicating embedding and linear layer weights to accommodate the doubled codebook count (left + right channels interleaved).
Mono to stereo conversion:
- Load the exported state dict
- Duplicate embedding and linear weights for paired codebooks
- Save with the training checkpoint format: dict with best_state.model key
- Use as continue_from target without the //pretrained/ prefix
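The duplication above can be sketched as a key remapping, under two assumptions to verify against your checkpoint: that the LM state dict names per-codebook parameters `emb.{k}.…` and `linears.{k}.…`, and that the stereo model interleaves left/right codebooks (left at index 2k, right at 2k+1). Per-codebook shapes are unchanged, so each mono weight can simply be reused for both channels.

```python
import re

# Assumed key pattern for per-codebook parameters; verify on your model.
_CODEBOOK_KEY = re.compile(r'^(emb|linears)\.(\d+)\.(.+)$')

def mono_to_stereo_state(mono_state: dict) -> dict:
    """Reuse each mono codebook's embedding/output-linear weights for both
    stereo channels, leaving all other (shared) weights untouched."""
    out = {}
    for key, value in mono_state.items():
        m = _CODEBOOK_KEY.match(key)
        if m is None:
            out[key] = value                         # shared transformer weights
            continue
        prefix, k, rest = m.group(1), int(m.group(2)), m.group(3)
        out[f'{prefix}.{2 * k}.{rest}'] = value      # left channel
        out[f'{prefix}.{2 * k + 1}.{rest}'] = value  # right channel
    return out

# Save in the training-checkpoint format this step describes, then point
# continue_from at the file (without the //pretrained/ prefix):
#   torch.save({'best_state': {'model': mono_to_stereo_state(sd)}},
#              '/checkpoints/stereo_finetune.th')
```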