Heuristic:Facebookresearch Audiocraft Generation Parameter Defaults
| Knowledge Sources | |
|---|---|
| Domains | Audio_Generation, LLMs, Optimization |
| Last Updated | 2026-02-13 23:00 GMT |
Overview
Tuned default generation parameters for MusicGen (top_k=250, cfg_coef=3.0, duration=30s) and JASCO (cfg_coef_all=5.0) that balance quality and diversity.
Description
AudioCraft's generation models use carefully tuned default parameters for sampling and classifier-free guidance (CFG). These defaults represent the paper-validated sweet spots for balancing output quality, diversity, and text adherence. MusicGen uses single-source CFG with coefficient 3.0, while JASCO uses multi-source CFG that allows independent weighting of text, chord, drum, and melody conditions.
Key parameters include top_k (number of most likely tokens to sample from), top_p (nucleus sampling threshold), temperature (sampling randomness), cfg_coef (guidance strength), and extend_stride (how far the generation window advances on each step of long-form generation; the remainder of the window is reused as context).
Usage
Use these defaults as starting points when generating audio. Adjust parameters based on your needs: increase cfg_coef for stronger text adherence (at the cost of diversity), lower top_k for more focused outputs, or increase temperature for more creative/varied results.
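As a minimal sketch of these adjustments, the defaults can be kept in plain dictionaries and overridden per run. The helper generation_params and the override values (cfg_coef=4.5, top_k=100, temperature=1.2) are hypothetical and chosen only for illustration. The AudioCraft calls appear as comments so that the snippet runs without the library installed.

```python
# Defaults taken from the set_generation_params signatures quoted on this page.
SHARED_SAMPLING = {
    "use_sampling": True,
    "top_k": 250,
    "top_p": 0.0,        # 0.0 disables nucleus sampling
    "temperature": 1.0,
    "cfg_coef": 3.0,
}
MUSICGEN_DEFAULTS = {**SHARED_SAMPLING, "duration": 30.0, "extend_stride": 18}
AUDIOGEN_DEFAULTS = {**SHARED_SAMPLING, "duration": 10.0}


def generation_params(defaults, **overrides):
    """Hypothetical helper: merge overrides into defaults, rejecting unknown names."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown generation parameter(s): {sorted(unknown)}")
    return {**defaults, **overrides}


# Stronger text adherence and a narrower candidate pool (illustrative values).
focused = generation_params(MUSICGEN_DEFAULTS, cfg_coef=4.5, top_k=100)

# More varied output (illustrative value).
varied = generation_params(MUSICGEN_DEFAULTS, temperature=1.2)

# With AudioCraft installed, the dictionary can be forwarded to a model:
#   from audiocraft.models import MusicGen
#   model = MusicGen.get_pretrained("facebook/musicgen-small")
#   model.set_generation_params(**focused)
#   wav = model.generate(["lo-fi piano with soft drums"])
```

Changing one parameter at a time from the defaults makes it easier to attribute a change in output quality to a specific setting.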
The Insight (Rule of Thumb)
- Action: Use the following default parameters as baselines for generation.
- Value:
  - MusicGen: top_k=250, top_p=0.0 (disabled), temperature=1.0, cfg_coef=3.0, duration=30.0s, extend_stride=18s
  - AudioGen: same top_k, top_p, temperature, and cfg_coef as MusicGen, but duration=10.0s (environmental sounds are typically shorter)
  - JASCO: cfg_coef_all=5.0 (all conditions), cfg_coef_txt=0.0 (text-only additional guidance), euler_steps=100 (ODE solver steps)
- Trade-off:
  - Higher cfg_coef → stronger text adherence but less diversity and potential artifacts
  - Lower top_k → more focused but potentially repetitive outputs
  - Higher temperature → more diverse but potentially lower quality
  - top_p > 0 switches from top-k to nucleus sampling (mutually exclusive in practice)
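The sampling trade-offs above can be illustrated with a small, dependency-free sketch of temperature scaling, top-k filtering, and nucleus (top-p) filtering. This is not AudioCraft's code: the function names are hypothetical, and the only behavior borrowed from the text is that a positive top_p takes precedence over top_k.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def keep_top_k(probs, k):
    """Keep only the k most likely tokens and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def keep_top_p(probs, p):
    """Keep the smallest set of most likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = set(), 0.0
    for i in order:
        kept.add(i)
        mass += probs[i]
        if mass >= p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]


def sample_token(logits, top_k=250, top_p=0.0, temperature=1.0, rng=random):
    """Sample one token index; a positive top_p overrides top_k."""
    probs = softmax(logits, temperature)
    if top_p > 0.0:
        probs = keep_top_p(probs, top_p)
    elif top_k > 0:
        probs = keep_top_k(probs, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Lowering top_k or top_p shrinks the set of candidate tokens, which is why outputs become more focused and, at the extreme, repetitive.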
Reasoning
The top_k=250 default restricts sampling to the 250 most likely tokens out of a codebook of roughly 2048 entries, i.e. about the top 12% of candidates (250 / 2048 ≈ 0.12). This preserves diversity while filtering out low-probability tokens that would produce noise.
The cfg_coef=3.0 for MusicGen is moderate compared to text-to-image generation, where guidance scales around 7.5 are common. Audio generation is more sensitive to over-guidance: artifacts manifest as repetitive patterns or distortion.
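For reference, the standard single-source classifier-free guidance combination can be sketched as follows. The function name apply_cfg is hypothetical, and plain Python lists stand in for the model's conditional and unconditional outputs.

```python
def apply_cfg(cond, uncond, cfg_coef=3.0):
    """Standard CFG: move from the unconditional output toward the conditional one.

    cfg_coef = 0 returns the unconditional output, cfg_coef = 1 returns the
    conditional output, and values above 1 extrapolate past it.
    """
    return [u + cfg_coef * (c - u) for c, u in zip(cond, uncond)]


cond = [1.0, 2.0]     # outputs computed with the text condition
uncond = [0.0, 1.0]   # outputs computed with the condition dropped
guided = apply_cfg(cond, uncond, cfg_coef=3.0)
```

Because the coefficient multiplies the gap between the two outputs, large values amplify small conditional differences, which is consistent with the artifacts described above.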
JASCO uses a higher base cfg_coef_all=5.0 because multi-source conditioning dilutes the per-condition influence; the higher coefficient compensates. The optional cfg_coef_txt adds extra guidance for the text condition on top of cfg_coef_all. MusicGen exposes a comparable "double CFG" option, cfg_coef_beta, for its melody-conditioned model; its docstring refers to paragraph 4.3 of https://arxiv.org/pdf/2407.12563.
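One plausible way to combine two guidance terms is sketched below. The exact expression used by JASCO is not quoted on this page, so the formula and the function name apply_multi_source_cfg are assumptions and should be checked against audiocraft/models/jasco.py before being relied on.

```python
def apply_multi_source_cfg(all_cond, text_only, uncond,
                           cfg_coef_all=5.0, cfg_coef_txt=0.0):
    """Assumed form: one guidance term for all conditions, one for text alone.

    all_cond  -- model output with every condition active
    text_only -- model output with only the text condition active
    uncond    -- model output with every condition dropped
    """
    return [
        u + cfg_coef_all * (a - u) + cfg_coef_txt * (t - u)
        for a, t, u in zip(all_cond, text_only, uncond)
    ]


# With cfg_coef_txt = 0.0 (the default) this reduces to single-source CFG.
default_mix = apply_multi_source_cfg([1.0], [0.5], [0.0])

# A positive cfg_coef_txt pushes the text condition further (illustrative value).
text_boosted = apply_multi_source_cfg([1.0], [0.5], [0.0], cfg_coef_txt=2.0)
```

Under this assumed form, leaving cfg_coef_txt at 0.0 means text is weighted only through the joint term, which matches its description as additional, text-only guidance.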
The extend_stride=18s for long-form generation (>30s) means each continuation chunk overlaps the previous by 12 seconds (30 - 18 = 12), maintaining temporal coherence.
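The overlap arithmetic can be checked with a small planner. This is an illustrative sketch rather than AudioCraft's implementation: plan_windows and overlaps are hypothetical helpers, and the 30-second maximum window is taken from the text above.

```python
def plan_windows(total, max_duration=30.0, extend_stride=18.0):
    """Return (start, end) times in seconds for each generation window."""
    if not 0 < extend_stride < max_duration:
        raise ValueError("extend_stride must be positive and below max_duration")
    windows = []
    start = 0.0
    while True:
        end = min(start + max_duration, total)
        windows.append((start, end))
        if end >= total:
            break
        start += extend_stride  # advance by the stride, not by the full window
    return windows


def overlaps(windows):
    """Seconds of audio each window shares with the one before it."""
    return [prev[1] - cur[0] for prev, cur in zip(windows, windows[1:])]


windows = plan_windows(60.0)
shared = overlaps(windows)
```

For a 60-second target this yields three windows starting at 0, 18, and 36 seconds, each sharing 12 seconds with its predecessor. A larger stride needs fewer windows but leaves less context for each continuation.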
Code Evidence
MusicGen defaults from audiocraft/models/musicgen.py:96-132:
def set_generation_params(self, use_sampling: bool = True, top_k: int = 250,
                          top_p: float = 0.0, temperature: float = 1.0,
                          duration: float = 30.0, cfg_coef: float = 3.0,
                          two_step_cfg: bool = False, extend_stride: float = 18):
AudioGen shorter default duration from audiocraft/models/audiogen.py:63-75:
def set_generation_params(self, use_sampling: bool = True, top_k: int = 250,
                          top_p: float = 0.0, temperature: float = 1.0,
                          duration: float = 10.0, cfg_coef: float = 3.0,
                          ...):
JASCO multi-source CFG from audiocraft/models/jasco.py:66-82:
def set_generation_params(self, use_sampling: bool = True, top_k: int = 250,
                          top_p: float = 0.0, temperature: float = 1.0,
                          duration: float = 30.0,
                          cfg_coef_all: float = 5.0,
                          cfg_coef_txt: float = 0.0,
                          ...):
Double CFG documentation from audiocraft/models/musicgen.py:111-113:
cfg_coef_beta (float, optional): beta coefficient in double classifier free guidance.
Should be only used for MusicGen melody if we want to push the text condition more than
the audio conditioning. See paragraph 4.3 in https://arxiv.org/pdf/2407.12563