Heuristic:Facebookresearch Audiocraft Generation Parameter Defaults

Knowledge Sources	AudioCraft MusicGen Double CFG
Domains	Audio_Generation, LLMs, Optimization
Last Updated	2026-02-13 23:00 GMT

Overview

Tuned default generation parameters for MusicGen (top_k=250, cfg_coef=3.0, duration=30s) and JASCO (cfg_coef_all=5.0) that balance quality and diversity.

Description

AudioCraft's generation models ship with tuned default parameters for sampling and classifier-free guidance (CFG). These defaults are the values set in the library code and are intended to balance output quality, diversity, and text adherence. MusicGen uses single-source CFG with coefficient 3.0, while JASCO uses multi-source CFG with two coefficients: cfg_coef_all, applied to all conditions together (text, chords, drums, melody), and cfg_coef_txt, applied to the text condition alone.

Key parameters include top_k (number of highest-probability tokens kept as sampling candidates), top_p (nucleus sampling threshold; 0 disables it), temperature (sampling randomness), cfg_coef (guidance strength), and extend_stride (how many seconds each step advances during long-form generation).

Usage

Use these defaults as starting points when generating audio. Adjust parameters based on your needs: increase cfg_coef for stronger text adherence (at the cost of diversity), lower top_k for more focused outputs, or increase temperature for more creative/varied results.
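As a minimal sketch of "start from the defaults, then adjust", the snippet below keeps the MusicGen defaults from this page in a plain dictionary and applies overrides on top. The helper `with_overrides` is hypothetical (it is not part of AudioCraft); only the parameter names and default values come from this page.

```python
# MusicGen defaults as listed on this page (set_generation_params signature).
MUSICGEN_DEFAULTS = {
    "use_sampling": True,
    "top_k": 250,
    "top_p": 0.0,          # 0.0 disables nucleus sampling
    "temperature": 1.0,
    "duration": 30.0,      # seconds
    "cfg_coef": 3.0,
    "two_step_cfg": False,
    "extend_stride": 18,   # seconds
}


def with_overrides(defaults, **overrides):
    """Hypothetical helper: return a copy of `defaults` with some values replaced.

    Unknown parameter names are rejected so that typos fail early.
    """
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown generation parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


# Stronger text adherence and more focused sampling than the defaults.
stricter = with_overrides(MUSICGEN_DEFAULTS, cfg_coef=4.5, top_k=100)
print(stricter)
```

With AudioCraft installed, such a dictionary could then be passed on as `model.set_generation_params(**stricter)`, where `model` is a loaded MusicGen instance.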

The Insight (Rule of Thumb)

Action: Use the following default parameters as baselines for generation.
Value:
- MusicGen: top_k=250, top_p=0.0 (disabled), temperature=1.0, cfg_coef=3.0, duration=30.0s, extend_stride=18s
- AudioGen: Same as MusicGen but duration=10.0s (environmental sounds are typically shorter)
- JASCO: cfg_coef_all=5.0 (all conditions), cfg_coef_txt=0.0 (text-only additional guidance), euler_steps=100 (ODE solver steps)
Trade-off:
- Higher cfg_coef → stronger text adherence but less diversity and potential artifacts
- Lower top_k → more focused but potentially repetitive outputs
- Higher temperature → more diverse but potentially lower quality
- top_p > 0 switches from top-k to nucleus sampling (mutually exclusive in practice)
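The interaction between temperature, top_k, and top_p described above can be illustrated with a small pure-Python sketch. It is not AudioCraft's implementation (which operates on tensors); the function names are hypothetical, and the only behavior taken from this page is that top_p > 0 takes precedence over top_k.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def candidate_probs(logits, top_k=250, top_p=0.0, temperature=1.0):
    """Return {token_id: renormalized probability} for the eligible tokens."""
    probs = softmax(logits, temperature)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_p > 0.0:
        # Nucleus sampling: smallest prefix whose cumulative mass reaches top_p.
        kept, mass = [], 0.0
        for i in ranked:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
    elif top_k > 0:
        # Top-k sampling: keep the k most probable tokens.
        kept = ranked[:top_k]
    else:
        kept = ranked
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    cands = candidate_probs(logits, **kwargs)
    ids = list(cands)
    return rng.choices(ids, weights=[cands[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]
print(sorted(candidate_probs(toy_logits, top_k=2)))
print(sorted(candidate_probs(toy_logits, top_k=2, top_p=0.9)))
```

In the second call top_k is ignored because top_p is positive, which is why the two settings are effectively mutually exclusive.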

Reasoning

With top_k=250, sampling at each step is restricted to the 250 most probable tokens out of a codebook of about 2048 entries, roughly the top 12% of candidates. This preserves diversity while filtering out the low-probability tail, whose tokens tend to produce noise.
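The 12% figure is simple arithmetic, sketched below so that other top_k values can be compared. The codebook size of 2048 is the approximate figure quoted above, and `kept_fraction` is a hypothetical helper name.

```python
CODEBOOK_SIZE = 2048  # approximate codebook size quoted on this page


def kept_fraction(top_k, codebook_size=CODEBOOK_SIZE):
    """Fraction of the codebook that survives top-k filtering."""
    return min(top_k, codebook_size) / codebook_size


for k in (50, 250, 1000):
    print(f"top_k={k}: {kept_fraction(k):.1%} of candidates kept")
```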

The cfg_coef=3.0 for MusicGen is moderate compared to text-to-image generation, where guidance scales around 7.5 are common. Audio generation appears to be more sensitive to over-guidance: artifacts show up as repetitive patterns or distortion.
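To make the role of cfg_coef concrete, here is a sketch of the standard single-source CFG combination on toy logits. It uses the usual formulation (unconditional prediction plus a scaled difference). The function name is hypothetical and plain lists stand in for tensors.

```python
def apply_cfg(cond_logits, uncond_logits, cfg_coef=3.0):
    """Standard classifier-free guidance: uncond + cfg_coef * (cond - uncond).

    cfg_coef=1.0 reproduces the conditional logits; larger values push the
    result further in the direction the condition suggests.
    """
    return [u + cfg_coef * (c - u) for c, u in zip(cond_logits, uncond_logits)]


cond = [2.0, 0.0]
uncond = [1.0, 0.5]
print(apply_cfg(cond, uncond, cfg_coef=1.0))  # same as cond
print(apply_cfg(cond, uncond, cfg_coef=3.0))  # gap between tokens is amplified
```

The gap between the two logits grows with cfg_coef, which is one way to see why large coefficients reduce diversity: the guided distribution becomes much sharper than the conditional one.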

JASCO uses a higher base cfg_coef_all=5.0, plausibly because multi-source conditioning dilutes the influence of each individual condition and the higher coefficient compensates. The optional cfg_coef_txt adds extra guidance for the text condition on top of cfg_coef_all. This is analogous to the "double CFG" that cfg_coef_beta controls in MusicGen, where the text condition is pushed more than the audio conditioning (see paragraph 4.3 of https://arxiv.org/pdf/2407.12563).
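One way to picture two-coefficient guidance is a weighted sum of three predictions: one with all conditions, one with text only, and one unconditional. The form below is an assumption made for illustration and is not confirmed by this page or copied from the JASCO code; the function and argument names are hypothetical.

```python
def multi_source_cfg(all_cond, text_only, uncond, cfg_coef_all=5.0, cfg_coef_txt=0.0):
    """Assumed weighted-sum form of multi-source CFG (illustrative only).

    The weights sum to 1, so identical inputs are returned unchanged.
    """
    rest = 1.0 - cfg_coef_all - cfg_coef_txt
    return [
        cfg_coef_all * a + cfg_coef_txt * t + rest * u
        for a, t, u in zip(all_cond, text_only, uncond)
    ]


all_cond, text_only, uncond = [3.0], [2.5], [2.0]
# With cfg_coef_txt=0.0 this reduces to ordinary CFG with coefficient 5.0.
print(multi_source_cfg(all_cond, text_only, uncond))
# A positive cfg_coef_txt gives the text-only prediction extra weight.
print(multi_source_cfg(all_cond, text_only, uncond, cfg_coef_txt=1.0))
```

Under this assumed form, the default cfg_coef_txt=0.0 means the text-only branch contributes nothing and the scheme collapses to single-coefficient guidance over all conditions.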

The extend_stride=18s setting applies to long-form generation (>30s). With a 30-second generation window, each continuation chunk starts 18 seconds after the previous one and reuses its last 12 seconds as context (30 - 18 = 12), which maintains temporal coherence.
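The stride arithmetic can be checked with a short sketch that lists the generation windows for a given total duration. It assumes a fixed 30-second window and is a simplified model of the schedule, not AudioCraft's actual loop; `plan_windows` is a hypothetical name.

```python
def plan_windows(total, window=30.0, stride=18.0):
    """Return (start, end) times in seconds of successive generation windows.

    Each window starts `stride` seconds after the previous one, so
    consecutive full windows overlap by `window - stride` seconds.
    """
    if not 0 < stride < window:
        raise ValueError("stride must be positive and smaller than the window")
    windows = []
    start = 0.0
    while True:
        end = min(start + window, total)
        windows.append((start, end))
        if end >= total:
            break
        start += stride
    return windows


for start, end in plan_windows(60.0):
    print(f"generate {start:5.1f}s to {end:5.1f}s")
print("overlap between full windows:", 30.0 - 18.0, "seconds")
```

A smaller stride increases the overlap (more context, more computation), while a larger stride does the opposite.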

Code Evidence

MusicGen defaults from audiocraft/models/musicgen.py:96-132:

def set_generation_params(self, use_sampling: bool = True, top_k: int = 250,
                          top_p: float = 0.0, temperature: float = 1.0,
                          duration: float = 30.0, cfg_coef: float = 3.0,
                          two_step_cfg: bool = False, extend_stride: float = 18):

AudioGen shorter default duration from audiocraft/models/audiogen.py:63-75:

def set_generation_params(self, use_sampling: bool = True, top_k: int = 250,
                          top_p: float = 0.0, temperature: float = 1.0,
                          duration: float = 10.0, cfg_coef: float = 3.0,
                          ...):

JASCO multi-source CFG from audiocraft/models/jasco.py:66-82:

def set_generation_params(self, use_sampling: bool = True, top_k: int = 250,
                          top_p: float = 0.0, temperature: float = 1.0,
                          duration: float = 30.0,
                          cfg_coef_all: float = 5.0,
                          cfg_coef_txt: float = 0.0,
                          ...):

Double CFG documentation from audiocraft/models/musicgen.py:111-113:

cfg_coef_beta (float, optional): beta coefficient in double classifier free guidance.
    Should be only used for MusicGen melody if we want to push the text condition more than
    the audio conditioning. See paragraph 4.3 in https://arxiv.org/pdf/2407.12563
