Principle:Facebookresearch Audiocraft JASCO Generation Configuration

Overview

JASCO Generation Configuration governs how multi-source classifier-free guidance (CFG) is parameterized for JASCO's flow matching generation process. Unlike MusicGen's single CFG coefficient, JASCO decomposes guidance into multiple terms, each computed with a different subset of the conditioning sources (all conditions together, text only, or no conditions). This multi-source CFG approach lets users set separately how strongly the full condition set (text plus the temporal conditions: chords, drums, melody) and the text description alone affect the generated music.

Theoretical Background

Standard Classifier-Free Guidance

In standard CFG (Ho & Salimans, 2022), the guided prediction is:

v_guided = v_unconditional + cfg_coef * (v_conditional - v_unconditional)

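This provides a single knob: a higher cfg_coef pushes samples toward the condition (typically improving adherence and perceived quality), while a lower value preserves diversity. With cfg_coef = 1 the result is the plain conditional prediction, and with cfg_coef = 0 it is the unconditional one.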
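The formula above can be sketched in a few lines of Python. This is a minimal illustration on plain lists of floats, not AudioCraft code; the function name `cfg_combine` and the toy vectors are assumptions made for the example.

```python
from typing import List, Sequence


def cfg_combine(v_cond: Sequence[float],
                v_uncond: Sequence[float],
                cfg_coef: float) -> List[float]:
    """Element-wise v_uncond + cfg_coef * (v_cond - v_uncond).

    Hypothetical helper for illustration only.
    """
    return [u + cfg_coef * (c - u) for c, u in zip(v_cond, v_uncond)]


# Toy predictions (made-up numbers).
v_conditional = [1.0, 2.0]
v_unconditional = [0.0, 1.0]

plain_uncond = cfg_combine(v_conditional, v_unconditional, 0.0)  # unconditional
plain_cond = cfg_combine(v_conditional, v_unconditional, 1.0)    # conditional
amplified = cfg_combine(v_conditional, v_unconditional, 3.0)     # extrapolated
```

Coefficients above 1 extrapolate past the conditional prediction, away from the unconditional one, which is where the stronger conditioning effect comes from.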

Multi-Source Classifier-Free Guidance

JASCO extends this to multiple conditioning sources. The guided vector field is a weighted sum of multiple terms, each computed with a different subset of conditions:

v_guided = w_all * v_all + w_txt * v_txt + w_null * v_null

Where:

v_all is the vector field with all conditions active (text + temporal)
v_txt is the vector field with text only (temporal conditions dropped)
v_null is the vector field with no conditions (fully unconditional)
w_all and w_txt are set by the user (cfg_coef_all and cfg_coef_txt); w_null = 1 - w_all - w_txt, so the weights always sum to 1

This decomposition allows independent control over:

How much the generation follows the temporal structure (chords, drums, melody) -- via cfg_coef_all
How much the generation follows the text description specifically -- via cfg_coef_txt
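The weighted sum described in this section can be sketched as follows. This is an illustrative example under stated assumptions: the helper `multi_source_cfg` and the toy vectors are hypothetical, and real vector fields would be tensors rather than Python lists.

```python
from typing import List, Sequence


def multi_source_cfg(v_all: Sequence[float],
                     v_txt: Sequence[float],
                     v_null: Sequence[float],
                     cfg_coef_all: float,
                     cfg_coef_txt: float) -> List[float]:
    """Weighted sum of three vector fields whose weights sum to 1.

    Hypothetical helper for illustration only.
    """
    w_null = 1.0 - cfg_coef_all - cfg_coef_txt  # implicit null weight
    return [cfg_coef_all * a + cfg_coef_txt * t + w_null * n
            for a, t, n in zip(v_all, v_txt, v_null)]


# Toy vector fields (made-up numbers).
v_all = [1.0, 2.0]
v_txt = [0.5, 1.0]
v_null = [0.0, 1.0]

guided = multi_source_cfg(v_all, v_txt, v_null,
                          cfg_coef_all=3.0, cfg_coef_txt=1.0)

# If all three fields agree, guidance leaves them unchanged,
# because the weights sum to 1.
same = [2.0, -1.0]
unchanged = multi_source_cfg(same, same, same,
                             cfg_coef_all=5.0, cfg_coef_txt=0.0)
```

Note that the null weight is usually negative (here 1 - 3 - 1 = -3), so the unconditional field is subtracted rather than averaged in.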

Key Concepts

Parameter	Default	Role
`cfg_coef_all`	5.0	Weight for the fully-conditioned term (all conditions: text + chords + drums + melody)
`cfg_coef_txt`	0.0	Weight for the text-only conditioned term (temporal conditions dropped)
Null weight	`1 - cfg_coef_all - cfg_coef_txt`	Implicit weight for the unconditional term, computed automatically

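When cfg_coef_txt = 0.0 (the default), the text-only term drops out and the guidance simplifies to a two-term scheme: fully conditioned vs. unconditional, with weights cfg_coef_all and 1 - cfg_coef_all (5.0 and -4.0 at the defaults). This is the same form as standard CFG, applied to the flow matching vector field.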
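The reduction can be checked by substituting w_txt = 0 into the multi-source formula. The derivation uses only the symbols defined earlier on this page:

```latex
\begin{align*}
v_{\text{guided}}
  &= w_{\text{all}}\, v_{\text{all}} + 0 \cdot v_{\text{txt}}
     + (1 - w_{\text{all}})\, v_{\text{null}} \\
  &= v_{\text{null}} + w_{\text{all}}\,\bigl(v_{\text{all}} - v_{\text{null}}\bigr)
\end{align*}
```

This matches the standard CFG expression with cfg_coef = cfg_coef_all, v_conditional = v_all, and v_unconditional = v_null.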

CFG Term Classes

JASCO implements three distinct CFG term types:

Term Class	Conditions Retained	Conditions Dropped	Purpose
`AllCFGTerm`	All (text + symbolic + wav)	None	Fully conditioned generation
`TextCFGTerm`	Text only	Symbolic (chords, melody) and wav (drums)	Text-only guidance
`NullCFGTerm`	None	All	Unconditional baseline

Terms with negligible weight (absolute value below 1e-6) are removed to save computation.
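The pruning rule can be sketched as below. This is a simplified illustration and not the AudioCraft implementation: the `Term` dataclass, the `build_terms` function, and the condition-group labels are hypothetical stand-ins for the term classes in the table above.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Term:
    """Hypothetical stand-in for one CFG term."""
    name: str
    weight: float
    dropped: FrozenSet[str]  # condition groups removed for this term


def build_terms(cfg_coef_all: float,
                cfg_coef_txt: float,
                eps: float = 1e-6) -> List[Term]:
    """Build the three candidate terms and keep those with |weight| >= eps."""
    candidates = [
        Term("all", cfg_coef_all, frozenset()),
        Term("text", cfg_coef_txt, frozenset({"symbolic", "wav"})),
        Term("null", 1.0 - cfg_coef_all - cfg_coef_txt,
             frozenset({"text", "symbolic", "wav"})),
    ]
    return [t for t in candidates if abs(t.weight) >= eps]


default_terms = build_terms(5.0, 0.0)   # text term pruned
single_term = build_terms(1.0, 0.0)     # text and null terms pruned
three_terms = build_terms(3.0, 1.0)     # nothing pruned
```

Each surviving term costs one model evaluation per step, so with the defaults only two forward passes are needed instead of three.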

Design Rationale

Compositional control: The multi-source decomposition lets users adjust temporal adherence and semantic adherence independently, rather than trading one for the other.
Backward compatibility: With cfg_coef_txt=0.0, the system reduces to standard two-term CFG, making it a strict generalization.
Extensibility via kwargs: Additional generation parameters can be passed through **kwargs and are forwarded to the underlying FlowMatchingModel.generate() call, supporting future extensions.
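The kwargs pass-through described above can be sketched with a pair of stand-in classes. Everything here is hypothetical: `FakeFlowModel`, `JascoLikeWrapper`, and the option name `hypothetical_option` are invented for illustration and are not AudioCraft classes or parameters. Only the two CFG parameter names and their defaults come from this page.

```python
from typing import Any, Dict


class FakeFlowModel:
    """Hypothetical stand-in for the underlying model; echoes what it receives."""

    def generate(self, **params: Any) -> Dict[str, Any]:
        return dict(params)


class JascoLikeWrapper:
    """Hypothetical wrapper that stores generation parameters and forwards them."""

    def __init__(self) -> None:
        self.model = FakeFlowModel()
        self.generation_params: Dict[str, Any] = {}
        self.set_generation_params()  # start from the documented defaults

    def set_generation_params(self,
                              cfg_coef_all: float = 5.0,
                              cfg_coef_txt: float = 0.0,
                              **kwargs: Any) -> None:
        # Named CFG weights first, then any extra options untouched.
        self.generation_params = {
            "cfg_coef_all": cfg_coef_all,
            "cfg_coef_txt": cfg_coef_txt,
            **kwargs,
        }

    def generate(self) -> Dict[str, Any]:
        # Everything stored above is forwarded to the model call.
        return self.model.generate(**self.generation_params)


defaults_received = JascoLikeWrapper().generate()

wrapper = JascoLikeWrapper()
wrapper.set_generation_params(cfg_coef_all=3.0, cfg_coef_txt=1.0,
                              hypothetical_option=42)
custom_received = wrapper.generate()
```

Because unknown options are stored rather than validated, a new model-side parameter needs no change to the wrapper in this sketch; the trade-off is that a misspelled option name would only fail once it reaches the model.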
