Principle:Facebookresearch Audiocraft JASCO Generation Configuration
Overview
JASCO Generation Configuration governs how multi-source classifier-free guidance (CFG) is parameterized for JASCO's flow matching generation process. Unlike MusicGen's single CFG coefficient, JASCO decomposes guidance into multiple terms, each computed with a different subset of conditioning sources (all conditions together, text only, or no conditions). This multi-source CFG approach lets users control separately how strongly the full set of conditions (text, chords, drums, melody) and the text description alone affect the generated music.
Theoretical Background
Standard Classifier-Free Guidance
In standard CFG (Ho & Salimans, 2022), the guided prediction is:
v_guided = v_unconditional + cfg_coef * (v_conditional - v_unconditional)
This provides a single knob to control the trade-off between sample quality (higher CFG) and diversity (lower CFG).
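The formula above can be sketched in plain Python, with vector fields represented as lists of floats. The function name `standard_cfg` and the toy values are illustrative assumptions, not part of Audiocraft.

```python
from typing import List


def standard_cfg(v_cond: List[float], v_uncond: List[float], cfg_coef: float) -> List[float]:
    """Standard CFG: v_uncond + cfg_coef * (v_cond - v_uncond), element-wise."""
    return [u + cfg_coef * (c - u) for c, u in zip(v_cond, v_uncond)]


# Toy vector fields (hypothetical values, chosen so the arithmetic is exact).
v_cond = [1.0, 2.0]
v_uncond = [0.0, 1.0]

# cfg_coef = 1.0 returns the conditional prediction unchanged.
no_extra_guidance = standard_cfg(v_cond, v_uncond, 1.0)
# cfg_coef > 1.0 extrapolates away from the unconditional prediction.
strong_guidance = standard_cfg(v_cond, v_uncond, 3.0)
```

With `cfg_coef = 1.0` the result is the conditional prediction itself; larger values push the result further from the unconditional prediction.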
Multi-Source Classifier-Free Guidance
JASCO extends this to multiple conditioning sources. The guided vector field is a weighted sum of multiple terms, each computed with a different subset of conditions:
v_guided = w_all * v_all + w_txt * v_txt + w_null * v_null
Where:
- v_all is the vector field with all conditions active (text + temporal)
- v_txt is the vector field with text only (temporal conditions dropped)
- v_null is the vector field with no conditions (fully unconditional)
- w_null = 1 - w_all - w_txt (weights sum to 1)
This decomposition allows independent control over:
- How much the generation follows the temporal structure (chords, drums, melody) -- via cfg_coef_all, which sets the weight w_all
- How much the generation follows the text description specifically -- via cfg_coef_txt, which sets the weight w_txt
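The weighted sum can be sketched as follows. This is a minimal illustration in plain Python, assuming the three vector fields are already computed and represented as lists of floats; `multi_source_cfg` is a hypothetical helper, not an Audiocraft function.

```python
from typing import List


def multi_source_cfg(
    v_all: List[float],
    v_txt: List[float],
    v_null: List[float],
    cfg_coef_all: float,
    cfg_coef_txt: float,
) -> List[float]:
    """Combine three vector fields with weights that sum to 1."""
    w_all = cfg_coef_all
    w_txt = cfg_coef_txt
    w_null = 1.0 - w_all - w_txt  # implicit weight of the unconditional term
    return [
        w_all * a + w_txt * t + w_null * n
        for a, t, n in zip(v_all, v_txt, v_null)
    ]


# Toy vector fields (hypothetical values, chosen so the arithmetic is exact).
v_all = [1.0, 2.0]
v_txt = [0.5, 1.0]
v_null = [0.0, 1.0]

guided = multi_source_cfg(v_all, v_txt, v_null, cfg_coef_all=2.0, cfg_coef_txt=1.0)
```

In this toy case the null weight is 1 - 2 - 1 = -2, so the unconditional field is subtracted rather than added.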
Key Concepts
| Parameter | Default | Role |
|---|---|---|
| cfg_coef_all | 5.0 | Weight for the fully-conditioned term (all conditions: text + chords + drums + melody) |
| cfg_coef_txt | 0.0 | Weight for the text-only conditioned term (temporal conditions dropped) |
| Null weight | 1 - cfg_coef_all - cfg_coef_txt | Implicit weight for the unconditional term, computed automatically |
When cfg_coef_txt = 0.0 (the default), the text-only term vanishes and the guidance reduces to a two-term scheme, v_guided = cfg_coef_all * v_all + (1 - cfg_coef_all) * v_null. This is the standard CFG formula with cfg_coef = cfg_coef_all, applied to the flow matching vector field.
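As a quick numeric check of the default configuration, the sketch below computes the three weights and compares the two-term sum with the standard CFG form. The helper `cfg_weights` and the toy vectors are assumptions made for illustration only.

```python
from typing import Tuple


def cfg_weights(cfg_coef_all: float = 5.0, cfg_coef_txt: float = 0.0) -> Tuple[float, float, float]:
    """Return (w_all, w_txt, w_null), with w_null chosen so the weights sum to 1."""
    w_null = 1.0 - cfg_coef_all - cfg_coef_txt
    return cfg_coef_all, cfg_coef_txt, w_null


w_all, w_txt, w_null = cfg_weights()  # documented defaults: 5.0 and 0.0

# Toy vector fields (hypothetical values, chosen so the arithmetic is exact).
v_all = [1.0, 2.0]
v_null = [0.0, 1.0]

# Two-term weighted sum; the text term is omitted because w_txt is 0.
two_term = [w_all * a + w_null * n for a, n in zip(v_all, v_null)]
# Standard CFG form with cfg_coef = cfg_coef_all.
standard = [n + w_all * (a - n) for a, n in zip(v_all, v_null)]
```

With the defaults the null weight is -4.0: the unconditional field is subtracted, which is what pushes the result toward the conditioned prediction.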
CFG Term Classes
JASCO implements three distinct CFG term types:
| Term Class | Conditions Retained | Conditions Dropped | Purpose |
|---|---|---|---|
| AllCFGTerm | All (text + symbolic + wav) | None | Fully conditioned generation |
| TextCFGTerm | Text only | Symbolic (chords, melody) and wav (drums) | Text-only guidance |
| NullCFGTerm | None | All | Unconditional baseline |
Terms with negligible weight (absolute value below 1e-6) are removed to save computation.
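The pruning rule can be sketched as a simple filter over the term weights. This is an illustrative assumption about the logic, not the Audiocraft implementation; the class names are used only as dictionary keys, and `active_terms` is a hypothetical helper.

```python
from typing import Dict

WEIGHT_EPS = 1e-6  # threshold stated above


def active_terms(cfg_coef_all: float, cfg_coef_txt: float) -> Dict[str, float]:
    """Map each CFG term name to its weight, dropping negligible ones."""
    weights = {
        "AllCFGTerm": cfg_coef_all,
        "TextCFGTerm": cfg_coef_txt,
        "NullCFGTerm": 1.0 - cfg_coef_all - cfg_coef_txt,
    }
    # Keep a term only if its absolute weight is not below the threshold.
    return {name: w for name, w in weights.items() if abs(w) >= WEIGHT_EPS}


defaults = active_terms(5.0, 0.0)        # text term has zero weight
conditional_only = active_terms(1.0, 0.0)  # null weight is also zero here
all_three = active_terms(2.0, 1.0)
```

Under the default parameters only the fully conditioned and unconditional terms remain, so the text-only term does not need to be evaluated.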
Design Rationale
- Compositional control: The multi-source decomposition lets users adjust temporal adherence and semantic adherence independently, rather than trading one for the other.
- Backward compatibility: With cfg_coef_txt=0.0, the system reduces to standard two-term CFG, so multi-source CFG is a strict generalization of it.
- Extensibility via kwargs: Additional generation parameters can be passed through **kwargs and are forwarded to the underlying FlowMatchingModel.generate() call, supporting future extensions.
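The kwargs forwarding pattern can be illustrated with toy stand-ins. Everything below is hypothetical: `ToyFlowModel`, `ToyJasco`, and the `extra_option` key are invented for this sketch and do not reproduce Audiocraft's classes or any real generation parameter.

```python
from typing import Any, Dict


class ToyFlowModel:
    """Hypothetical stand-in for the underlying flow matching model."""

    def generate(self, prompt: str, **params: Any) -> Dict[str, Any]:
        # A real model would run the generation process; this one echoes its inputs.
        return {"prompt": prompt, **params}


class ToyJasco:
    """Hypothetical wrapper that stores generation parameters and forwards them."""

    def __init__(self, model: ToyFlowModel) -> None:
        self.model = model
        self.generation_params: Dict[str, Any] = {}
        self.set_generation_params()  # start from the documented defaults

    def set_generation_params(
        self, cfg_coef_all: float = 5.0, cfg_coef_txt: float = 0.0, **kwargs: Any
    ) -> None:
        # Named CFG coefficients plus any extra keyword arguments are stored together.
        self.generation_params = {
            "cfg_coef_all": cfg_coef_all,
            "cfg_coef_txt": cfg_coef_txt,
            **kwargs,
        }

    def generate(self, prompt: str) -> Dict[str, Any]:
        # Every stored parameter is forwarded to the underlying generate() call.
        return self.model.generate(prompt, **self.generation_params)


jasco = ToyJasco(ToyFlowModel())
jasco.set_generation_params(cfg_coef_all=3.0, extra_option=42)  # extra_option is made up
result = jasco.generate("jazz trio")
```

The wrapper never inspects the extra keyword arguments, so a new option only needs support in the underlying model.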