Workflow:Facebookresearch Audiocraft JASCO Conditioned Music Generation

Knowledge Sources	AudioCraft, JASCO Paper, AudioCraft Docs
Domains	Audio_Generation, Music_Generation, Symbolic_Conditioning
Last Updated	2026-02-13 23:00 GMT

Overview

End-to-end process for generating music with fine-grained temporal control using the JASCO model with text, chord, drum, and melody conditioning.

Description

This workflow covers the inference pipeline for JASCO (Joint Audio and Symbolic Conditioning), a flow-matching-based music generation model that supports both global conditioning (text descriptions) and local, time-aligned conditioning (chord progressions, drum patterns, melody contours). Unlike autoregressive models such as MusicGen, JASCO applies flow matching to continuous EnCodec latent representations, which enables more precise temporal alignment between conditioning signals and generated audio. The workflow has three parts: model loading, conditioning input preparation (including chord extraction and melody salience preprocessing), and generation with configurable guidance coefficients.

Usage

Use this workflow when you need fine-grained control over harmonic structure (chords), rhythmic patterns (drums), or melodic contour in generated music. It suits music production work where the required musical structure goes beyond what a text description alone can express.

Execution Steps

Step 1: Environment Setup

Install AudioCraft and the additional dependencies required for JASCO's symbolic conditioning. This includes the chord_extractor tool for extracting chord annotations from audio, and optionally the deep salience model for melody conditioning.

Key considerations:

Base AudioCraft installation required
Chord extraction requires the Chordino/NNLS-Chroma extractor compiled from source
Melody conditioning requires the ismir2017-deepsalience repository with a Python 3.7 environment
A predefined chord-to-index mapping file is provided in the assets directory
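
Before loading a model, it can help to confirm which of these prerequisites are present. The sketch below uses only the standard library. The import name chord_extractor and the mapping file path are assumptions made for illustration, not values confirmed by this page.

```python
import importlib.util
from pathlib import Path


def check_jasco_environment(chords_mapping_path: str) -> dict:
    """Report which JASCO prerequisites appear to be available locally."""
    return {
        # Base AudioCraft package, needed for every JASCO model.
        "audiocraft": importlib.util.find_spec("audiocraft") is not None,
        # Assumed import name of the Python wrapper around Chordino/NNLS-Chroma.
        "chord_extractor": importlib.util.find_spec("chord_extractor") is not None,
        # Chord-to-index mapping pickle from the assets directory.
        "chords_mapping": Path(chords_mapping_path).is_file(),
    }


if __name__ == "__main__":
    # Hypothetical path; point it at the mapping file in your checkout.
    status = check_jasco_environment("assets/chord_to_index_mapping.pkl")
    for name, ok in status.items():
        print(f"{name}: {'found' if ok else 'missing'}")
```

The deep salience environment is not checked here because it lives in a separate Python 3.7 environment.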

Step 2: Load Pretrained JASCO Model

Load a pretrained JASCO model from HuggingFace Hub using the JASCO API. The model requires a chord mapping file that translates chord names to index values used by the chord conditioner.

Available models:

facebook/jasco-chords-drums-400M: chords + drums, 10s generation
facebook/jasco-chords-drums-1B: chords + drums, 10s generation
facebook/jasco-chords-drums-melody-400M: chords + drums + melody, 10s generation
facebook/jasco-chords-drums-melody-1B: chords + drums + melody, 10s generation

Key considerations:

The chords_mapping_path parameter must point to the chord-to-index mapping pickle file
Models are loaded with both the flow matching model and the EnCodec compression model
Model selection determines which conditioning types are supported
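
Model selection can be sketched as a lookup over the four checkpoints listed above. The helper names pick_model and load_jasco are hypothetical. The JASCO.get_pretrained call assumes the chords_mapping_path keyword named in this section, so check it against the installed AudioCraft version before relying on it.

```python
# Conditioning types per checkpoint, as listed in this section.
JASCO_MODELS = {
    "facebook/jasco-chords-drums-400M": {"chords", "drums"},
    "facebook/jasco-chords-drums-1B": {"chords", "drums"},
    "facebook/jasco-chords-drums-melody-400M": {"chords", "drums", "melody"},
    "facebook/jasco-chords-drums-melody-1B": {"chords", "drums", "melody"},
}


def pick_model(required: set, size: str = "400M") -> str:
    """Return the first listed checkpoint of `size` that supports `required`."""
    for name, supported in JASCO_MODELS.items():
        if name.endswith(size) and required <= supported:
            return name
    raise ValueError(f"no {size} checkpoint supports {sorted(required)}")


def load_jasco(name: str, chords_mapping_path: str):
    """Load a pretrained JASCO model (needs AudioCraft and network access)."""
    # Imported lazily so the helpers above work without AudioCraft installed.
    from audiocraft.models import JASCO

    # Assumption: chords_mapping_path is accepted as a keyword argument.
    return JASCO.get_pretrained(name, chords_mapping_path=chords_mapping_path)
```

For example, pick_model({"melody"}, size="1B") selects the melody-capable 1B checkpoint, while a request for only chords falls back to the smaller chords + drums model.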

Step 3: Configure Generation Parameters

Set the JASCO generation parameters that control the flow matching sampling process and classifier-free guidance. JASCO uses a different guidance scheme than autoregressive models, with separate coefficients for overall and text-specific guidance.

Key parameters:

cfg_coef_all: overall classifier-free guidance coefficient (recommended 5.0)
cfg_coef_txt: text-specific guidance coefficient (set to 0.0 when using symbolic conditioning)
Generation duration is model-dependent (typically 10 seconds)
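
To show how two guidance coefficients can interact, the sketch below blends three model predictions: one with all conditions, one with text only, and one unconditional. This is one common multi-source classifier-free guidance formulation. Whether JASCO's implementation matches it term for term is an assumption, and combine_guidance is a hypothetical helper.

```python
def combine_guidance(v_all, v_txt, v_null, cfg_coef_all, cfg_coef_txt):
    """Blend three velocity predictions with two guidance coefficients.

    v_all:  prediction with every condition active
    v_txt:  prediction with only the text condition active
    v_null: prediction with no conditions
    """
    # The unconditional weight is chosen so the three weights sum to 1.
    w_null = 1.0 - cfg_coef_all - cfg_coef_txt
    return [
        cfg_coef_all * a + cfg_coef_txt * t + w_null * n
        for a, t, n in zip(v_all, v_txt, v_null)
    ]


# With cfg_coef_all=5.0 and cfg_coef_txt=0.0 the text-only branch is ignored
# and the result extrapolates away from the unconditional prediction.
guided = combine_guidance([1.0], [0.8], [0.5], cfg_coef_all=5.0, cfg_coef_txt=0.0)
```

Under this formulation, identical predictions from all three branches pass through unchanged, so guidance only has an effect where the conditions actually change the model output.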

Step 4: Prepare Conditioning Inputs

Prepare the multi-modal conditioning inputs for generation. This involves specifying a text description, chord progression, optional drum pattern, and optional melody salience map.

Chord conditioning:

Provide a list of (chord_name, onset_time) tuples
Chord names follow standard notation (e.g., "C", "Am", "F#m7")
Onset times are in seconds, defining when each chord begins
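
The chord format above can be checked before it is passed to the model. The following is a minimal sketch, not part of AudioCraft: validate_chords and chords_to_frames are hypothetical helpers, and the frame rate of 50 and the "N" label for frames before the first chord are illustrative assumptions.

```python
def validate_chords(chords, duration=10.0):
    """Check (chord_name, onset_time) tuples and return them sorted by onset."""
    if not chords:
        raise ValueError("at least one chord is required")
    cleaned = []
    for name, onset in chords:
        if not isinstance(name, str) or not name:
            raise ValueError(f"chord name must be a non-empty string: {name!r}")
        onset = float(onset)
        if not 0.0 <= onset < duration:
            raise ValueError(f"onset {onset} s is outside [0, {duration}) s")
        cleaned.append((name, onset))
    cleaned.sort(key=lambda pair: pair[1])
    return cleaned


def chords_to_frames(chords, duration=10.0, frame_rate=50):
    """Expand chord onsets into one chord label per frame.

    frame_rate is a hypothetical value; frames before the first onset
    are labelled "N" (no chord).
    """
    chords = validate_chords(chords, duration)
    n_frames = int(round(duration * frame_rate))
    labels = []
    idx = -1  # index of the chord currently sounding
    for frame in range(n_frames):
        t = frame / frame_rate
        # Advance to the latest chord whose onset is not after time t.
        while idx + 1 < len(chords) and chords[idx + 1][1] <= t:
            idx += 1
        labels.append(chords[idx][0] if idx >= 0 else "N")
    return labels


progression = [("C", 0.0), ("Am", 2.5), ("F#m7", 5.0), ("G", 7.5)]
frames = chords_to_frames(progression)
```

Each chord holds until the next onset, which is how an onset-only list fully specifies the harmony across the clip.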

Drum conditioning:

Provide drum audio or latent representations
Drum patterns are encoded into latent space for conditioning

Melody conditioning:

Requires pre-extracted salience maps from the deep salience model
Salience maps represent pitch activity over time

Step 5: Run Flow Matching Generation

Execute the generation process using JASCO's flow matching paradigm. Unlike autoregressive generation, flow matching iteratively refines a noise sample into the target latent representation guided by the conditioning signals.

What happens:

Text descriptions are encoded via T5 into global conditioning embeddings
Symbolic conditions (chords, drums, melody) are processed by specialized conditioners
The flow matching model iteratively denoises a random latent sample
Conditioning signals guide the denoising trajectory at each step
The result is a continuous latent representation in EnCodec's latent space
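
The iterative refinement can be illustrated with a toy example that integrates a velocity field from noise toward a target using fixed Euler steps. This is a sketch of the general flow matching sampling idea only. The velocity function here is a stand-in that already knows the target, whereas JASCO uses a learned, conditioned network, and the solver and step count are assumptions.

```python
import random


def toy_velocity(x, t, target):
    """Stand-in for the learned model: velocity of the straight path to `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, steps=20, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division in toy_velocity is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# A three-value vector stands in for a real latent.
latent = euler_sample(target=[0.3, -1.2, 0.8])
```

In the real model the target is unknown, and the conditioning signals are what steer the predicted velocity at each step.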

Step 6: Decode and Save Output

Decode the generated latent representation back to an audio waveform using the EnCodec decoder and save the result with loudness normalization.

Key considerations:

The flow matching output is in continuous latent space (not discrete tokens)
EnCodec decoder maps latents to 32 kHz audio waveforms
Output is saved using audio_write with loudness normalization to -14 dB LUFS
Generated duration is typically 10 seconds for current pretrained models
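
Saving can be sketched with AudioCraft's audio_write helper using the loudness strategy. The wrapper save_output and the helper expected_num_samples are hypothetical names, and the tensor layout noted in the docstring is an assumption to verify against the installed version.

```python
def expected_num_samples(duration_s: float, sample_rate: int = 32000) -> int:
    """Number of waveform samples for a clip of the given length."""
    return int(round(duration_s * sample_rate))


def save_output(stem: str, wav, sample_rate: int):
    """Write one generated clip to `<stem>.wav` with loudness normalization.

    `wav` is assumed to be a [channels, samples] torch tensor on the CPU.
    """
    # Imported lazily so this module loads without AudioCraft installed.
    from audiocraft.data.audio import audio_write

    return audio_write(stem, wav, sample_rate, strategy="loudness")
```

A 10-second clip at 32 kHz should contain expected_num_samples(10.0) samples per channel, which gives a quick sanity check on the decoded waveform before writing it.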
