Workflow:Facebookresearch Audiocraft JASCO Conditioned Music Generation

From Leeroopedia
Knowledge Sources
Domains: Audio_Generation, Music_Generation, Symbolic_Conditioning
Last Updated: 2026-02-13 23:00 GMT

Overview

End-to-end process for generating music with fine-grained temporal control using the JASCO model with text, chord, drum, and melody conditioning.

Description

This workflow covers the inference pipeline for JASCO (Joint Audio and Symbolic Conditioning), a flow-matching-based music generation model that supports both global conditioning (text descriptions) and local, time-varying conditioning (chord progressions, drum patterns, melody contours). Unlike autoregressive models such as MusicGen, JASCO applies continuous flow matching over EnCodec latent representations, which allows closer temporal alignment between the conditioning signals and the generated audio. The workflow includes model loading, conditioning input preparation (including chord extraction and melody salience preprocessing), and generation with configurable guidance coefficients.

Usage

Use this workflow when you need fine-grained control over harmonic structure (chords), rhythmic patterns (drums), or melodic contour. It suits music production workflows where the required musical structure goes beyond what a text description alone can express.

Execution Steps

Step 1: Environment Setup

Install AudioCraft and the additional dependencies required for JASCO's symbolic conditioning. This includes the chord_extractor tool for extracting chord annotations from audio, and optionally the deep salience model for melody conditioning.

Key considerations:

  • Base AudioCraft installation required
  • Chord extraction requires the Chordino/NNLS-Chroma extractor compiled from source
  • Melody conditioning requires the ismir2017-deepsalience repository with a Python 3.7 environment
  • A predefined chord-to-index mapping file is provided in the assets directory
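Before going further, it can help to confirm which of these dependencies are visible to the current interpreter. The following is a minimal sketch, not part of AudioCraft: the helper name check_jasco_environment is hypothetical, and the import names audiocraft and chord_extractor are assumed to match the installed packages. The deep salience model lives in its own environment, so it is not checked here.

```python
import importlib.util
import sys


def check_jasco_environment(modules=("audiocraft", "chord_extractor")):
    """Report which JASCO-related top-level modules are importable here.

    The default module names are assumptions about how the packages are
    installed; adjust them to match your setup.
    """
    report = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located;
        # it does not import the module itself.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, available in check_jasco_environment().items():
        print(f"{name}: {'found' if available else 'missing'}")
```

A missing chord_extractor entry only matters if chords need to be extracted from audio; hand-written chord progressions do not require it.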

Step 2: Load Pretrained JASCO Model

Load a pretrained JASCO model from the Hugging Face Hub using the JASCO API. The model requires a chord mapping file that translates chord names into the index values used by the chord conditioner.

Available models:

  • facebook/jasco-chords-drums-400M: chords + drums, 10s generation
  • facebook/jasco-chords-drums-1B: chords + drums, 10s generation
  • facebook/jasco-chords-drums-melody-400M: chords + drums + melody, 10s generation
  • facebook/jasco-chords-drums-melody-1B: chords + drums + melody, 10s generation

Key considerations:

  • The chords_mapping_path parameter must point to the chord-to-index mapping pickle file
  • Models are loaded with both the flow matching model and the EnCodec compression model
  • Model selection determines which conditioning types are supported
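The loading step might look like the sketch below. It assumes the JASCO class is importable from audiocraft.models and that get_pretrained accepts the chords_mapping_path keyword named above; the default mapping path, the JASCO_MODELS table, and the helpers supports and load_jasco are illustrative names introduced here, not part of AudioCraft.

```python
# Symbolic conditioning accepted by each checkpoint, taken from the model
# list above. Text conditioning applies to all of them.
JASCO_MODELS = {
    "facebook/jasco-chords-drums-400M": {"chords", "drums"},
    "facebook/jasco-chords-drums-1B": {"chords", "drums"},
    "facebook/jasco-chords-drums-melody-400M": {"chords", "drums", "melody"},
    "facebook/jasco-chords-drums-melody-1B": {"chords", "drums", "melody"},
}


def supports(model_id, condition):
    """Return True if the checkpoint accepts the given symbolic condition."""
    return condition in JASCO_MODELS[model_id]


def load_jasco(model_id, chords_mapping_path="assets/chord_to_index_mapping.pkl"):
    """Load a pretrained JASCO checkpoint (hypothetical wrapper).

    The default mapping path is an assumption; point it at the pickle file
    shipped in your AudioCraft checkout.
    """
    if model_id not in JASCO_MODELS:
        raise ValueError(f"Unknown JASCO checkpoint: {model_id}")
    # Imported lazily so the table above is usable without AudioCraft installed.
    from audiocraft.models import JASCO
    return JASCO.get_pretrained(model_id, chords_mapping_path=chords_mapping_path)
```

Checking supports(model_id, "melody") before preparing salience maps avoids preprocessing work for a checkpoint that cannot use it.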

Step 3: Configure Generation Parameters

Set the JASCO generation parameters that control the flow matching sampling process and classifier-free guidance. JASCO uses a different guidance scheme than autoregressive models, with separate coefficients for overall and text-specific guidance.

Key parameters:

  • cfg_coef_all: overall classifier-free guidance coefficient (recommended 5.0)
  • cfg_coef_txt: text-specific guidance coefficient (set to 0.0 when using symbolic conditioning)
  • Generation duration is model-dependent (typically 10 seconds)
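To make the two coefficients concrete, the sketch below shows one common way of blending three model predictions (unconditional, all conditions, text only) in multi-source classifier-free guidance. The weighting (1 - cfg_coef_all - cfg_coef_txt) on the unconditional branch is an assumption made for illustration and is not confirmed by this workflow; combine_guidance and configure are hypothetical helpers. Only set_generation_params with cfg_coef_all and cfg_coef_txt corresponds to the parameters listed above.

```python
def combine_guidance(v_uncond, v_all, v_text, cfg_coef_all=5.0, cfg_coef_txt=0.0):
    """Blend three predictions with an assumed multi-source CFG weighting.

    Each argument is a list of floats standing in for a model output.
    """
    # Assumed form: the three weights sum to 1.
    w_uncond = 1.0 - cfg_coef_all - cfg_coef_txt
    return [
        w_uncond * u + cfg_coef_all * a + cfg_coef_txt * t
        for u, a, t in zip(v_uncond, v_all, v_text)
    ]


def configure(model, cfg_coef_all=5.0, cfg_coef_txt=0.0):
    """Apply the guidance coefficients to a loaded JASCO model."""
    model.set_generation_params(cfg_coef_all=cfg_coef_all, cfg_coef_txt=cfg_coef_txt)
    return model
```

Under this assumed weighting, setting cfg_coef_txt to 0.0 removes the text-only branch entirely, so guidance is driven by the prediction that sees all conditions together.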

Step 4: Prepare Conditioning Inputs

Prepare the multi-modal conditioning inputs for generation. This involves specifying a text description, chord progression, optional drum pattern, and optional melody salience map.

Chord conditioning:

  • Provide a list of (chord_name, onset_time) tuples
  • Chord names follow standard notation (e.g., "C", "Am", "F#m7")
  • Onset times are in seconds, defining when each chord begins
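One way to picture how a list of (chord_name, onset_time) tuples becomes a time-aligned signal is to expand it into one label per frame. The helper below is a hypothetical illustration, not the conditioner used by JASCO: the name chords_to_frames, the frame rate of 50 frames per second, and the "N" no-chord symbol are all assumptions.

```python
def chords_to_frames(chords, duration=10.0, frame_rate=50, no_chord="N"):
    """Expand (chord_name, onset_seconds) tuples into per-frame labels.

    Each chord lasts until the next onset, or until `duration` for the last
    one. Frames before the first onset are filled with `no_chord`.
    """
    onsets = [onset for _, onset in chords]
    if onsets != sorted(onsets):
        raise ValueError("chord onsets must be in ascending order")
    if any(onset < 0 or onset >= duration for onset in onsets):
        raise ValueError("chord onsets must fall within [0, duration)")

    n_frames = int(round(duration * frame_rate))
    frames = [no_chord] * n_frames
    for i, (name, onset) in enumerate(chords):
        end = chords[i + 1][1] if i + 1 < len(chords) else duration
        start_f = int(round(onset * frame_rate))
        end_f = int(round(end * frame_rate))
        # Same-length slice assignment keeps the total frame count fixed.
        frames[start_f:end_f] = [name] * (end_f - start_f)
    return frames


progression = [("C", 0.0), ("Am", 2.0), ("F#m7", 4.0)]
labels = chords_to_frames(progression, duration=6.0, frame_rate=2)
```

Validating ordering and range up front catches the most likely input mistakes (unsorted onsets, or an onset beyond the clip length) before any model call is made.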

Drum conditioning:

  • Provide drum audio or latent representations
  • Drum patterns are encoded into latent space for conditioning

Melody conditioning:

  • Requires pre-extracted salience maps from the deep salience model
  • Salience maps represent pitch activity over time

Step 5: Run Flow Matching Generation

Execute the generation process using JASCO's flow matching paradigm. Unlike autoregressive generation, flow matching iteratively refines a noise sample into the target latent representation guided by the conditioning signals.
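The idea of refining noise into a target can be illustrated with a toy Euler integration. This is a sketch of flow-matching sampling in general, under strong simplifications, and not JASCO's sampler: the target is known in advance, so the exact velocity of the straight-line path x_t = (1 - t) * x0 + t * x1 is used where a real model would call a neural network conditioned on text and symbolic inputs. The function name euler_flow_sample is hypothetical.

```python
import random


def euler_flow_sample(target, steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 with Euler steps."""
    rng = random.Random(seed)
    # Start from a Gaussian noise sample with the same size as the target.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        # Oracle velocity for the straight-line path toward the target:
        # v = (x1 - x_t) / (1 - t). A trained model would predict this
        # from (x_t, t, conditions) without access to the target.
        v = [(x1 - xi) / (1.0 - t) for x1, xi in zip(target, x)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


sample = euler_flow_sample([0.5, -1.0, 2.0], steps=10)
```

Because the oracle velocity is constant along a straight path, Euler steps land on the target up to floating-point error; with a learned velocity field the step count trades speed against accuracy.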

What happens:

  • Text descriptions are encoded via T5 into global conditioning embeddings
  • Symbolic conditions (chords, drums, melody) are processed by specialized conditioners
  • The flow matching model iteratively denoises a random latent sample
  • Conditioning signals guide the denoising trajectory at each step
  • The result is a continuous latent representation in EnCodec's latent space
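In code, the whole step reduces to a single call on the loaded model. The sketch below assumes a generate_music method that accepts descriptions, chords, and progress keyword arguments; generate_clip is a hypothetical wrapper, and the prompt and progression are made-up examples. Drum and melody inputs are omitted because their exact argument formats are not specified in this workflow.

```python
def generate_clip(model, text, chords, progress=True):
    """Generate one clip from a text prompt and a chord progression.

    `model` is expected to be a loaded JASCO model; `chords` is a list of
    (chord_name, onset_seconds) tuples.
    """
    return model.generate_music(
        descriptions=[text],  # one description per batch item
        chords=chords,
        progress=progress,
    )


# Example inputs (illustrative values only).
example_text = "Warm acoustic guitar with soft percussion."
example_chords = [("C", 0.0), ("Am", 2.5), ("F", 5.0), ("G", 7.5)]
# output = generate_clip(model, example_text, example_chords)
```

Passing the description as a one-element list keeps the call batch-shaped, so the returned value should be treated as a batch even when only one clip is requested.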

Step 6: Decode and Save Output

Decode the generated latent representation back to an audio waveform using the EnCodec decoder and save the result with loudness normalization.

Key considerations:

  • The flow matching output is in continuous latent space (not discrete tokens)
  • EnCodec decoder maps latents to 32 kHz audio waveforms
  • Output is saved using audio_write with loudness normalization to -14 LUFS
  • Generated duration is typically 10 seconds for current pretrained models
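Saving a batch of generated clips could be sketched as follows. The sketch assumes audio_write from audiocraft.data.audio with its strategy, loudness_headroom_db, and loudness_compressor arguments, and assumes each batch item is a tensor-like object with a cpu() method. The helper save_clips, the writer injection hook, and the output stem are hypothetical.

```python
def save_clips(wavs, sample_rate, stem="jasco_output", writer=None):
    """Write each clip in a batch to disk with loudness normalization.

    `writer` defaults to AudioCraft's audio_write; it can be replaced,
    for example in tests, by any callable with the same arguments.
    """
    if writer is None:
        # Imported lazily so this helper can be exercised without AudioCraft.
        from audiocraft.data.audio import audio_write
        writer = audio_write

    paths = []
    for idx, wav in enumerate(wavs):
        path = writer(
            f"{stem}_{idx}",          # file stem; the writer adds the extension
            wav.cpu(),                # move the clip off the GPU before writing
            sample_rate,
            strategy="loudness",      # loudness-based normalization
            loudness_headroom_db=14,  # corresponds to the -14 target above
            loudness_compressor=True,
        )
        paths.append(path)
    return paths


# Typical use, assuming `output` is the generated batch and `model` is loaded:
# save_clips(output, model.sample_rate)
```

Writing one file per batch item keeps the helper usable when several descriptions are generated in a single call.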
