Principle:Facebookresearch Audiocraft Temporal Conditioning Preparation

Overview

Temporal Conditioning Preparation is the process of transforming raw multi-modal conditioning inputs -- chord progressions, drum audio, and melody contours -- into the internal representation format expected by JASCO's flow matching model. Each conditioning modality undergoes a distinct preprocessing pipeline that converts human-interpretable inputs (chord labels, audio waveforms, salience matrices) into tensor-based ConditioningAttributes that the model can consume during generation.

Theoretical Background

Symbolic Music Conditioning

JASCO supports symbolic conditioning, where musical structure is specified through discrete symbolic representations rather than raw audio:

Chord conditioning: Chord progressions are specified as a list of (chord_label, start_time) tuples. The preprocessing converts these into per-frame integer sequences using the Chordino chord vocabulary mapping. Each frame at the model's frame rate (typically 50 Hz) is assigned the chord index active at that time point.
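
The conversion from timed chord labels to per-frame indices can be sketched as follows. This is a minimal illustration rather than JASCO's own code: the `CHORD_TO_INDEX` mapping, the `NULL_CHORD` index, and the `chords_to_frames` function are hypothetical stand-ins for the real chord vocabulary and preprocessing, and the 50 Hz frame rate is the typical value quoted above.

```python
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary; the real mapping comes from the chord vocabulary file.
CHORD_TO_INDEX: Dict[str, int] = {"C": 0, "G": 1, "Am": 2, "F": 3}
NULL_CHORD = len(CHORD_TO_INDEX)  # assumed index meaning "no chord"


def chords_to_frames(chords: List[Tuple[str, float]], duration: float,
                     frame_rate: int = 50) -> List[int]:
    """Assign to every frame the index of the chord active at that time."""
    n_frames = round(duration * frame_rate)
    frames = [NULL_CHORD] * n_frames  # frames before the first chord stay null
    ordered = sorted(chords, key=lambda c: c[1])
    for i, (label, start) in enumerate(ordered):
        # A chord lasts until the next chord starts, or until the segment ends.
        end = ordered[i + 1][1] if i + 1 < len(ordered) else duration
        first = max(0, round(start * frame_rate))
        last = min(n_frames, round(end * frame_rate))
        for t in range(first, last):
            frames[t] = CHORD_TO_INDEX[label]
    return frames


frame_chords = chords_to_frames([("C", 0.0), ("G", 1.0), ("Am", 1.5)], duration=2.0)
```

With these inputs a 2-second segment yields 100 frames: the first 50 carry the index for C, the next 25 the index for G, and the last 25 the index for Am.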

Melody conditioning: Melody is specified as a pre-computed salience matrix of shape [B, melody_bins, T], where melody_bins (default 53) represents pitch bins and T is the number of temporal frames. This matrix encodes the pitch content at each time step, typically extracted from a melody using a pitch salience algorithm.
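
To make the expected shape concrete, the sketch below builds a toy salience matrix from a per-frame pitch-bin contour. It is an illustration only: the one-hot encoding, the use of `None` for unvoiced frames, and the `contour_to_salience` name are assumptions, and a real salience matrix would come from a pitch salience extractor rather than hand-written bins. Nested lists stand in for a single batch item of shape [melody_bins, T].

```python
from typing import List, Optional

MELODY_BINS = 53  # default number of pitch bins mentioned above


def contour_to_salience(contour: List[Optional[int]],
                        melody_bins: int = MELODY_BINS) -> List[List[float]]:
    """Return a [melody_bins][T] matrix with 1.0 at the active pitch bin of each frame."""
    n_frames = len(contour)
    salience = [[0.0] * n_frames for _ in range(melody_bins)]
    for t, pitch_bin in enumerate(contour):
        if pitch_bin is None:  # unvoiced frame: leave the whole column at zero
            continue
        if not 0 <= pitch_bin < melody_bins:
            raise ValueError(f"pitch bin {pitch_bin} outside [0, {melody_bins})")
        salience[pitch_bin][t] = 1.0
    return salience


# Four frames: bin 10 twice, one unvoiced frame, then bin 12.
salience = contour_to_salience([10, 10, None, 12])
```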

Audio-Domain Conditioning

Drums conditioning: Drum patterns are provided as raw audio waveforms. The DrumsConditioner processes them in five steps: stem separation with Demucs to isolate the drum track, encoding with EnCodec to obtain continuous latents, quantization with the coarsest codebook, dequantization back to continuous space, and temporal blurring to produce a smooth conditioning signal.
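
The final blurring step can be illustrated in isolation. The sketch below assumes that blurring means replacing each block of consecutive frames with the block mean, which is one simple way to discard fine timing while keeping the overall pattern. The actual DrumsConditioner may differ, and the `temporal_blur` function and its `blur_factor` parameter are hypothetical. A one-dimensional list stands in for a single latent channel.

```python
from typing import List


def temporal_blur(latent: List[float], blur_factor: int) -> List[float]:
    """Replace each block of `blur_factor` frames by the block mean, keeping the length."""
    if blur_factor < 1:
        raise ValueError("blur_factor must be at least 1")
    blurred: List[float] = []
    for start in range(0, len(latent), blur_factor):
        block = latent[start:start + blur_factor]  # the last block may be shorter
        mean = sum(block) / len(block)
        blurred.extend([mean] * len(block))
    return blurred


# Two isolated "hits" at frames 0 and 5 are smeared across their blocks.
blurred = temporal_blur([1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0], blur_factor=3)
```

After blurring, the exact frame of each hit is no longer recoverable within its block, only that the block contained some energy.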

Temporal Alignment

All conditioning signals must be aligned to the model's temporal resolution (frame rate). The preparation step handles:

Padding short signals to the expected sequence length
Trimming long signals to match the segment duration
Converting between time-domain (seconds) and frame-domain (frame indices) representations
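
These three operations can be sketched with a few small helpers. The function names, the zero padding value, and the 10-second segment used in the example are assumptions chosen for illustration, and plain lists stand in for tensors.

```python
from typing import List


def seconds_to_frame(t: float, frame_rate: int = 50) -> int:
    """Map a time in seconds to the nearest frame index."""
    return round(t * frame_rate)


def frame_to_seconds(frame: int, frame_rate: int = 50) -> float:
    """Map a frame index back to its time in seconds."""
    return frame / frame_rate


def fit_to_length(seq: List[int], target_len: int, pad_value: int) -> List[int]:
    """Trim a sequence that is too long, or right-pad one that is too short."""
    if len(seq) >= target_len:
        return seq[:target_len]
    return seq + [pad_value] * (target_len - len(seq))


target_len = seconds_to_frame(10.0)  # frames in an assumed 10-second segment
padded = fit_to_length([1, 2, 3], target_len=5, pad_value=0)
trimmed = fit_to_length([1, 2, 3], target_len=2, pad_value=0)
```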

Key Concepts

Concept	Input Format	Internal Format	Conditioner Class
Chords	`List[Tuple[str, float]]` -- chord labels with start times	`SymbolicCondition(frame_chords=Tensor[T])` -- per-frame integer indices	`ChordsEmbConditioner`
Drums	`Tensor[B, C, T]` -- raw drum audio waveform	`WavCondition` -- wrapped waveform with metadata	`DrumsConditioner`
Melody	`Tensor[B, 53, T]` -- pre-computed salience matrix	`SymbolicCondition(melody=Tensor[B, 53, T])`	`MelodyConditioner`

JASCO Conditioner Architecture

The temporal conditioning system consists of several specialized classes:

ChordsEmbConditioner (jasco_conditioners.py:L36-56): Embeds integer chord indices into continuous vectors using an nn.Embedding layer. Vocabulary size is card + 1 to accommodate a null chord token used during dropout.
DrumsConditioner (jasco_conditioners.py:L59-214): A waveform conditioner that separates the drum stem with Demucs, encodes it to EnCodec latents, quantizes with the coarsest codebook, dequantizes back to continuous space, and applies temporal blurring. Supports embedding caching for training efficiency.
MelodyConditioner (jasco_conditioners.py:L15-33): Projects pre-computed salience matrices through a linear output projection, treating the melody bins as the input dimension.
JascoConditioningProvider (jasco_conditioners.py:L216-300): Orchestrates tokenization and collation of all conditioning types, including null-condition handling for inputs that are not provided.
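
The role of the extra vocabulary entry in ChordsEmbConditioner can be sketched without a deep learning library. The example below assumes that the null chord token takes the index `card` (the last row of a table with `card + 1` rows) and that dropout replaces the whole chord sequence with that token. Both are assumptions for illustration, as are the function names, the tiny embedding dimension, and the random initialization. In the real conditioner the table is an nn.Embedding layer.

```python
import random
from typing import List


def make_embedding_table(card: int, dim: int, seed: int = 0) -> List[List[float]]:
    """Build card + 1 rows: one per chord, plus a final row for the null chord."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(card + 1)]


def drop_chords(frame_chords: List[int], card: int, p: float,
                rng: random.Random) -> List[int]:
    """With probability p, replace every frame with the null chord index `card`."""
    if rng.random() < p:
        return [card] * len(frame_chords)
    return list(frame_chords)


def embed(frame_chords: List[int], table: List[List[float]]) -> List[List[float]]:
    """Look up one embedding vector per frame."""
    return [table[index] for index in frame_chords]


card = 4  # toy vocabulary size
table = make_embedding_table(card, dim=3)
dropped = drop_chords([0, 0, 1, 2], card, p=1.0, rng=random.Random(0))
embedded = embed(dropped, table)
```

Because the null index has its own row, a dropped condition produces a learned vector rather than an out-of-range lookup.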

Design Rationale

Modular conditioning: Each modality has its own conditioner class, making it straightforward to add new conditioning types or modify existing ones.
Graceful degradation: When a conditioning modality is not provided, null conditions are automatically substituted (zero tensors for drums and melody, null chord token for chords), enabling generation with any subset of conditions.
Temporal blurring for drums: The drum latents are temporally blurred to provide a coarse rhythmic pattern rather than frame-exact timing, which gives the model more creative freedom in placing individual drum hits.
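
The substitution described under graceful degradation can be sketched as a helper that fills in whichever modalities the caller omitted. Everything here is hypothetical scaffolding: the dictionary keys, the `fill_null_conditions` name, the `CHORD_CARD` value, and the nested lists used in place of tensors are assumptions, not the JascoConditioningProvider API.

```python
from typing import Any, Dict

MELODY_BINS = 53   # default number of pitch bins
CHORD_CARD = 194   # assumed vocabulary size; the null chord token is index CHORD_CARD


def fill_null_conditions(conditions: Dict[str, Any], n_frames: int,
                         n_samples: int) -> Dict[str, Any]:
    """Return a copy of `conditions` with null values for every missing modality."""
    filled = dict(conditions)
    if filled.get("chords") is None:
        # Null chord token on every frame.
        filled["chords"] = [CHORD_CARD] * n_frames
    if filled.get("melody") is None:
        # All-zero salience matrix of shape [MELODY_BINS][n_frames].
        filled["melody"] = [[0.0] * n_frames for _ in range(MELODY_BINS)]
    if filled.get("drums") is None:
        # Silent mono waveform.
        filled["drums"] = [0.0] * n_samples
    return filled


# Only chords are supplied; melody and drums fall back to null conditions.
conditions = fill_null_conditions({"chords": [0, 0, 1, 1]}, n_frames=4, n_samples=8)
```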
