Principle:Facebookresearch Audiocraft Temporal Conditioning Preparation
Overview
Temporal Conditioning Preparation is the process of transforming raw multi-modal conditioning inputs -- chord progressions, drum audio, and melody contours -- into the internal representation format expected by JASCO's flow matching model. Each conditioning modality undergoes a distinct preprocessing pipeline that converts human-interpretable inputs (chord labels, audio waveforms, salience matrices) into tensor-based ConditioningAttributes that the model can consume during generation.
Theoretical Background
Symbolic Music Conditioning
JASCO supports symbolic conditioning, where musical structure is specified through discrete symbolic representations rather than raw audio:
- Chord conditioning: Chord progressions are specified as a list of (chord_label, start_time) tuples. The preprocessing converts these into per-frame integer sequences using the Chordino chord vocabulary mapping. Each frame at the model's frame rate (typically 50 Hz) is assigned the chord index active at that time point.
- Melody conditioning: Melody is specified as a pre-computed salience matrix of shape [B, melody_bins, T], where melody_bins (default 53) is the number of pitch bins and T is the number of temporal frames. This matrix encodes the pitch content at each time step, typically extracted from a melody using a pitch salience algorithm.
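The chord preprocessing described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not JASCO's implementation: the vocabulary `CHORD_TO_INDEX` is a hypothetical toy mapping (the real indices come from the Chordino chord vocabulary), the function name `chords_to_frames` is invented for this example, and frames before the first chord are assumed to receive a "no chord" label.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

# Hypothetical toy vocabulary for illustration only; the real mapping
# comes from the Chordino chord vocabulary used by JASCO.
CHORD_TO_INDEX: Dict[str, int] = {"N": 0, "C": 1, "G": 2, "Am": 3, "F": 4}


def chords_to_frames(
    chords: List[Tuple[str, float]],
    duration_s: float,
    frame_rate: int = 50,
    no_chord: str = "N",
) -> List[int]:
    """Assign each frame the index of the chord active at that frame's start time."""
    chords = sorted(chords, key=lambda c: c[1])
    starts = [start for _, start in chords]
    n_frames = int(round(duration_s * frame_rate))
    frames = []
    for i in range(n_frames):
        t = i / frame_rate
        k = bisect_right(starts, t) - 1  # last chord starting at or before t
        # Assumption: frames before the first chord get the "no chord" label.
        label = chords[k][0] if k >= 0 else no_chord
        frames.append(CHORD_TO_INDEX[label])
    return frames


# Three seconds of audio at 50 Hz gives 150 frames.
frame_chords = chords_to_frames([("C", 0.0), ("G", 1.0), ("Am", 2.0)], duration_s=3.0)
```

In an actual pipeline the resulting list would be converted to an integer tensor before being wrapped in a condition object.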
Audio-Domain Conditioning
- Drums conditioning: Drum patterns are provided as raw audio waveforms. The DrumsConditioner processes these through a multi-stage pipeline: stem separation using Demucs to isolate the drum track, encoding through EnCodec to obtain continuous latents, quantization with the coarsest codebook, dequantization back to continuous space, and temporal blurring to create a smooth conditioning signal.
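The last stage of this pipeline, temporal blurring, can be illustrated with a simple block average. This sketch assumes that blurring means replacing each group of consecutive frames by the group mean; the source only states that the latents are temporally blurred, so the exact operation, the name `temporal_blur`, and the `blur_factor` parameter are hypothetical. The Demucs and EnCodec stages are omitted because they require the pretrained models.

```python
from typing import List


def temporal_blur(latents: List[List[float]], blur_factor: int) -> List[List[float]]:
    """Replace each block of `blur_factor` frames by the block's per-channel mean.

    `latents` is a [T][C] nested list standing in for a (T, C) tensor.
    The output keeps the same number of frames as the input.
    """
    blurred: List[List[float]] = []
    for start in range(0, len(latents), blur_factor):
        block = latents[start:start + blur_factor]
        n_channels = len(block[0])
        mean = [sum(frame[c] for frame in block) / len(block) for c in range(n_channels)]
        # Repeat the mean once per frame in the block so the length is unchanged.
        blurred.extend(list(mean) for _ in block)
    return blurred


# Five single-channel frames blurred in blocks of two.
smooth = temporal_blur([[0.0], [2.0], [4.0], [6.0], [10.0]], blur_factor=2)
```

Here a trailing partial block is averaged over its own frames, which is one possible choice; padding the sequence before averaging would be another.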
Temporal Alignment
All conditioning signals must be aligned to the model's temporal resolution (frame rate). The preparation step handles:
- Padding short signals to the expected sequence length
- Trimming long signals to match the segment duration
- Converting between time-domain (seconds) and frame-domain (frame indices) representations
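The three alignment operations listed above can be sketched as small helper functions. The helper names and the rounding convention (nearest frame) are assumptions made for this illustration; the source does not specify how fractional frame positions are resolved.

```python
from typing import List


def seconds_to_frame(t_seconds: float, frame_rate: int = 50) -> int:
    """Convert a time in seconds to the nearest frame index (assumed convention)."""
    return int(round(t_seconds * frame_rate))


def frame_to_seconds(frame: int, frame_rate: int = 50) -> float:
    """Convert a frame index back to a time in seconds."""
    return frame / frame_rate


def fit_to_length(frames: List[int], target_len: int, pad_value: int) -> List[int]:
    """Trim a sequence that is too long, or right-pad one that is too short."""
    if len(frames) >= target_len:
        return frames[:target_len]
    return frames + [pad_value] * (target_len - len(frames))


# A 10-second segment at 50 Hz corresponds to 500 frames.
segment_frames = seconds_to_frame(10.0)
padded = fit_to_length([1, 2, 3], target_len=5, pad_value=0)
trimmed = fit_to_length([1, 2, 3, 4], target_len=2, pad_value=0)
```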
Key Concepts
| Concept | Input Format | Internal Format | Conditioner Class |
|---|---|---|---|
| Chords | List[Tuple[str, float]] -- chord labels with start times | SymbolicCondition(frame_chords=Tensor[T]) -- per-frame integer indices | ChordsEmbConditioner |
| Drums | Tensor[B, C, T] -- raw drum audio waveform | WavCondition -- wrapped waveform with metadata | DrumsConditioner |
| Melody | Tensor[B, 53, T] -- pre-computed salience matrix | SymbolicCondition(melody=Tensor[B, 53, T]) | MelodyConditioner |
JASCO Conditioner Architecture
The temporal conditioning system consists of several specialized classes:
- ChordsEmbConditioner (jasco_conditioners.py:L36-56): Embeds integer chord indices into continuous vectors using an nn.Embedding layer. Vocabulary size is card + 1 to accommodate a null chord token used during dropout.
- DrumsConditioner (jasco_conditioners.py:L59-214): A complex waveform conditioner that performs drum stem separation via Demucs, encodes to EnCodec latents, quantizes to the coarsest codebook, decodes back to continuous space, and applies temporal blurring. Supports embedding caching for training efficiency.
- MelodyConditioner (jasco_conditioners.py:L15-33): Projects pre-computed salience matrices through a linear output projection, treating the melody bins as the input dimension.
- JascoConditioningProvider (jasco_conditioners.py:L216-300): Orchestrates tokenization and collation of all conditioning types, including null-condition handling for inputs that are not provided.
Design Rationale
- Modular conditioning: Each modality has its own conditioner class, making it straightforward to add new conditioning types or modify existing ones.
- Graceful degradation: When a conditioning modality is not provided, null conditions are automatically substituted (zero tensors for drums and melody, null chord token for chords), enabling generation with any subset of conditions.
- Temporal blurring for drums: The drum latents are temporally blurred to provide a coarse rhythmic pattern rather than frame-exact timing, which gives the model more creative freedom in placing individual drum hits.
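The graceful-degradation behavior can be sketched as a function that substitutes a null condition for each missing modality. This is an unbatched illustration with nested lists standing in for tensors. The function name, the dictionary keys, the `n_samples` parameter, and the choice of `chord_card` as the null chord index (the extra entry implied by a vocabulary of card + 1) are assumptions for this example, not the library's API.

```python
from typing import Dict, List, Optional


def fill_missing_conditions(
    frame_chords: Optional[List[int]],
    drums: Optional[List[float]],
    melody: Optional[List[List[float]]],
    n_frames: int,
    n_samples: int,
    chord_card: int,
    melody_bins: int = 53,
) -> Dict[str, list]:
    """Return all three conditions, using null values for any that are None."""
    if frame_chords is None:
        # Assumption: the null chord token is the extra index `chord_card`.
        frame_chords = [chord_card] * n_frames
    if drums is None:
        drums = [0.0] * n_samples  # silent waveform
    if melody is None:
        melody = [[0.0] * n_frames for _ in range(melody_bins)]  # empty salience
    return {"frame_chords": frame_chords, "drums": drums, "melody": melody}


# Generation conditioned on chords only: drums and melody fall back to zeros.
conditions = fill_missing_conditions(
    frame_chords=[1, 1, 2, 2], drums=None, melody=None,
    n_frames=4, n_samples=8, chord_card=10,
)
```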