Principle:Facebookresearch Audiocraft Temporal Conditioning Preparation

From Leeroopedia

Overview

Temporal Conditioning Preparation is the process of transforming raw multi-modal conditioning inputs -- chord progressions, drum audio, and melody contours -- into the internal representation format expected by JASCO's flow matching model. Each conditioning modality undergoes a distinct preprocessing pipeline that converts human-interpretable inputs (chord labels, audio waveforms, salience matrices) into tensor-based ConditioningAttributes that the model can consume during generation.

Theoretical Background

Symbolic Music Conditioning

JASCO supports symbolic conditioning, where musical structure is specified through discrete symbolic representations rather than raw audio:

  • Chord conditioning: Chord progressions are specified as a list of (chord_label, start_time) tuples. The preprocessing converts these into per-frame integer sequences using the Chordino chord vocabulary mapping. Each frame at the model's frame rate (typically 50 Hz) is assigned the chord index active at that time point.
  • Melody conditioning: Melody is specified as a pre-computed salience matrix of shape [B, melody_bins, T], where B is the batch size, melody_bins (default 53) is the number of pitch bins, and T is the number of temporal frames. Each column of the matrix encodes the pitch content at one time step; the matrix is typically extracted from a melody with a pitch salience algorithm.
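The chord preprocessing described above can be sketched in a few lines. This is a minimal illustration, not the Audiocraft implementation: the function name `chords_to_frames`, the `TOY_VOCAB` mapping, and the `no_chord_index` fallback are all hypothetical, and the real Chordino vocabulary mapping is much larger.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


def chords_to_frames(
    chords: List[Tuple[str, float]],
    chord_to_index: Dict[str, int],
    duration_s: float,
    frame_rate: int = 50,
    no_chord_index: int = 0,
) -> List[int]:
    """Assign each frame the index of the chord active at the frame's start time."""
    chords = sorted(chords, key=lambda c: c[1])
    starts = [start for _, start in chords]
    n_frames = int(round(duration_s * frame_rate))
    frames = []
    for f in range(n_frames):
        t = f / frame_rate
        # Position of the last chord whose start_time <= t (-1 if none yet).
        pos = bisect_right(starts, t) - 1
        if pos < 0:
            # Assumption: frames before the first chord get a "no chord" index.
            frames.append(no_chord_index)
        else:
            frames.append(chord_to_index[chords[pos][0]])
    return frames


# Hypothetical toy vocabulary, standing in for the Chordino mapping.
TOY_VOCAB = {"N": 0, "C": 1, "Am": 2, "F": 3, "G": 4}

# A low frame rate (4 Hz) keeps the output short enough to read.
frames = chords_to_frames(
    [("C", 0.0), ("Am", 0.5), ("F", 1.0)], TOY_VOCAB, duration_s=1.5, frame_rate=4
)
print(frames)
```

At the 50 Hz frame rate mentioned above, the same call would produce 75 frames for a 1.5-second segment, with each chord held until the next start time.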

Audio-Domain Conditioning

  • Drums conditioning: Drum patterns are provided as raw audio waveforms. The DrumsConditioner processes them in five stages: (1) stem separation with Demucs to isolate the drum track, (2) encoding with EnCodec to obtain continuous latents, (3) quantization using only the coarsest codebook, (4) dequantization back to continuous space, and (5) temporal blurring to produce a smooth conditioning signal.
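The final blurring stage can be illustrated in isolation. The sketch below assumes that blurring is done by averaging latents over fixed-size blocks of frames and repeating the mean; this is one plausible scheme, not a confirmed description of what DrumsConditioner does. The name `temporal_blur` and the parameter `blurring_factor` are hypothetical, and the Demucs and EnCodec stages are omitted entirely.

```python
from typing import List


def temporal_blur(latents: List[List[float]], blurring_factor: int) -> List[List[float]]:
    """Replace every block of `blurring_factor` frames by the block mean.

    `latents` is a list of T frames, each a list of feature values.
    The output keeps the same length T, so it stays frame-aligned.
    """
    if blurring_factor < 1:
        raise ValueError("blurring_factor must be >= 1")
    blurred = []
    for start in range(0, len(latents), blurring_factor):
        block = latents[start:start + blurring_factor]  # last block may be shorter
        dim = len(block[0])
        mean = [sum(frame[d] for frame in block) / len(block) for d in range(dim)]
        # Repeat the block mean once per original frame in the block.
        blurred.extend(list(mean) for _ in block)
    return blurred


# Five frames of 1-dimensional toy latents, blurred in blocks of two frames.
toy_latents = [[0.0], [2.0], [4.0], [6.0], [10.0]]
blurred = temporal_blur(toy_latents, blurring_factor=2)
print(blurred)
```

Within each block the exact position of a drum hit is no longer recoverable, which is the intended effect: only the coarse rhythmic envelope survives.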

Temporal Alignment

All conditioning signals must be aligned to the model's temporal resolution (frame rate). The preparation step handles:

  • Padding short signals to the expected sequence length
  • Trimming long signals to match the segment duration
  • Converting between time-domain (seconds) and frame-domain (frame indices) representations
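The three alignment operations listed above are simple enough to sketch directly. The helper names `seconds_to_frames` and `fit_to_length` are hypothetical, and rounding to the nearest frame is an assumption; an implementation could equally truncate.

```python
from typing import List


def seconds_to_frames(seconds: float, frame_rate: int = 50) -> int:
    """Convert a time in seconds to a frame index (assumes nearest-frame rounding)."""
    return int(round(seconds * frame_rate))


def fit_to_length(frames: List[int], target_len: int, pad_value: int) -> List[int]:
    """Trim a per-frame sequence that is too long, or right-pad one that is too short."""
    if len(frames) >= target_len:
        return frames[:target_len]
    return frames + [pad_value] * (target_len - len(frames))


# A 10-second segment at 50 Hz spans 500 frames.
segment_frames = seconds_to_frames(10.0)

padded = fit_to_length([1, 2, 3], target_len=5, pad_value=0)
trimmed = fit_to_length([1, 2, 3, 4], target_len=2, pad_value=0)
print(segment_frames, padded, trimmed)
```

The choice of `pad_value` depends on the modality: for chords it would plausibly be a null or no-chord index, for melody and drums a zero frame.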

Key Concepts

| Concept | Input Format | Internal Format | Conditioner Class |
|---|---|---|---|
| Chords | List[Tuple[str, float]] -- chord labels with start times | SymbolicCondition(frame_chords=Tensor[T]) -- per-frame integer indices | ChordsEmbConditioner |
| Drums | Tensor[B, C, T] -- raw drum audio waveform | WavCondition -- wrapped waveform with metadata | DrumsConditioner |
| Melody | Tensor[B, 53, T] -- pre-computed salience matrix | SymbolicCondition(melody=Tensor[B, 53, T]) | MelodyConditioner |

JASCO Conditioner Architecture

The temporal conditioning system consists of several specialized classes:

  • ChordsEmbConditioner (jasco_conditioners.py:L36-56): Embeds integer chord indices into continuous vectors using an nn.Embedding layer. Vocabulary size is card + 1 to accommodate a null chord token used during dropout.
  • DrumsConditioner (jasco_conditioners.py:L59-214): A complex waveform conditioner that performs drum stem separation via Demucs, encodes to EnCodec latents, quantizes to the coarsest codebook, decodes back to continuous space, and applies temporal blurring. Supports embedding caching for training efficiency.
  • MelodyConditioner (jasco_conditioners.py:L15-33): Projects pre-computed salience matrices through a linear output projection, treating the melody bins as the input dimension.
  • JascoConditioningProvider (jasco_conditioners.py:L216-300): Orchestrates tokenization and collation of all conditioning types, including null-condition handling for inputs that are not provided.

Design Rationale

  • Modular conditioning: Each modality has its own conditioner class, making it straightforward to add new conditioning types or modify existing ones.
  • Graceful degradation: When a conditioning modality is not provided, null conditions are automatically substituted (zero tensors for drums and melody, null chord token for chords), enabling generation with any subset of conditions.
  • Temporal blurring for drums: The drum latents are temporally blurred to provide a coarse rhythmic pattern rather than frame-exact timing, which gives the model more creative freedom in placing individual drum hits.
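The null-condition substitution behind graceful degradation can be sketched as follows. This is an illustration under stated assumptions: the function `fill_null_conditions` is hypothetical, the null chord token is assumed to be the extra index `chords_card` (consistent with a vocabulary of size card + 1, but not confirmed here), and nested Python lists stand in for tensors. Drums would be handled analogously with a zero waveform.

```python
from typing import Dict, List, Optional


def fill_null_conditions(
    chords: Optional[List[int]],
    melody: Optional[List[List[float]]],
    n_frames: int,
    chords_card: int,
    melody_bins: int = 53,
) -> Dict[str, list]:
    """Substitute null conditions for any modality that was not provided."""
    if chords is None:
        # Assumption: the null chord token is the extra index `chords_card`.
        chords = [chords_card] * n_frames
    if melody is None:
        # All-zero salience matrix of shape [melody_bins, n_frames].
        melody = [[0.0] * n_frames for _ in range(melody_bins)]
    return {"frame_chords": chords, "melody": melody}


# Only chords are provided; melody falls back to zeros.
# chords_card=10 is a toy value chosen for the example.
out = fill_null_conditions(chords=[1, 1, 2, 2], melody=None, n_frames=4, chords_card=10)
print(out["frame_chords"], len(out["melody"]), len(out["melody"][0]))

# Nothing is provided; every modality is replaced by its null condition.
empty = fill_null_conditions(chords=None, melody=None, n_frames=4, chords_card=10)
print(empty["frame_chords"])
```

Because every modality always resolves to a tensor of the expected shape, the downstream model sees a uniform interface regardless of which conditions the caller supplied.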
