
Implementation:Facebookresearch Audiocraft MusicGen prepare tokens and attributes

From Leeroopedia

Summary

MusicGen._prepare_tokens_and_attributes is a private method that transforms raw user inputs (text descriptions, optional melody waveforms, and optional audio prompts) into the structured ConditioningAttributes objects and optional prompt tokens required by the language model for generation. It handles melody conditioning setup with WavCondition objects and encodes audio prompts into discrete token representations.

API Signature

@torch.no_grad()
def _prepare_tokens_and_attributes(
    self,
    descriptions: Sequence[Optional[str]],
    prompt: Optional[torch.Tensor],
    melody_wavs: Optional[MelodyList] = None,
) -> Tuple[List[ConditioningAttributes], Optional[torch.Tensor]]

Parameters

  • descriptions (Sequence[Optional[str]], required) -- A list of text descriptions used as text conditioning. Each element corresponds to one sample in the batch. None entries indicate unconditional generation for that sample.
  • prompt (Optional[torch.Tensor], required) -- A batch of waveforms of shape [B, C, T] used for audio continuation. None if no continuation is desired.
  • melody_wavs (Optional[MelodyList], default None) -- A list of melody waveform tensors, each of shape [C, T]. Entries can be None for samples that do not use melody conditioning. Only supported by the musicgen-melody model variant.
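
For concreteness, a hedged sketch of inputs that satisfy these shapes; the 32 kHz sample rate and the random tensors are illustrative assumptions, not values from the source:

import torch

descriptions = ['upbeat jazz piano', None]        # None -> unconditional generation for that sample
prompt = torch.randn(2, 1, 32000 * 2)             # [B, C, T]: 2 s of mono audio, assuming 32 kHz
melody_wavs = [torch.randn(1, 32000 * 5), None]   # each entry [C, T]; None skips melody conditioning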

Return Value

  • Type: Tuple[List[ConditioningAttributes], Optional[torch.Tensor]]
  • Description: A tuple of (1) a list of ConditioningAttributes objects, one per batch sample, with populated text and wav fields, and (2) an optional prompt tokens tensor of shape [B, K, T] (K is the number of codebooks), or None if no audio prompt was provided.
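
For intuition, a rough sketch of the returned shapes given a loaded model, assuming the standard 32 kHz EnCodec compression model with 4 codebooks at a 50 Hz frame rate (these figures depend on the checkpoint):

attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['lo-fi beat'],
    prompt=torch.randn(1, 1, 32000 * 3),   # assumed: 3 s of mono audio at 32 kHz
    melody_wavs=None,
)
len(attributes)       # 1 -- one ConditioningAttributes per description
prompt_tokens.shape   # torch.Size([1, 4, 150]) -> [B, K=4 codebooks, 50 frames/s * 3 s]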

Source Location

  • File: audiocraft/models/musicgen.py, lines 194-249
  • Class: MusicGen (extends BaseGenModel)
  • Import: Private method on a MusicGen instance (not directly importable)

Internal Workflow

The method proceeds through the following steps:

Step 1: Build Text Conditioning Attributes

Creates a list of ConditioningAttributes objects, one per description, with the text field populated:

attributes = [
    ConditioningAttributes(text={'description': description})
    for description in descriptions
]

Step 2: Handle Melody/Style Conditioning

If melody_wavs is None, each attribute receives a zero-valued WavCondition placeholder:

attr.wav['self_wav'] = WavCondition(
    torch.zeros((1, 1, 1), device=self.device),
    torch.tensor([0], device=self.device),
    sample_rate=[self.sample_rate],
    path=[None]
)

If melody_wavs is provided, the method:

  1. Validates that the model has a 'self_wav' conditioner (raises RuntimeError if not).
  2. Asserts that the number of melody waveforms matches the number of descriptions.
  3. For each sample, wraps the melody tensor in a WavCondition with proper shape, device placement, length tracking, and sample rate annotation. Samples with None melody receive the zero placeholder.
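
A sketch of that per-sample wrapping, consistent with the steps above (variable names follow the method's internals):

for attr, melody in zip(attributes, melody_wavs):
    if melody is None:
        # samples without a melody get the same zero placeholder as above
        attr.wav['self_wav'] = WavCondition(
            torch.zeros((1, 1, 1), device=self.device),
            torch.tensor([0], device=self.device),
            sample_rate=[self.sample_rate],
            path=[None])
    else:
        attr.wav['self_wav'] = WavCondition(
            melody[None].to(device=self.device),                    # add batch dim: [1, C, T]
            torch.tensor([melody.shape[-1]], device=self.device),   # track the valid length
            sample_rate=[self.sample_rate],
            path=[None])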

Step 3: Encode Audio Prompt

If prompt is not None:

  1. Moves the prompt tensor to the model device.
  2. Encodes the prompt waveform using the compression model: prompt_tokens, scale = self.compression_model.encode(prompt).
  3. Asserts that scale is None (the MusicGen EnCodec variant does not use renormalization).

If no prompt is provided, prompt_tokens is set to None.
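
A sketch of this prompt-handling branch, following the steps above:

if prompt is not None:
    prompt = prompt.to(self.device)
    prompt_tokens, scale = self.compression_model.encode(prompt)
    assert scale is None   # MusicGen's EnCodec variant does not renormalize
else:
    prompt_tokens = None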

Key Data Classes

  • ConditioningAttributes (audiocraft/modules/conditioners.py, lines 78-126) -- Fields: text: Dict[str, Optional[str]], wav: Dict[str, WavCondition], joint_embed: Dict[str, JointEmbedCondition], symbolic: Dict[str, SymbolicCondition].
  • WavCondition (audiocraft/modules/conditioners.py, lines 55-60) -- Fields: wav: torch.Tensor, length: torch.Tensor, sample_rate: List[int], path: List[Optional[str]], seek_time: List[Optional[float]].
  • ChromaExtractor (audiocraft/modules/chroma.py, lines 16-66) -- Extracts 12-bin chromagram features from audio using STFT and librosa chroma filter banks. Used downstream by the ChromaStemConditioner when processing the self_wav condition.
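
To make the data flow concrete, a minimal sketch of one populated attribute, assuming a 1-second mono melody at 32 kHz (illustrative values only):

import torch
from audiocraft.modules.conditioners import ConditioningAttributes, WavCondition

attr = ConditioningAttributes(text={'description': 'upbeat jazz piano'})
attr.wav['self_wav'] = WavCondition(
    wav=torch.randn(1, 1, 32000),      # [1, C, T] melody waveform with a batch dimension
    length=torch.tensor([32000]),      # valid length in samples
    sample_rate=[32000],
    path=[None],
)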

Callers

This method is called by the following high-level generation methods:

  • MusicGen.generate() via BaseGenModel.generate() -- text-only generation, passes melody_wavs=None.
  • MusicGen.generate_with_chroma() -- melody-conditioned generation, passes melody waveforms after sample rate conversion.
  • MusicGen.generate_continuation() via BaseGenModel.generate_continuation() -- continuation generation, passes encoded prompt.
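
For orientation, a hedged sketch of these public entry points; the checkpoint name and generation parameters are typical values, and the audio file paths are hypothetical:

import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)

# 1. Text-only: generate() reaches _prepare_tokens_and_attributes with prompt=None, melody_wavs=None
wav = model.generate(['upbeat jazz piano'])

# 2. Melody-conditioned: generate_with_chroma() resamples the melody and forwards it as melody_wavs
melody, sr = torchaudio.load('melody.wav')   # hypothetical file
wav = model.generate_with_chroma(['upbeat jazz piano'], melody[None], sr)

# 3. Continuation: generate_continuation() forwards the waveform as the prompt to be tokenized
prompt, sr = torchaudio.load('prompt.wav')   # hypothetical file
wav = model.generate_continuation(prompt[None], prompt_sample_rate=sr,
                                  descriptions=['upbeat jazz piano'])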

Example Usage

# This method is called internally; typical user code does not call it directly.
# However, for illustration:

# Text-only conditioning (called by model.generate())
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=None,
    melody_wavs=None,
)
# attributes: [ConditioningAttributes(text={'description': 'upbeat jazz piano'}, wav={'self_wav': ...})]
# prompt_tokens: None

# Melody conditioning (called by model.generate_with_chroma())
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=None,
    melody_wavs=[melody_tensor],  # [C, T] tensor
)
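
# Audio continuation (called by model.generate_continuation()); prompt_wav is an
# assumed [B, C, T] waveform already at the model's sample rate.
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=prompt_wav,
    melody_wavs=None,
)
# prompt_tokens: [B, K, T] discrete codes from the compression model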

Dependencies

  • torch - Tensor operations, device placement
  • transformers - T5 encoder used downstream by the condition provider
  • librosa - Chroma filter banks used by ChromaExtractor
  • torchaudio - Spectrogram computation in ChromaExtractor
