
Implementation:Facebookresearch Audiocraft MusicGen prepare tokens and attributes

From Leeroopedia

Summary

MusicGen._prepare_tokens_and_attributes is a private method that transforms raw user inputs (text descriptions, optional melody waveforms, and optional audio prompts) into the structured ConditioningAttributes objects and optional prompt tokens required by the language model for generation. It handles melody conditioning setup with WavCondition objects and encodes audio prompts into discrete token representations.

API Signature

@torch.no_grad()
def _prepare_tokens_and_attributes(
    self,
    descriptions: Sequence[Optional[str]],
    prompt: Optional[torch.Tensor],
    melody_wavs: Optional[MelodyList] = None,
) -> Tuple[List[ConditioningAttributes], Optional[torch.Tensor]]

Parameters

  • descriptions (Sequence[Optional[str]], required) -- A list of text descriptions used as text conditioning. Each element corresponds to one sample in the batch. None entries indicate unconditional generation for that sample.
  • prompt (Optional[torch.Tensor], required) -- A batch of waveforms of shape [B, C, T] used for audio continuation. None if no continuation is desired.
  • melody_wavs (Optional[MelodyList], default None) -- A list of melody waveform tensors, each of shape [C, T]. Entries can be None for samples that do not use melody conditioning. Only supported by the musicgen-melody model variant.
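
For concreteness, a hedged sketch of inputs that satisfy these shapes; the 32 kHz sample rate and the random tensors are illustrative assumptions, not values from the source:

import torch

descriptions = ['upbeat jazz piano', None]        # None -> unconditional generation for that sample
prompt = torch.randn(2, 1, 32000 * 2)             # [B, C, T]: 2 s of mono audio, assuming 32 kHz
melody_wavs = [torch.randn(1, 32000 * 5), None]   # each entry [C, T]; None skips melody conditioning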

Return Value

  • Type: Tuple[List[ConditioningAttributes], Optional[torch.Tensor]]
  • Description: A tuple of (1) a list of ConditioningAttributes objects, one per batch sample, with populated text and wav fields, and (2) an optional prompt tokens tensor of shape [B, K, T] (K is the number of codebooks), or None if no audio prompt was provided.
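
For intuition, a rough sketch of the returned shapes given a loaded model, assuming the standard 32 kHz EnCodec compression model with 4 codebooks at a 50 Hz frame rate (these figures depend on the checkpoint):

attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['lo-fi beat'],
    prompt=torch.randn(1, 1, 32000 * 3),   # assumed: 3 s of mono audio at 32 kHz
    melody_wavs=None,
)
len(attributes)       # 1 -- one ConditioningAttributes per description
prompt_tokens.shape   # torch.Size([1, 4, 150]) -> [B, K=4 codebooks, 50 frames/s * 3 s]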

Source Location

  • File: audiocraft/models/musicgen.py, lines 194-249
  • Class: MusicGen (extends BaseGenModel)
  • Import: Private method on a MusicGen instance (not directly importable)

Internal Workflow

The method proceeds through the following steps:

Step 1: Build Text Conditioning Attributes

Creates a list of ConditioningAttributes objects, one per description, with the text field populated:

attributes = [
    ConditioningAttributes(text={'description': description})
    for description in descriptions
]

Step 2: Handle Melody/Style Conditioning

If melody_wavs is None, each attribute receives a zero-valued WavCondition placeholder:

attr.wav['self_wav'] = WavCondition(
    torch.zeros((1, 1, 1), device=self.device),
    torch.tensor([0], device=self.device),
    sample_rate=[self.sample_rate],
    path=[None]
)

If melody_wavs is provided, the method:

  1. Validates that the model has a 'self_wav' conditioner (raises RuntimeError if not).
  2. Asserts that the number of melody waveforms matches the number of descriptions.
  3. For each sample, wraps the melody tensor in a WavCondition with proper shape, device placement, length tracking, and sample rate annotation. Samples with None melody receive the zero placeholder.
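
A sketch of that per-sample wrapping, consistent with the steps above (variable names follow the method's internals):

for attr, melody in zip(attributes, melody_wavs):
    if melody is None:
        # samples without a melody get the same zero placeholder as above
        attr.wav['self_wav'] = WavCondition(
            torch.zeros((1, 1, 1), device=self.device),
            torch.tensor([0], device=self.device),
            sample_rate=[self.sample_rate],
            path=[None])
    else:
        attr.wav['self_wav'] = WavCondition(
            melody[None].to(device=self.device),                    # add batch dim: [1, C, T]
            torch.tensor([melody.shape[-1]], device=self.device),   # track the valid length
            sample_rate=[self.sample_rate],
            path=[None])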

Step 3: Encode Audio Prompt

If prompt is not None:

  1. Moves the prompt tensor to the model device.
  2. Encodes the prompt waveform using the compression model: prompt_tokens, scale = self.compression_model.encode(prompt).
  3. Asserts that scale is None (the MusicGen EnCodec variant does not use renormalization).

If no prompt is provided, prompt_tokens is set to None.
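
A sketch of this prompt-handling branch, following the steps above:

if prompt is not None:
    prompt = prompt.to(self.device)
    prompt_tokens, scale = self.compression_model.encode(prompt)
    assert scale is None   # MusicGen's EnCodec variant does not renormalize
else:
    prompt_tokens = None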

Key Data Classes

  • ConditioningAttributes (audiocraft/modules/conditioners.py, lines 78-126) -- Fields: text: Dict[str, Optional[str]], wav: Dict[str, WavCondition], joint_embed: Dict[str, JointEmbedCondition], symbolic: Dict[str, SymbolicCondition].
  • WavCondition (audiocraft/modules/conditioners.py, lines 55-60) -- Fields: wav: torch.Tensor, length: torch.Tensor, sample_rate: List[int], path: List[Optional[str]], seek_time: List[Optional[float]].
  • ChromaExtractor (audiocraft/modules/chroma.py, lines 16-66) -- Extracts 12-bin chromagram features from audio using STFT and librosa chroma filter banks. Used downstream by the ChromaStemConditioner when processing the self_wav condition.
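
To make the data flow concrete, a minimal sketch of one populated attribute, assuming a 1-second mono melody at 32 kHz (illustrative values only):

import torch
from audiocraft.modules.conditioners import ConditioningAttributes, WavCondition

attr = ConditioningAttributes(text={'description': 'upbeat jazz piano'})
attr.wav['self_wav'] = WavCondition(
    wav=torch.randn(1, 1, 32000),      # [1, C, T] melody waveform with a batch dimension
    length=torch.tensor([32000]),      # valid length in samples
    sample_rate=[32000],
    path=[None],
)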

Callers

This method is called by the following high-level generation methods:

  • MusicGen.generate() via BaseGenModel.generate() -- text-only generation, passes melody_wavs=None.
  • MusicGen.generate_with_chroma() -- melody-conditioned generation, passes melody waveforms after sample rate conversion.
  • MusicGen.generate_continuation() via BaseGenModel.generate_continuation() -- continuation generation, passes encoded prompt.
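
For orientation, a hedged sketch of these public entry points; the checkpoint name and generation parameters are typical values, and the audio file paths are hypothetical:

import torchaudio
from audiocraft.models import MusicGen

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=8)

# 1. Text-only: generate() reaches _prepare_tokens_and_attributes with prompt=None, melody_wavs=None
wav = model.generate(['upbeat jazz piano'])

# 2. Melody-conditioned: generate_with_chroma() resamples the melody and forwards it as melody_wavs
melody, sr = torchaudio.load('melody.wav')   # hypothetical file
wav = model.generate_with_chroma(['upbeat jazz piano'], melody[None], sr)

# 3. Continuation: generate_continuation() forwards the waveform as the prompt to be tokenized
prompt, sr = torchaudio.load('prompt.wav')   # hypothetical file
wav = model.generate_continuation(prompt[None], prompt_sample_rate=sr,
                                  descriptions=['upbeat jazz piano'])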

Example Usage

# This method is called internally; typical user code does not call it directly.
# However, for illustration:

# Text-only conditioning (called by model.generate())
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=None,
    melody_wavs=None,
)
# attributes: [ConditioningAttributes(text={'description': 'upbeat jazz piano'}, wav={'self_wav': ...})]
# prompt_tokens: None

# Melody conditioning (called by model.generate_with_chroma())
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=None,
    melody_wavs=[melody_tensor],  # [C, T] tensor
)
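
# Audio continuation (called by model.generate_continuation()); prompt_wav is an
# assumed [B, C, T] waveform already at the model's sample rate.
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=prompt_wav,
    melody_wavs=None,
)
# prompt_tokens: [B, K, T] discrete codes from the compression model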

Dependencies

  • torch - Tensor operations, device placement
  • transformers - T5 encoder used downstream by the condition provider
  • librosa - Chroma filter banks used by ChromaExtractor
  • torchaudio - Spectrogram computation in ChromaExtractor
