Implementation: Facebookresearch Audiocraft MusicGen prepare tokens and attributes
Summary
MusicGen._prepare_tokens_and_attributes is a private method that transforms raw user inputs (text descriptions, optional melody waveforms, and optional audio prompts) into the structured ConditioningAttributes objects and optional prompt tokens required by the language model for generation. It handles melody conditioning setup with WavCondition objects and encodes audio prompts into discrete token representations.
API Signature
@torch.no_grad()
def _prepare_tokens_and_attributes(
    self,
    descriptions: Sequence[Optional[str]],
    prompt: Optional[torch.Tensor],
    melody_wavs: Optional[MelodyList] = None,
) -> Tuple[List[ConditioningAttributes], Optional[torch.Tensor]]
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `descriptions` | `Sequence[Optional[str]]` | (required) | A list of text descriptions used as text conditioning, one per sample in the batch. `None` entries indicate unconditional generation for that sample. |
| `prompt` | `Optional[torch.Tensor]` | (required) | A batch of waveforms of shape `[B, C, T]` used for audio continuation. `None` if no continuation is desired. |
| `melody_wavs` | `Optional[MelodyList]` | `None` | A list of melody waveform tensors, each of shape `[C, T]`. Entries can be `None` for samples that do not use melody conditioning. Only supported by the `musicgen-melody` model variant. |
Return Value
| Type | Description |
|---|---|
| `Tuple[List[ConditioningAttributes], Optional[torch.Tensor]]` | A tuple of (1) a list of `ConditioningAttributes` objects, one per batch sample, with populated `text` and `wav` fields, and (2) an optional prompt-token tensor of shape `[B, K, T]`, or `None` if no audio prompt was provided. |
Source Location
- File: `audiocraft/models/musicgen.py`, lines 194-249
- Class: `MusicGen` (extends `BaseGenModel`)
- Import: private method on a `MusicGen` instance (not directly importable)
Internal Workflow
The method proceeds through the following steps:
Step 1: Build Text Conditioning Attributes
Creates a list of ConditioningAttributes objects, one per description, with the text field populated:
attributes = [
    ConditioningAttributes(text={'description': description})
    for description in descriptions
]
Step 2: Handle Melody/Style Conditioning
If melody_wavs is None, each attribute receives a zero-valued WavCondition placeholder:
attr.wav['self_wav'] = WavCondition(
    torch.zeros((1, 1, 1), device=self.device),
    torch.tensor([0], device=self.device),
    sample_rate=[self.sample_rate],
    path=[None]
)
If `melody_wavs` is provided, the method:
- Validates that the model has a `'self_wav'` conditioner (raises `RuntimeError` if not).
- Asserts that the number of melody waveforms matches the number of descriptions.
- For each sample, wraps the melody tensor in a `WavCondition` with proper shape, device placement, length tracking, and sample rate annotation. Samples with `None` melody receive the zero placeholder.
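The branching above can be sketched with plain dictionaries standing in for `WavCondition` tensors (the helper `attach_melody` and its dict layout are illustrative, not audiocraft code; in the real method the `'self_wav'` entry lives on `attr.wav` and holds tensors on the model device):

```python
def attach_melody(attributes, melody_wavs, sample_rate=32000):
    """Sketch of Step 2: pair each sample with its melody or a zero placeholder."""
    n = len(attributes)
    if melody_wavs is None:
        melody_wavs = [None] * n
    # The real method asserts one melody entry per description.
    assert len(melody_wavs) == n, "one melody entry per description"
    for attr, melody in zip(attributes, melody_wavs):
        if melody is None:
            # Zero-valued placeholder, analogous to torch.zeros((1, 1, 1))
            # with a length of 0.
            attr['self_wav'] = {'wav': [[[0.0]]], 'length': [0],
                                'sample_rate': [sample_rate], 'path': [None]}
        else:
            attr['self_wav'] = {'wav': [melody], 'length': [len(melody)],
                                'sample_rate': [sample_rate], 'path': [None]}
    return attributes

attrs = [{}, {}]
attach_melody(attrs, [[0.1, 0.2, 0.3], None])
```

Note that the sample-rate annotation lets the downstream conditioner resample the melody before chroma extraction; the zero placeholder keeps the batch shape uniform for samples without melody conditioning.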
Step 3: Encode Audio Prompt
If prompt is not None:
- Moves the prompt tensor to the model device.
- Encodes the prompt waveform using the compression model: `prompt_tokens, scale = self.compression_model.encode(prompt)`.
- Asserts that `scale is None` (the MusicGen EnCodec variant does not use renormalization).

If no prompt is provided, `prompt_tokens` is set to `None`.
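The prompt-encoding step can be sketched as follows; `FakeCompressionModel` and `encode_prompt` are illustrative stand-ins, not audiocraft APIs:

```python
class FakeCompressionModel:
    """Stand-in for the EnCodec compression model (illustrative only)."""
    def encode(self, wav):
        # The real model returns (discrete tokens of shape [B, K, T], scale);
        # scale is None for the MusicGen variant, which does not renormalize.
        return [[1, 2, 3]], None

def encode_prompt(compression_model, prompt):
    """Sketch of Step 3: encode an optional audio prompt into tokens."""
    if prompt is None:
        return None
    # The real method first moves `prompt` to the model device.
    prompt_tokens, scale = compression_model.encode(prompt)
    assert scale is None, "MusicGen's EnCodec variant does not use renormalization"
    return prompt_tokens

tokens = encode_prompt(FakeCompressionModel(), prompt=[[0.0] * 10])
```

The early `return None` mirrors the method's behavior when no continuation prompt is supplied.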
Key Data Classes
| Class | Location | Fields |
|---|---|---|
| `ConditioningAttributes` | `audiocraft/modules/conditioners.py`, lines 78-126 | `text: Dict[str, Optional[str]]`, `wav: Dict[str, WavCondition]`, `joint_embed: Dict[str, JointEmbedCondition]`, `symbolic: Dict[str, SymbolicCondition]` |
| `WavCondition` | `audiocraft/modules/conditioners.py`, lines 55-60 | `wav: torch.Tensor`, `length: torch.Tensor`, `sample_rate: List[int]`, `path: List[Optional[str]]`, `seek_time: List[Optional[float]]` |
| `ChromaExtractor` | `audiocraft/modules/chroma.py`, lines 16-66 | Extracts 12-bin chromagram features from audio using an STFT and librosa chroma filter banks; used downstream by the `ChromaStemConditioner` when processing the `self_wav` condition. |
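To make the field layout concrete, here is a minimal stand-in for the first two classes, mirroring the fields in the table above (the real definitions live in `audiocraft/modules/conditioners.py`; plain lists replace the `torch.Tensor` fields here):

```python
from dataclasses import dataclass, field
from typing import Dict, List, NamedTuple, Optional

class WavCondition(NamedTuple):
    # Field names mirror audiocraft's WavCondition; lists stand in for tensors.
    wav: list
    length: list
    sample_rate: List[int]
    path: List[Optional[str]]
    seek_time: List[Optional[float]]

@dataclass
class ConditioningAttributes:
    # Mirrors the four condition dictionaries listed in the table.
    text: Dict[str, Optional[str]] = field(default_factory=dict)
    wav: Dict[str, WavCondition] = field(default_factory=dict)
    joint_embed: dict = field(default_factory=dict)
    symbolic: dict = field(default_factory=dict)

attr = ConditioningAttributes(text={'description': 'upbeat jazz piano'})
attr.wav['self_wav'] = WavCondition([[[0.0]]], [0], [32000], [None], [None])
```

This matches the shape of the objects the method returns: one `ConditioningAttributes` per batch sample, with `text['description']` and `wav['self_wav']` populated.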
Callers
This method is called by the following high-level generation methods:
- `MusicGen.generate()` via `BaseGenModel.generate()` -- text-only generation, passes `melody_wavs=None`.
- `MusicGen.generate_with_chroma()` -- melody-conditioned generation, passes melody waveforms after sample rate conversion.
- `MusicGen.generate_continuation()` via `BaseGenModel.generate_continuation()` -- continuation generation, passes the audio prompt to be encoded.
Example Usage
# This method is called internally; typical user code does not call it directly.
# However, for illustration:
# Text-only conditioning (called by model.generate())
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=None,
    melody_wavs=None,
)
# attributes: [ConditioningAttributes(text={'description': 'upbeat jazz piano'}, wav={'self_wav': ...})]
# prompt_tokens: None
# Melody conditioning (called by model.generate_with_chroma())
attributes, prompt_tokens = model._prepare_tokens_and_attributes(
    descriptions=['upbeat jazz piano'],
    prompt=None,
    melody_wavs=[melody_tensor],  # [C, T] tensor
)
Dependencies
- `torch` - Tensor operations, device placement
- `transformers` - T5 encoder used downstream by the condition provider
- `librosa` - Chroma filter banks used by `ChromaExtractor`
- `torchaudio` - Spectrogram computation in `ChromaExtractor`
Related Pages
- Principle: Facebookresearch_Audiocraft_Conditioning_Preparation
- Implementation: Facebookresearch_Audiocraft_MusicGen_set_generation_params - Parameters that govern how the prepared conditions are used during generation.
- Implementation: Facebookresearch_Audiocraft_LMModel_generate - Consumes the conditioning attributes and prompt tokens produced by this method.
- Implementation: Facebookresearch_Audiocraft_EncodecModel_decode - The compression model whose `encode()` method is called here for prompt encoding.
- Heuristic: Facebookresearch_Audiocraft_Chroma_Conditioning_Cache_Requirement