
Implementation:Facebookresearch Audiocraft EncodecModel decode

From Leeroopedia

Summary

EncodecModel.decode converts discrete audio token codes back into continuous audio waveforms. It performs codebook lookup via the residual vector quantizer's decode method to reconstruct continuous latent representations, then passes those latent vectors through the convolutional decoder network to synthesize the output audio signal. An optional scale parameter supports audio denormalization for models trained with input normalization.

API Signature

def decode(self, codes: torch.Tensor, scale: Optional[torch.Tensor] = None) -> torch.Tensor

Parameters

codes (torch.Tensor, required)
    Integer tensor of shape [B, K, T] containing discrete audio codes. B is the batch size, K is the number of codebooks (RVQ levels), and T is the number of time frames. Values are indices into the codebook embeddings.

scale (Optional[torch.Tensor], default None)
    Float tensor containing the scale factor for audio denormalization. Only used when the model was configured with renormalize=True. In MusicGen inference, this is typically None.
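A minimal sketch of constructing a well-formed codes tensor. The concrete values (4 codebooks, 2048-entry codebooks, 50 frames per second) match MusicGen's default EnCodec configuration but are stated here as illustrative assumptions:

```python
import torch

# Assumed shapes for illustration: batch of 2, K=4 codebooks,
# T=150 frames (~3 s at a 50 Hz frame rate), codebook size 2048.
B, K, T, cardinality = 2, 4, 150, 2048

# `codes` must be an integer tensor of codebook indices in [0, cardinality).
codes = torch.randint(0, cardinality, (B, K, T), dtype=torch.long)

assert codes.shape == (B, K, T)
assert codes.dtype == torch.long
```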

Return Value

torch.Tensor
    Decoded audio waveform of shape [B, C, T_audio]. C is the number of audio channels (typically 1 for mono). T_audio is the number of audio samples, which equals approximately T * (sample_rate / frame_rate). Note: the output may contain extra padding samples added by the encoder/decoder architecture.
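The frame-to-sample relationship can be worked through with concrete numbers. The 32 kHz sample rate and 50 Hz frame rate below are the values commonly used by MusicGen's EnCodec, assumed here for illustration:

```python
# Assumed rates: a 32 kHz EnCodec with a 50 Hz frame rate, so each
# code frame covers 32000 / 50 = 640 audio samples.
sample_rate, frame_rate = 32000, 50
T = 150  # number of token frames

t_audio = T * sample_rate // frame_rate
print(t_audio)  # 96000 samples, i.e. 3 seconds of audio
```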

Source Location

  • File: audiocraft/models/encodec.py, lines 240-255
  • Class: EncodecModel (extends CompressionModel)
  • Import: from audiocraft.models.encodec import EncodecModel (typically accessed through MusicGen.compression_model)

Internal Workflow

The decode method executes three sequential operations:

Step 1: Decode Latent (Codebook Lookup)

emb = self.decode_latent(codes)

Calls self.quantizer.decode(codes), which performs:

  1. For each codebook k, looks up the embedding vectors corresponding to the code indices in codes[:, k, :].
  2. Sums the embedding vectors across all K codebooks (residual vector quantization reconstruction).
  3. Returns the reconstructed continuous latent representation.
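The three steps above can be sketched with plain embedding tables. This is a simplified stand-in, not audiocraft's actual quantizer class; the dimensions are assumed for illustration:

```python
import torch
import torch.nn as nn

# Minimal sketch of residual VQ decoding: each of K codebooks is an
# embedding table, and the latent is the sum of per-codebook lookups.
K, cardinality, dim = 4, 2048, 128
codebooks = nn.ModuleList(nn.Embedding(cardinality, dim) for _ in range(K))

def decode_latent(codes: torch.Tensor) -> torch.Tensor:
    # codes: [B, K, T] integer indices -> latent: [B, dim, T]
    emb = sum(codebooks[k](codes[:, k, :]) for k in range(K))  # [B, T, dim]
    return emb.transpose(1, 2)                                  # [B, dim, T]

codes = torch.randint(0, cardinality, (2, K, 50))
latent = decode_latent(codes)
assert latent.shape == (2, dim, 50)
```

Summing, rather than concatenating, the per-codebook embeddings is what makes the quantization residual: each codebook refines the reconstruction left by the previous ones.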

Step 2: Decoder Network

out = self.decoder(emb)

Passes the continuous latent through the convolutional decoder network, which upsamples from the frame rate to the audio sample rate. The decoder mirrors the encoder architecture with transposed convolutions.
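The upsampling step can be illustrated with a single transposed convolution. The real decoder is a multi-stage SEANet network with residual blocks; the single layer and the latent dimension below are assumptions chosen only to show the frame-rate-to-sample-rate shape change:

```python
import torch
import torch.nn as nn

# Toy stand-in for the decoder: one transposed convolution upsampling
# latent frames by a factor of 640 (e.g. 50 Hz frames -> 32 kHz audio).
dim, stride = 128, 640
decoder = nn.ConvTranspose1d(dim, 1, kernel_size=stride, stride=stride)

emb = torch.randn(2, dim, 50)   # [B, dim, T] latent frames
audio = decoder(emb)            # [B, 1, T * stride]
assert audio.shape == (2, 1, 50 * 640)
```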

Step 3: Postprocessing (Denormalization)

out = self.postprocess(out, scale)

If scale is not None (and renormalize=True was set during encoding), multiplies the output by the stored scale factor to restore the original volume. In MusicGen inference, this step is a no-op since scale=None.
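A hedged sketch of what this denormalization amounts to (the function below is illustrative, not audiocraft's postprocess implementation):

```python
import torch
from typing import Optional

# Sketch: when a per-item scale was recorded at encode time
# (renormalize=True), multiply the waveform back by it; otherwise pass
# the audio through unchanged, as in MusicGen inference.
def postprocess(out: torch.Tensor, scale: Optional[torch.Tensor]) -> torch.Tensor:
    if scale is not None:
        out = out * scale.view(-1, 1, 1)  # broadcast one scale per batch item
    return out

audio = torch.ones(2, 1, 8)
assert torch.equal(postprocess(audio, None), audio)  # scale=None: no-op
scaled = postprocess(audio, torch.tensor([2.0, 0.5]))
assert scaled[0, 0, 0].item() == 2.0 and scaled[1, 0, 0].item() == 0.5
```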

How It Is Called

In the MusicGen inference pipeline, decoding is invoked through BaseGenModel.generate_audio():

# In audiocraft/models/genmodel.py, lines 262-267
def generate_audio(self, gen_tokens: torch.Tensor) -> torch.Tensor:
    """Generate Audio from tokens."""
    assert gen_tokens.dim() == 3
    with torch.no_grad():
        gen_audio = self.compression_model.decode(gen_tokens, None)
    return gen_audio

The scale parameter is always passed as None in the generation pipeline, since MusicGen's EnCodec configuration does not use input renormalization.

Stereo Support

For stereo audio models, MusicGen wraps the EncodecModel in an InterleaveStereoCompressionModel (defined at audiocraft/models/encodec.py, lines 397-506). This wrapper:

  1. Splits the interleaved stereo codes into left and right channel codes.
  2. Decodes each channel independently using the underlying mono EncodecModel.decode().
  3. Concatenates the decoded channels along the channel dimension.
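The wrapper's split/decode/concatenate flow can be sketched as follows. The codebook layout assumed here (even indices = left channel, odd = right) is an illustrative guess; consult InterleaveStereoCompressionModel for the actual interleaving scheme:

```python
import torch

# Sketch of the stereo decode path: split interleaved codebooks into
# per-channel [B, K, T] code tensors, decode each with a mono decoder,
# then concatenate along the channel axis.
def decode_stereo(codes: torch.Tensor, mono_decode) -> torch.Tensor:
    left, right = codes[:, 0::2, :], codes[:, 1::2, :]  # each [B, K, T]
    audio_l = mono_decode(left)                          # [B, 1, T_audio]
    audio_r = mono_decode(right)
    return torch.cat([audio_l, audio_r], dim=1)          # [B, 2, T_audio]

# Stand-in for EncodecModel.decode, producing silence of the right shape.
fake_mono_decode = lambda c: torch.zeros(c.shape[0], 1, c.shape[2] * 640)
audio = decode_stereo(torch.randint(0, 2048, (2, 8, 50)), fake_mono_decode)
assert audio.shape == (2, 2, 50 * 640)
```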

Example Usage

# Typically called indirectly through model.generate_audio():
gen_tokens = model.lm.generate(prompt_tokens, attributes, max_gen_len=400, **params)
audio = model.compression_model.decode(gen_tokens, None)
# audio shape: [B, C, T_audio]

# Or through the high-level API:
audio = model.generate(['calm piano music'])
# audio shape: [1, 1, T_audio]  (mono, single sample)

Dependencies

  • torch - Core tensor operations
  • einops - Tensor rearrangement (used in the stereo wrapper)
