
Implementation:Facebookresearch Audiocraft EncodecModel decode

From Leeroopedia

Summary

EncodecModel.decode converts discrete audio token codes back into continuous audio waveforms. It performs codebook lookup via the residual vector quantizer's decode method to reconstruct continuous latent representations, then passes those latent vectors through the convolutional decoder network to synthesize the output audio signal. An optional scale parameter supports audio denormalization for models trained with input normalization.

API Signature

def decode(self, codes: torch.Tensor, scale: Optional[torch.Tensor] = None) -> torch.Tensor

Parameters

codes (torch.Tensor, required)
    Integer tensor of shape [B, K, T] containing discrete audio codes. B is the batch size, K is the number of codebooks (RVQ levels), and T is the number of time frames. Values are indices into the codebook embeddings.

scale (Optional[torch.Tensor], default None)
    Float tensor containing the scale factor for audio denormalization. Only used when the model was configured with renormalize=True. In MusicGen inference, this is typically None.
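A minimal sketch of constructing a well-formed codes tensor. The concrete values (4 codebooks, 2048-entry codebooks, 50 frames per second) match MusicGen's default EnCodec configuration but are stated here as illustrative assumptions:

```python
import torch

# Assumed shapes for illustration: batch of 2, K=4 codebooks,
# T=150 frames (~3 s at a 50 Hz frame rate), codebook size 2048.
B, K, T, cardinality = 2, 4, 150, 2048

# `codes` must be an integer tensor of codebook indices in [0, cardinality).
codes = torch.randint(0, cardinality, (B, K, T), dtype=torch.long)

assert codes.shape == (B, K, T)
assert codes.dtype == torch.long
```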

Return Value

torch.Tensor
    Decoded audio waveform of shape [B, C, T_audio]. C is the number of audio channels (typically 1 for mono). T_audio is the number of audio samples, which equals approximately T * (sample_rate / frame_rate). Note: the output may contain extra padding samples added by the encoder/decoder architecture.
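The frame-to-sample relationship can be worked through with concrete numbers. The 32 kHz sample rate and 50 Hz frame rate below are the values commonly used by MusicGen's EnCodec, assumed here for illustration:

```python
# Assumed rates: a 32 kHz EnCodec with a 50 Hz frame rate, so each
# code frame covers 32000 / 50 = 640 audio samples.
sample_rate, frame_rate = 32000, 50
T = 150  # number of token frames

t_audio = T * sample_rate // frame_rate
print(t_audio)  # 96000 samples, i.e. 3 seconds of audio
```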

Source Location

  • File: audiocraft/models/encodec.py, lines 240-255
  • Class: EncodecModel (extends CompressionModel)
  • Import: from audiocraft.models.encodec import EncodecModel (typically accessed through MusicGen.compression_model)

Internal Workflow

The decode method executes three sequential operations:

Step 1: Decode Latent (Codebook Lookup)

emb = self.decode_latent(codes)

Calls self.quantizer.decode(codes), which performs:

  1. For each codebook k, looks up the embedding vectors corresponding to the code indices in codes[:, k, :].
  2. Sums the embedding vectors across all K codebooks (residual vector quantization reconstruction).
  3. Returns the reconstructed continuous latent representation.
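The three steps above can be sketched with plain embedding tables. This is a simplified stand-in, not audiocraft's actual quantizer class; the dimensions are assumed for illustration:

```python
import torch
import torch.nn as nn

# Minimal sketch of residual VQ decoding: each of K codebooks is an
# embedding table, and the latent is the sum of per-codebook lookups.
K, cardinality, dim = 4, 2048, 128
codebooks = nn.ModuleList(nn.Embedding(cardinality, dim) for _ in range(K))

def decode_latent(codes: torch.Tensor) -> torch.Tensor:
    # codes: [B, K, T] integer indices -> latent: [B, dim, T]
    emb = sum(codebooks[k](codes[:, k, :]) for k in range(K))  # [B, T, dim]
    return emb.transpose(1, 2)                                  # [B, dim, T]

codes = torch.randint(0, cardinality, (2, K, 50))
latent = decode_latent(codes)
assert latent.shape == (2, dim, 50)
```

Summing, rather than concatenating, the per-codebook embeddings is what makes the quantization residual: each codebook refines the reconstruction left by the previous ones.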

Step 2: Decoder Network

out = self.decoder(emb)

Passes the continuous latent through the convolutional decoder network, which upsamples from the frame rate to the audio sample rate. The decoder mirrors the encoder architecture with transposed convolutions.
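The upsampling step can be illustrated with a single transposed convolution. The real decoder is a multi-stage SEANet network with residual blocks; the single layer and the latent dimension below are assumptions chosen only to show the frame-rate-to-sample-rate shape change:

```python
import torch
import torch.nn as nn

# Toy stand-in for the decoder: one transposed convolution upsampling
# latent frames by a factor of 640 (e.g. 50 Hz frames -> 32 kHz audio).
dim, stride = 128, 640
decoder = nn.ConvTranspose1d(dim, 1, kernel_size=stride, stride=stride)

emb = torch.randn(2, dim, 50)   # [B, dim, T] latent frames
audio = decoder(emb)            # [B, 1, T * stride]
assert audio.shape == (2, 1, 50 * 640)
```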

Step 3: Postprocessing (Denormalization)

out = self.postprocess(out, scale)

If scale is not None (and renormalize=True was set during encoding), multiplies the output by the stored scale factor to restore the original volume. In MusicGen inference, this step is a no-op since scale=None.
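A hedged sketch of what this denormalization amounts to (the function below is illustrative, not audiocraft's postprocess implementation):

```python
import torch
from typing import Optional

# Sketch: when a per-item scale was recorded at encode time
# (renormalize=True), multiply the waveform back by it; otherwise pass
# the audio through unchanged, as in MusicGen inference.
def postprocess(out: torch.Tensor, scale: Optional[torch.Tensor]) -> torch.Tensor:
    if scale is not None:
        out = out * scale.view(-1, 1, 1)  # broadcast one scale per batch item
    return out

audio = torch.ones(2, 1, 8)
assert torch.equal(postprocess(audio, None), audio)  # scale=None: no-op
scaled = postprocess(audio, torch.tensor([2.0, 0.5]))
assert scaled[0, 0, 0].item() == 2.0 and scaled[1, 0, 0].item() == 0.5
```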

How It Is Called

In the MusicGen inference pipeline, decoding is invoked through BaseGenModel.generate_audio():

# In audiocraft/models/genmodel.py, lines 262-267
def generate_audio(self, gen_tokens: torch.Tensor) -> torch.Tensor:
    """Generate Audio from tokens."""
    assert gen_tokens.dim() == 3
    with torch.no_grad():
        gen_audio = self.compression_model.decode(gen_tokens, None)
    return gen_audio

The scale parameter is always passed as None in the generation pipeline, since MusicGen's EnCodec configuration does not use input renormalization.

Stereo Support

For stereo audio models, MusicGen wraps the EncodecModel in an InterleaveStereoCompressionModel (defined at audiocraft/models/encodec.py, lines 397-506). This wrapper:

  1. Splits the interleaved stereo codes into left and right channel codes.
  2. Decodes each channel independently using the underlying mono EncodecModel.decode().
  3. Concatenates the decoded channels along the channel dimension.
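The wrapper's split/decode/concatenate flow can be sketched as follows. The codebook layout assumed here (even indices = left channel, odd = right) is an illustrative guess; consult InterleaveStereoCompressionModel for the actual interleaving scheme:

```python
import torch

# Sketch of the stereo decode path: split interleaved codebooks into
# per-channel [B, K, T] code tensors, decode each with a mono decoder,
# then concatenate along the channel axis.
def decode_stereo(codes: torch.Tensor, mono_decode) -> torch.Tensor:
    left, right = codes[:, 0::2, :], codes[:, 1::2, :]  # each [B, K, T]
    audio_l = mono_decode(left)                          # [B, 1, T_audio]
    audio_r = mono_decode(right)
    return torch.cat([audio_l, audio_r], dim=1)          # [B, 2, T_audio]

# Stand-in for EncodecModel.decode, producing silence of the right shape.
fake_mono_decode = lambda c: torch.zeros(c.shape[0], 1, c.shape[2] * 640)
audio = decode_stereo(torch.randint(0, 2048, (2, 8, 50)), fake_mono_decode)
assert audio.shape == (2, 2, 50 * 640)
```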

Example Usage

# Typically called indirectly through model.generate_audio():
gen_tokens = model.lm.generate(prompt_tokens, attributes, max_gen_len=400, **params)
audio = model.compression_model.decode(gen_tokens, None)
# audio shape: [B, C, T_audio]

# Or through the high-level API:
audio = model.generate(['calm piano music'])
# audio shape: [1, 1, T_audio]  (mono, single sample)

Dependencies

  • torch - Core tensor operations
  • einops - Tensor rearrangement (used in the stereo wrapper)
