Implementation: Facebookresearch Audiocraft EncodecModel.decode
Summary
EncodecModel.decode converts discrete audio token codes back into continuous audio waveforms. It performs codebook lookup via the residual vector quantizer's decode method to reconstruct continuous latent representations, then passes those latent vectors through the convolutional decoder network to synthesize the output audio signal. An optional scale parameter supports audio denormalization for models trained with input normalization.
API Signature
def decode(self, codes: torch.Tensor, scale: Optional[torch.Tensor] = None) -> torch.Tensor
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| codes | torch.Tensor | (required) | Integer tensor of shape [B, K, T] containing discrete audio codes. B is the batch size, K is the number of codebooks (RVQ levels), and T is the number of time frames. Values are indices into the codebook embeddings. |
| scale | Optional[torch.Tensor] | None | Float tensor containing the scale factor for audio denormalization. Only used when the model was configured with renormalize=True. In MusicGen inference, this is typically None. |
Return Value
| Type | Description |
|---|---|
| torch.Tensor | Decoded audio waveform of shape [B, C, T_audio]. C is the number of audio channels (typically 1 for mono). T_audio is the number of audio samples, approximately T * (sample_rate / frame_rate). Note: the output may contain extra padding samples added by the encoder/decoder architecture. |
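As a rough sanity check on the output length, the relationship T_audio ≈ T * (sample_rate / frame_rate) can be computed directly. The 32 kHz sample rate and 50 Hz frame rate below are illustrative assumptions (they match MusicGen's commonly cited EnCodec configuration, but verify against your checkpoint):

```python
# Illustrative numbers, not read from a real model:
sample_rate = 32000   # audio samples per second (assumption)
frame_rate = 50       # latent frames per second (assumption)
T = 400               # token frames, e.g. 8 seconds of audio

samples_per_frame = sample_rate // frame_rate  # 640
T_audio = T * samples_per_frame
print(T_audio)  # 256000 samples, i.e. 8 s at 32 kHz
```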
Source Location
- File: audiocraft/models/encodec.py, lines 240-255
- Class: EncodecModel (extends CompressionModel)
- Import: from audiocraft.models.encodec import EncodecModel (typically accessed through MusicGen.compression_model)
Internal Workflow
The decode method executes three sequential operations:
Step 1: Decode Latent (Codebook Lookup)
emb = self.decode_latent(codes)
Calls self.quantizer.decode(codes), which performs:
- For each codebook k, looks up the embedding vectors corresponding to the code indices in codes[:, k, :].
- Sums the embedding vectors across all codebooks (residual vector quantization reconstruction).
- Returns the reconstructed continuous latent representation.
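The lookup-and-sum step can be sketched in a few lines. This is a minimal illustration of residual VQ decoding, not audiocraft's actual quantizer internals; all shapes, the table list, and the cardinality are made-up values:

```python
import torch

# Residual VQ decode sketch: each codebook k has its own embedding table;
# the reconstructed latent is the SUM of the per-codebook lookups.
B, K, T, D = 2, 4, 10, 128           # batch, codebooks, frames, latent dim (illustrative)
cardinality = 2048                    # entries per codebook (illustrative)
tables = [torch.randn(cardinality, D) for _ in range(K)]
codes = torch.randint(0, cardinality, (B, K, T))

emb = torch.zeros(B, D, T)
for k in range(K):
    # tables[k][codes[:, k, :]] has shape [B, T, D]; transpose to [B, D, T]
    emb += tables[k][codes[:, k, :]].transpose(1, 2)

print(emb.shape)  # torch.Size([2, 128, 10])
```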
Step 2: Decoder Network
out = self.decoder(emb)
Passes the continuous latent through the convolutional decoder network, which upsamples from the frame rate to the audio sample rate. The decoder mirrors the encoder architecture with transposed convolutions.
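The temporal upsampling performed by the decoder can be illustrated with a single transposed convolution. This toy layer is not audiocraft's actual decoder (which stacks several such layers with residual units); the channel counts, kernel size, and stride are arbitrary:

```python
import torch
import torch.nn as nn

# One transposed-conv layer upsampling the time axis by its stride factor.
upsample = nn.ConvTranspose1d(in_channels=128, out_channels=64,
                              kernel_size=8, stride=4, padding=2)
latent = torch.randn(1, 128, 50)  # [B, D, T] at the frame rate
out = upsample(latent)
# L_out = (L_in - 1) * stride - 2 * padding + kernel_size = 49*4 - 4 + 8 = 200
print(out.shape)  # torch.Size([1, 64, 200])
```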
Step 3: Postprocessing (Denormalization)
out = self.postprocess(out, scale)
If scale is not None (and renormalize=True was set during encoding), multiplies the output by the stored scale factor to restore the original volume. In MusicGen inference, this step is a no-op since scale=None.
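The denormalization logic reduces to a conditional broadcast multiply. The sketch below assumes a per-item scale of shape [B, 1]; the function name and shapes are illustrative, not audiocraft's exact implementation:

```python
from typing import Optional
import torch

def postprocess(out: torch.Tensor, scale: Optional[torch.Tensor]) -> torch.Tensor:
    """Sketch of the denorm step: multiply by the stored scale, else no-op."""
    if scale is not None:
        # broadcast the per-item scale [B, 1] over channels and time
        out = out * scale.view(-1, 1, 1)
    return out

x = torch.ones(2, 1, 8)
assert torch.equal(postprocess(x, None), x)     # scale=None: no-op (MusicGen path)
y = postprocess(x, torch.tensor([[2.0], [3.0]]))
print(y[0, 0, 0].item(), y[1, 0, 0].item())  # 2.0 3.0
```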
How It Is Called
In the MusicGen inference pipeline, decoding is invoked through BaseGenModel.generate_audio():
# In audiocraft/models/genmodel.py, lines 262-267
def generate_audio(self, gen_tokens: torch.Tensor) -> torch.Tensor:
"""Generate Audio from tokens."""
assert gen_tokens.dim() == 3
with torch.no_grad():
gen_audio = self.compression_model.decode(gen_tokens, None)
return gen_audio
The scale parameter is always passed as None in the generation pipeline, since MusicGen's EnCodec configuration does not use input renormalization.
Stereo Support
For stereo audio models, MusicGen wraps the EncodecModel in an InterleaveStereoCompressionModel (defined at audiocraft/models/encodec.py, lines 397-506). This wrapper:
- Splits the interleaved stereo codes into left and right channel codes.
- Decodes each channel independently using the underlying mono EncodecModel.decode().
- Concatenates the decoded channels along the channel dimension.
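The shape flow of the stereo path can be sketched as split, decode twice, concatenate. The real InterleaveStereoCompressionModel interleaves codes per codebook rather than using the simple first-half/second-half split shown here, and mono_decode is a stand-in for the wrapped model, so treat this purely as a shape illustration:

```python
import torch

def mono_decode(codes: torch.Tensor) -> torch.Tensor:
    """Stand-in for the mono EncodecModel.decode: [B, K, T] -> [B, 1, T_audio]."""
    B, K, T = codes.shape
    return torch.zeros(B, 1, T * 640)  # 640 samples/frame is an assumption

def stereo_decode(codes: torch.Tensor) -> torch.Tensor:
    B, two_k, T = codes.shape
    K = two_k // 2
    left, right = codes[:, :K], codes[:, K:]   # simplified channel split
    # decode each channel, then stack along the channel axis
    return torch.cat([mono_decode(left), mono_decode(right)], dim=1)

codes = torch.randint(0, 2048, (1, 8, 10))  # [B, 2K, T] interleaved stereo codes
audio = stereo_decode(codes)
print(audio.shape)  # torch.Size([1, 2, 6400])
```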
Example Usage
# Typically called indirectly through model.generate_audio():
gen_tokens = model.lm.generate(prompt_tokens, attributes, max_gen_len=400, **params)
audio = model.compression_model.decode(gen_tokens, None)
# audio shape: [B, C, T_audio]
# Or through the high-level API:
audio = model.generate(['calm piano music'])
# audio shape: [1, 1, T_audio] (mono, single sample)
Dependencies
- torch - Core tensor operations
- einops - Tensor rearrangement (used in the stereo wrapper)
Related Pages
- Principle:Facebookresearch_Audiocraft_Audio_Token_Decoding
- Implementation:Facebookresearch_Audiocraft_LMModel_generate - Produces the discrete tokens that this method decodes.
- Implementation:Facebookresearch_Audiocraft_Audio_write - Writes the decoded audio waveform to disk.
- Implementation:Facebookresearch_Audiocraft_MusicGen_get_pretrained - Loads the compression model used for decoding.