Principle:Facebookresearch Audiocraft Encoder Decoder Architecture

Last Updated 2026-02-13 00:00 GMT

Overview

A convolutional encoder-decoder architecture (SEANet) paired with Residual Vector Quantization (RVQ) for learning compact discrete audio representations. The encoder maps raw audio waveforms to a low-dimensional continuous latent space, the RVQ module discretizes this latent into a hierarchy of codebook indices, and the decoder reconstructs audio from the quantized representation. This forms the core of the EnCodec neural audio codec.

Description

The encoder-decoder with RVQ design follows the VQ-VAE family of models but replaces single-level vector quantization with a multi-level residual scheme. The architecture comprises three tightly coupled components:

  • SEANet Encoder -- a fully convolutional network that progressively downsamples the input waveform through strided convolutions. Each downsampling stage is preceded by residual blocks with dilated convolutions that capture multi-scale temporal patterns. The encoder maps an input of shape [B, C, T] to a continuous latent representation of shape [B, D, T'], where T' = T / prod(ratios).
  • Residual Vector Quantizer (RVQ) -- applies K quantizers sequentially to the encoder output. The first quantizer operates on the raw latent; each subsequent quantizer operates on the residual left by the previous one. This hierarchical approach captures progressively finer details of the signal.
  • SEANet Decoder -- mirrors the encoder using transposed convolutions for upsampling. It reconstructs the audio waveform from the quantized latent representation, producing output of the same shape as the original input.

The total downsampling/upsampling factor is the product of the stride ratios. With the default ratios [8, 5, 4, 2], the total stride is 8 * 5 * 4 * 2 = 320, so a 32 kHz input produces latent frames (each carrying one token per active codebook) at 32000 / 320 = 100 Hz.
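The stride arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper names `codec_frame_rate` and `num_frames` are hypothetical and not part of any library, and the frame count assumes the input is padded up to a whole number of hops.

```python
import math


def codec_frame_rate(sample_rate: int, ratios: list[int]) -> tuple[int, float]:
    """Return (hop_length, frame_rate_hz) for the given stride ratios."""
    hop_length = math.prod(ratios)  # total downsampling factor
    return hop_length, sample_rate / hop_length


def num_frames(num_samples: int, hop_length: int) -> int:
    """Latent length T' for T samples (assumption: input padded to a whole hop)."""
    return math.ceil(num_samples / hop_length)


hop, rate = codec_frame_rate(32_000, [8, 5, 4, 2])
frames = num_frames(32_000, hop)  # one second of 32 kHz audio
print(hop, rate, frames)
```

Changing the ratios or the sample rate in the call shows how the token rate seen by a downstream language model scales with the total stride.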

Usage

This architecture is used whenever a discrete tokenization of audio is required. It is the foundational model for:

  • Training the EnCodec audio codec for high-fidelity audio compression at various bitrates
  • Providing discrete audio tokens as input to autoregressive language models such as MusicGen and AudioGen
  • Streaming audio compression with causal convolutions

The number of active RVQ codebooks K directly controls the bitrate: each codebook contributes log2(bins) * frame_rate / 1000 kbps to the total bandwidth, where bins is the number of entries per codebook and frame_rate is the latent frame rate in Hz.
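As a worked example of the per-codebook formula, the sketch below computes the bandwidth for a given number of codebooks. The function name `bandwidth_kbps` is hypothetical, and the codebook size of 1024 entries is an assumed illustrative value, not one stated on this page.

```python
import math


def bandwidth_kbps(n_q: int, bins: int, frame_rate: float) -> float:
    """Bandwidth in kbps for n_q codebooks of `bins` entries at `frame_rate` Hz."""
    bits_per_frame = n_q * math.log2(bins)  # each codebook index costs log2(bins) bits
    return bits_per_frame * frame_rate / 1000


# Assumed values for illustration: 1024 entries per codebook, 100 Hz frame rate.
per_codebook = bandwidth_kbps(1, 1024, 100)
total = bandwidth_kbps(4, 1024, 100)
print(per_codebook, total)
```

Under these assumptions each codebook costs 10 bits per frame, so the bandwidth grows linearly with the number of active codebooks.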

Theoretical Basis

VQ-VAE and Residual Vector Quantization

The architecture builds on the VQ-VAE framework (van den Oord et al., 2017), which learns discrete representations by interposing a vector quantization bottleneck between an encoder and decoder. EnCodec extends this with Residual Vector Quantization (Zeghidour et al., 2021), which achieves higher fidelity at the same codebook size by decomposing the quantization into multiple stages.

The RVQ procedure applies K quantizers sequentially. Let z denote the encoder output. The reconstruction is:

RVQ with K quantizers:
    r_0 = z                                    (initial residual is the encoder output)
    r_k = r_{k-1} - VQ_{k-1}(r_{k-1})         (residual left by quantizer k-1, for k = 1, ..., K-1)
    z_hat = sum_{k=0}^{K-1} VQ_k(r_k)         (reconstructed latent)

Each VQ_k maps its input to the nearest codebook vector:
    VQ_k(x) = argmin_{c in C_k} ||x - c||^2

Bandwidth:
    bw = K * log2(|C|) * frame_rate / 1000     (kbps; |C| is the number of entries per codebook)

Key properties of this formulation:

  • Progressive refinement -- each subsequent quantizer captures finer details that the previous quantizers could not represent, analogous to a greedy pursuit algorithm.
  • Scalable bitrate -- bandwidth is controlled by selecting the number of active codebooks K at inference time without retraining, since quantizer dropout (q_dropout) during training exposes the model to varying numbers of quantizers.
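  • Straight-through estimator (STE) -- gradients flow through the non-differentiable quantization step via the STE trick: z_hat = z + (z_hat - z).detach(). The forward value is still the quantized latent, but in the backward pass the quantizer behaves as the identity, so gradients reach the encoder and the encoder and decoder can be trained end-to-end.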
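The residual recursion and the scalable-bitrate property can be illustrated with a small pure-Python sketch. This is not the EnCodec implementation: the codebooks are random rather than learned, the function names are hypothetical, and a single latent vector stands in for a whole sequence of frames.

```python
import random

Vector = list[float]


def nearest(codebook: list[Vector], x: Vector) -> int:
    """Index of the codebook vector closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(x, codebook[i])),
    )


def rvq_encode(z: Vector, codebooks: list[list[Vector]]) -> tuple[list[int], Vector]:
    """Greedy residual quantization: one index per codebook plus the final residual."""
    residual = list(z)  # r_0 = z
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen code vector to obtain the next residual.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices: list[int], codebooks: list[list[Vector]]) -> Vector:
    """Sum the selected code vectors; passing fewer indices uses fewer codebooks."""
    z_hat = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        z_hat = [a + c for a, c in zip(z_hat, codebook[idx])]
    return z_hat


# Illustrative random codebooks (assumption: real codebooks are learned).
rng = random.Random(0)
dim, bins, n_q = 4, 8, 3
codebooks = [
    [[rng.gauss(0, 1.0 / (k + 1)) for _ in range(dim)] for _ in range(bins)]
    for k in range(n_q)
]
z = [rng.gauss(0, 1) for _ in range(dim)]
indices, residual = rvq_encode(z, codebooks)
z_hat = rvq_decode(indices, codebooks)
coarse = rvq_decode(indices[:1], codebooks)  # lower bitrate: first codebook only
```

By construction the final residual equals z minus the reconstruction, and decoding from a prefix of the indices corresponds to running the codec at a lower bitrate.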

SEANet Architecture

SEANet (Tagliasacchi et al., 2020) is a convolutional encoder-decoder designed for audio generation tasks. Key design choices include:

  • Strided convolutions for downsampling and transposed convolutions for upsampling, each with a kernel size equal to 2 * stride
  • Residual blocks with exponentially increasing dilations (dilation_base^j for layer j), providing a growing receptive field without excessive parameter count
  • Optional LSTM layers at the bottleneck for modeling long-range temporal dependencies
  • Optional causal convolutions enabling streaming inference with constant latency
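The kernel-size and dilation rules above can be tabulated per stage with a short sketch. The function `seanet_stage_plan` is hypothetical, and the defaults chosen here (three residual layers, dilation base 2) are illustrative assumptions rather than values stated on this page; the order in which stages are applied is also not modeled.

```python
def seanet_stage_plan(
    ratios: list[int], n_residual_layers: int = 3, dilation_base: int = 2
) -> list[dict]:
    """List, per stage, the residual-block dilations and the resampling conv settings."""
    plan = []
    for stride in ratios:
        plan.append(
            {
                # Dilation of residual layer j is dilation_base ** j.
                "dilations": [dilation_base**j for j in range(n_residual_layers)],
                "stride": stride,
                # The strided (or transposed) conv uses a kernel of 2 * stride.
                "kernel_size": 2 * stride,
            }
        )
    return plan


plan = seanet_stage_plan([8, 5, 4, 2])
for stage in plan:
    print(stage)
```

With these assumed settings every stage uses dilations 1, 2 and 4, and the resampling kernels are 16, 10, 8 and 4 samples wide.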
