Principle:Facebookresearch Audiocraft Encoder Decoder Architecture
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A convolutional encoder-decoder architecture (SEANet) paired with Residual Vector Quantization (RVQ) for learning compact discrete audio representations. The encoder maps raw audio waveforms to a low-dimensional continuous latent space, the RVQ module discretizes this latent into a hierarchy of codebook indices, and the decoder reconstructs audio from the quantized representation. This forms the core of the EnCodec neural audio codec.
Description
The encoder-decoder with RVQ design follows the VQ-VAE family of models but replaces single-level vector quantization with a multi-level residual scheme. The architecture comprises three tightly coupled components:
- SEANet Encoder -- a fully convolutional network that progressively downsamples the input waveform through strided convolutions. Each downsampling stage is preceded by residual blocks with dilated convolutions that capture multi-scale temporal patterns. The encoder maps an input of shape [B, C, T] to a continuous latent representation of shape [B, D, T'], where T' = T / prod(ratios).
- Residual Vector Quantizer (RVQ) -- applies K quantizers sequentially to the encoder output. The first quantizer operates on the raw latent; each subsequent quantizer operates on the residual left by the previous one. This hierarchical approach captures progressively finer details of the signal.
- SEANet Decoder -- mirrors the encoder using transposed convolutions for upsampling. It reconstructs the audio waveform from the quantized latent representation, producing output of the same shape as the original input.
The total downsampling/upsampling factor is the product of the stride ratios. With the default ratios [8, 5, 4, 2], the total stride is 8 * 5 * 4 * 2 = 320, so a 32 kHz input produces latent frames (and therefore tokens per codebook) at 32000 / 320 = 100 Hz.
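The stride arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not code from the library: the helper name latent_frames is hypothetical, and rounding up to a whole frame is an assumption about how a partial final frame would be padded.

```python
from math import prod

def latent_frames(num_samples: int, ratios: list[int]) -> int:
    """Number of latent frames T' for a waveform of num_samples samples.

    Assumption: a trailing partial frame is padded to a full one,
    so the division rounds up.
    """
    hop_length = prod(ratios)
    return -(-num_samples // hop_length)  # ceiling division

ratios = [8, 5, 4, 2]                    # default ratios quoted above
sample_rate = 32_000                     # 32 kHz input
hop_length = prod(ratios)                # 320 samples per latent frame
frame_rate = sample_rate / hop_length    # 100.0 frames per second
one_second = latent_frames(sample_rate, ratios)  # 100 frames
```

Under these assumptions, one second of 32 kHz audio of shape [B, C, 32000] maps to a latent of shape [B, D, 100].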
Usage
This architecture is used whenever a discrete tokenization of audio is required. It is the foundational model for:
- Training the EnCodec audio codec for high-fidelity audio compression at various bitrates
- Providing discrete audio tokens as input to autoregressive language models such as MusicGen and AudioGen
- Streaming audio compression with causal convolutions
The number of active RVQ codebooks K directly controls the bitrate: each codebook contributes log2(bins) * frame_rate / 1000 kbps to the total bandwidth.
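As a rough illustration of the bitrate relationship, the sketch below evaluates the formula for a few values of K. The codebook size of 1024 entries is an assumption chosen for the example, and bandwidth_kbps is a hypothetical helper, not a library function.

```python
import math

def bandwidth_kbps(n_codebooks: int, bins: int, frame_rate: float) -> float:
    """Total bitrate in kbps: K * log2(bins) * frame_rate / 1000."""
    return n_codebooks * math.log2(bins) * frame_rate / 1000

FRAME_RATE = 100   # Hz, from the 32 kHz / stride-320 example
BINS = 1024        # assumed codebook size: 10 bits per index

# Each extra codebook adds the same increment to the total bitrate.
table = {k: bandwidth_kbps(k, BINS, FRAME_RATE) for k in (1, 2, 4, 8)}
```

With these assumed values each codebook costs 10 bits per frame, so the bitrate grows linearly with the number of active codebooks.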
Theoretical Basis
VQ-VAE and Residual Vector Quantization
The architecture builds on the VQ-VAE framework (van den Oord et al., 2017), which learns discrete representations by interposing a vector quantization bottleneck between an encoder and decoder. EnCodec extends this with Residual Vector Quantization (Zeghidour et al., 2021), which achieves higher fidelity at the same codebook size by decomposing the quantization into multiple stages.
The RVQ procedure applies K quantizers sequentially. Let z denote the encoder output. The reconstruction is:
RVQ with K quantizers:
r_0 = z (initial residual is the encoder output)
r_k = r_{k-1} - VQ_{k-1}(r_{k-1}) for k = 1, ..., K-1 (residual left after quantizer k-1)
z_hat = sum_{k=0}^{K-1} VQ_k(r_k) (reconstructed latent)
Each VQ_k maps its input to the nearest codebook vector:
VQ_k(x) = argmin_{c in C_k} ||x - c||^2
Bandwidth:
bw = K * log2(|C|) * frame_rate / 1000 (kbps)
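The recursion above can be sketched in plain Python for a single latent vector. This is an illustrative toy, not the library implementation: the function names and the two tiny codebooks are made up for the example, and real codebooks are learned and applied per frame over batched tensors.

```python
def nearest(codebook, x):
    """Index and entry of the codebook vector closest to x (squared L2)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]

def rvq_encode(z, codebooks):
    """Quantize z stage by stage; return the indices and the reconstruction."""
    residual = list(z)               # r_0 = z
    z_hat = [0.0] * len(z)
    indices = []
    for codebook in codebooks:
        idx, q = nearest(codebook, residual)                 # VQ_k(r_k)
        indices.append(idx)
        z_hat = [h + c for h, c in zip(z_hat, q)]            # running sum
        residual = [r - c for r, c in zip(residual, q)]      # r_{k+1}
    return indices, z_hat

def rvq_decode(indices, codebooks):
    """Sum the selected entries; zip stops early if fewer indices are given."""
    z_hat = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        z_hat = [h + c for h, c in zip(z_hat, codebook[idx])]
    return z_hat

# Toy codebooks (hypothetical): coarse first stage, finer second stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25], [-0.25, -0.25]],
]
indices, z_hat = rvq_encode([0.8, 0.7], codebooks)
```

In this toy case the first stage alone reconstructs [1.0, 1.0], and adding the second stage refines it to [0.75, 0.75], which is closer to the input [0.8, 0.7].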
Key properties of this formulation:
- Progressive refinement -- each subsequent quantizer captures finer details that the previous quantizers could not represent, analogous to a greedy pursuit algorithm.
- Scalable bitrate -- bandwidth is controlled by selecting the number of active codebooks K at inference time without retraining, since quantizer dropout (q_dropout) during training exposes the model to varying numbers of quantizers.
- Straight-through estimator (STE) -- gradients flow through the non-differentiable quantization step via the STE trick: z_hat = z + (z_hat - z).detach(), allowing the encoder and decoder to be trained end-to-end.
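One way to picture quantizer dropout and inference-time bitrate selection is sketched below. It assumes dropout means drawing the number of active quantizers uniformly from 1..n_q at each training step; the function names and the nested-list layout of the codes are hypothetical, and the actual training code may differ.

```python
import random

def sample_active_quantizers(n_q: int, training: bool, q_dropout: bool,
                             rng: random.Random) -> int:
    """How many quantizers to apply on this forward pass.

    Assumption: during training with dropout enabled, the count is
    drawn uniformly from 1..n_q; otherwise all n_q are used.
    """
    if training and q_dropout:
        return rng.randint(1, n_q)  # inclusive on both ends
    return n_q

def truncate_codes(codes, k):
    """Keep only the first k codebooks of codes laid out as [K][T']."""
    return codes[:k]

rng = random.Random(0)
draws = [sample_active_quantizers(8, True, True, rng) for _ in range(1000)]

# At inference, a lower bitrate is obtained by dropping the later codebooks.
codes = [[3, 1, 4], [1, 5, 9], [2, 6, 5], [3, 5, 8]]   # K = 4, T' = 3
low_rate = truncate_codes(codes, 2)
```

Because later quantizers only refine the residual left by earlier ones, truncating to the first k codebooks still yields a valid, coarser reconstruction.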
SEANet Architecture
SEANet (Tagliasacchi et al.) is a convolutional encoder-decoder designed for audio generation tasks. Key design choices include:
- Strided convolutions with kernel sizes equal to 2 * stride for downsampling and upsampling
- Residual blocks with exponentially increasing dilations (dilation_base^j for layer j), providing a growing receptive field without excessive parameter count
- Optional LSTM layers at the bottleneck for modeling long-range temporal dependencies
- Optional causal convolutions enabling streaming inference with constant latency
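The kernel-size and dilation rules listed above can be made concrete with a short calculation. The values below (three residual layers, dilation base 2, residual kernel size 3) are assumptions chosen for illustration, the helper names are hypothetical, and the receptive-field formula counts only the stride-1 dilated convolutions inside one residual stack.

```python
def resample_kernel_sizes(ratios):
    """Kernel size of each strided (or transposed) convolution: 2 * stride."""
    return [2 * r for r in ratios]

def block_dilations(n_layers, dilation_base=2):
    """Dilation of residual layer j: dilation_base ** j."""
    return [dilation_base ** j for j in range(n_layers)]

def receptive_field(kernel_size, dilations):
    """Receptive field, in time steps, of stacked stride-1 dilated convs."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

ratios = [8, 5, 4, 2]
kernels = resample_kernel_sizes(ratios)          # [16, 10, 8, 4]
dilations = block_dilations(3, dilation_base=2)  # [1, 2, 4]
rf = receptive_field(3, dilations)               # 15
```

With the same three kernel-3 layers and no dilation, the receptive field would be only 7 time steps, which is the sense in which dilation widens context without adding parameters.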