Implementation:Facebookresearch Audiocraft SEANet and RVQ
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete implementation of the SEANet encoder-decoder and Residual Vector Quantizer within Audiocraft. The SEANetEncoder and SEANetDecoder classes provide the convolutional backbone for audio compression, while ResidualVectorQuantizer provides the multi-level discrete bottleneck. Together these form the core components of the EnCodec model.
Description
The SEANet encoder is a stack of residual blocks followed by strided convolutions that progressively downsample the input waveform. The decoder mirrors this structure using transposed convolutions for upsampling. The RVQ module wraps an inner ResidualVectorQuantization (from core_vq.py) that applies K layers of VectorQuantization, each with its own EuclideanCodebook.
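The residual quantization idea described above can be sketched in plain Python. This is a toy illustration, not the Audiocraft code: the helper names (`nearest_code`, `rvq_encode`, `rvq_decode`) and the hand-written codebooks are hypothetical, and the real modules operate on batched tensors with learned codebooks. Each layer picks the nearest entry for whatever residual the previous layers left behind, and decoding sums the selected entries.

```python
from typing import List, Sequence

Codebook = Sequence[Sequence[float]]


def nearest_code(vec: Sequence[float], codebook: Codebook) -> int:
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def sq_dist(entry: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(vec, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def rvq_encode(vec: Sequence[float], codebooks: Sequence[Codebook]) -> List[int]:
    """Quantize vec with K codebooks; each layer quantizes the previous layer's residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        indices.append(idx)
        # Subtract the chosen entry so the next layer only sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Codebook]) -> List[float]:
    """Reconstruct a vector by summing the selected entry of every codebook."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical 2-D codebooks: a coarse layer followed by a finer one.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]],
]
codes = rvq_encode([1.2, 0.9], codebooks)   # [1, 1]
approx = rvq_decode(codes, codebooks)       # [1.25, 1.0]
```

Using more layers refines the approximation, which is why the number of quantizers K trades bitrate against reconstruction quality.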
The encoder reverses the provided ratios internally, so the ratios [8, 5, 4, 2] passed to both classes are applied by the decoder as upsampling strides [8, 5, 4, 2] and by the encoder as downsampling strides [2, 4, 5, 8]. This keeps the encoder and decoder symmetric. Each encoder stage doubles the channel count, starting from n_filters; each decoder stage halves it, ending at n_filters.
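To make the ratio reversal and channel progression concrete, here is a small sketch that only does the bookkeeping and builds no layers. The function names are hypothetical, and it assumes the decoder begins at the encoder's final width, n_filters * 2**len(ratios), and halves back down to n_filters.

```python
from math import prod
from typing import List, Tuple

Stage = Tuple[int, int, int]  # (stride, in_channels, out_channels)


def encoder_stage_plan(ratios: List[int], n_filters: int) -> List[Stage]:
    """Strides and channel widths of each encoder downsampling stage."""
    plan = []
    mult = 1
    for stride in reversed(ratios):  # the encoder applies the ratios in reverse order
        plan.append((stride, mult * n_filters, mult * 2 * n_filters))
        mult *= 2
    return plan


def decoder_stage_plan(ratios: List[int], n_filters: int) -> List[Stage]:
    """Strides and channel widths of each decoder upsampling stage (assumed mirror)."""
    plan = []
    mult = 2 ** len(ratios)
    for stride in ratios:  # the decoder applies the ratios as given
        plan.append((stride, mult * n_filters, mult * n_filters // 2))
        mult //= 2
    return plan


ratios = [8, 5, 4, 2]
enc = encoder_stage_plan(ratios, n_filters=32)
# [(2, 32, 64), (4, 64, 128), (5, 128, 256), (8, 256, 512)]
dec = decoder_stage_plan(ratios, n_filters=32)
# [(8, 512, 256), (5, 256, 128), (4, 128, 64), (2, 64, 32)]
hop_length = prod(ratios)  # 320 input samples per latent frame
```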
Usage
Import when building or inspecting an EnCodec model:
from audiocraft.modules.seanet import SEANetEncoder, SEANetDecoder
from audiocraft.quantization.vq import ResidualVectorQuantizer
These components are typically instantiated by models.builders.get_compression_model() from a Hydra config, but can also be used directly for custom architectures.
Code Reference
Source Location
- Repository: facebookresearch/audiocraft
- File: audiocraft/modules/seanet.py (lines 63-153 for encoder, lines 156-258 for decoder)
- File: audiocraft/quantization/vq.py (lines 16-115 for ResidualVectorQuantizer)
- File: audiocraft/quantization/core_vq.py (lines 351-404 for ResidualVectorQuantization, lines 87-219 for EuclideanCodebook)
Signature
class SEANetEncoder(nn.Module):
    def __init__(
        self,
        channels: int = 1,
        dimension: int = 128,
        n_filters: int = 32,
        n_residual_layers: int = 3,
        ratios: List[int] = [8, 5, 4, 2],
        activation: str = 'ELU',
        activation_params: dict = {'alpha': 1.0},
        norm: str = 'none',
        norm_params: Dict[str, Any] = {},
        kernel_size: int = 7,
        last_kernel_size: int = 7,
        residual_kernel_size: int = 3,
        dilation_base: int = 2,
        causal: bool = False,
        pad_mode: str = 'reflect',
        true_skip: bool = True,
        compress: int = 2,
        lstm: int = 0,
        disable_norm_outer_blocks: int = 0,
    ):
        ...
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, T] -> output: [B, dimension, T']
        ...
class SEANetDecoder(nn.Module):
    def __init__(
        self,
        channels: int = 1,
        dimension: int = 128,
        n_filters: int = 32,
        n_residual_layers: int = 3,
        ratios: List[int] = [8, 5, 4, 2],
        activation: str = 'ELU',
        activation_params: dict = {'alpha': 1.0},
        final_activation: Optional[str] = None,
        final_activation_params: Optional[dict] = None,
        norm: str = 'none',
        norm_params: Dict[str, Any] = {},
        kernel_size: int = 7,
        last_kernel_size: int = 7,
        residual_kernel_size: int = 3,
        dilation_base: int = 2,
        causal: bool = False,
        pad_mode: str = 'reflect',
        true_skip: bool = True,
        compress: int = 2,
        lstm: int = 0,
        disable_norm_outer_blocks: int = 0,
        trim_right_ratio: float = 1.0,
    ):
        ...
    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: [B, dimension, T'] -> output: [B, channels, T]
        ...
class ResidualVectorQuantizer(BaseQuantizer):
    def __init__(
        self,
        dimension: int = 256,
        n_q: int = 8,
        q_dropout: bool = False,
        bins: int = 1024,
        decay: float = 0.99,
        kmeans_init: bool = True,
        kmeans_iters: int = 10,
        threshold_ema_dead_code: float = 2.,
        orthogonal_reg_weight: float = 0.0,
        orthogonal_reg_active_codes_only: bool = False,
        orthogonal_reg_max_codes: Optional[int] = None,
    ):
        ...
    def forward(self, x: torch.Tensor, frame_rate: int) -> QuantizedResult:
        ...
    def encode(self, x: torch.Tensor) -> torch.Tensor:
        ...
    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        ...
Import
from audiocraft.modules.seanet import SEANetEncoder, SEANetDecoder
from audiocraft.quantization.vq import ResidualVectorQuantizer
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| x (encoder) | torch.Tensor [B, C, T] | Raw audio waveform. B = batch size, C = audio channels (typically 1 for mono), T = number of samples. |
| z (decoder) | torch.Tensor [B, D, T'] | Quantized latent representation. D = dimension, T' = T / prod(ratios). |
| x (RVQ forward) | torch.Tensor [B, D, T'] | Continuous encoder output to be quantized. |
| frame_rate (RVQ forward) | int | Token frame rate in Hz, used for bandwidth calculation. |
| codes (RVQ decode) | torch.Tensor [B, K, T'] | Discrete codes from K codebooks. |
Outputs
| Name | Type | Description |
|---|---|---|
| Encoder output | torch.Tensor [B, D, T'] | Continuous latent representation. T' = T / prod(ratios); with the default ratios [8, 5, 4, 2] the total stride is 320, so 32 kHz audio yields a 100 Hz frame rate. |
| Decoder output | torch.Tensor [B, C, T] | Reconstructed audio waveform at the original sample rate and channel count. |
| RVQ forward | QuantizedResult | Dataclass containing: x (quantized tensor [B, D, T']), codes (discrete indices [B, K, T']), bandwidth (tensor, kbps), penalty (commitment loss). |
| RVQ encode | torch.Tensor [B, K, T'] | Discrete codebook indices for all K quantizers. |
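The frame-rate and bandwidth figures in the tables above follow from simple arithmetic, sketched here. The helper names are hypothetical. Two assumptions are made: the frame count rounds up when T is not a multiple of the total stride, and bandwidth counts log2(bins) bits per code for each of the K codebooks at the given frame rate.

```python
import math
from typing import List


def frame_count(num_samples: int, ratios: List[int]) -> int:
    """Number of latent frames T' for T input samples (assumed to round up)."""
    return math.ceil(num_samples / math.prod(ratios))


def frame_rate_hz(sample_rate: int, ratios: List[int]) -> float:
    """Latent frames per second of audio."""
    return sample_rate / math.prod(ratios)


def bandwidth_kbps(n_q: int, bins: int, frame_rate: float) -> float:
    """Bitrate in kbps: n_q codebooks, log2(bins) bits each, frame_rate times per second."""
    return n_q * math.log2(bins) * frame_rate / 1000


ratios = [8, 5, 4, 2]
frames = frame_count(32000, ratios)        # 100 frames for 1 s at 32 kHz
rate = frame_rate_hz(32000, ratios)        # 100.0 Hz
kbps = bandwidth_kbps(8, 1024, rate)       # 8.0 kbps (8 codebooks x 10 bits x 100 Hz)
```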
Usage Examples
Example 1: Encoding Audio to Discrete Tokens
Encoding raw audio through the SEANet encoder and RVQ to produce discrete codes.
import torch
from audiocraft.modules.seanet import SEANetEncoder
from audiocraft.quantization.vq import ResidualVectorQuantizer
encoder = SEANetEncoder(channels=1, dimension=128, n_filters=32, ratios=[8, 5, 4, 2])
quantizer = ResidualVectorQuantizer(dimension=128, n_q=8, bins=1024)
# Raw mono audio at 32kHz, 1 second
audio = torch.randn(1, 1, 32000)
# Encode to continuous latent: [1, 128, 100]
latent = encoder(audio)
# Quantize: produces codes [1, 8, 100] at 100Hz frame rate
qres = quantizer(latent, frame_rate=100)
codes = qres.codes # [1, 8, 100] -- 8 codebooks, 100 frames
bandwidth = qres.bandwidth # bandwidth in kbps
Example 2: Decoding Tokens Back to Audio
Reconstructing audio from discrete codes through the RVQ decoder and SEANet decoder.
from audiocraft.modules.seanet import SEANetDecoder
decoder = SEANetDecoder(channels=1, dimension=128, n_filters=32, ratios=[8, 5, 4, 2])
# Reuse `quantizer` and `codes` from Example 1. Codes are only meaningful with the
# codebooks that produced them, so do not construct a new ResidualVectorQuantizer here.
# Decode discrete codes back to continuous latent
quantized_latent = quantizer.decode(codes) # [1, 128, 100]
# Decode latent to waveform
reconstructed_audio = decoder(quantized_latent) # [1, 1, 32000]