Implementation:Facebookresearch Audiocraft SEANet and RVQ

**Metadata**
Knowledge Sources	Facebookresearch Audiocraft
Domains	Audio_Compression Neural_Codecs Representation_Learning
Last Updated	2026-02-13 00:00 GMT

Overview

Concrete implementation of the SEANet encoder-decoder and Residual Vector Quantizer within Audiocraft. The SEANetEncoder and SEANetDecoder classes provide the convolutional backbone for audio compression, while ResidualVectorQuantizer provides the multi-level discrete bottleneck. Together these form the core components of the EnCodec model.

Description

The SEANet encoder is built from stages, each a stack of residual blocks followed by a strided convolution, that progressively downsample the input waveform. The decoder mirrors this structure using transposed convolutions for upsampling. The RVQ module wraps an inner ResidualVectorQuantization (from core_vq.py) that applies K = n_q layers of VectorQuantization in sequence, each with its own EuclideanCodebook and each quantizing the residual left by the previous layer.
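The residual quantization idea described above can be illustrated with a minimal pure-Python sketch. It is not the Audiocraft code: the helper names (`nearest`, `rvq_encode`, `rvq_decode`), the random codebooks, and the tiny sizes are assumptions chosen for illustration, and it omits k-means initialization, EMA updates, and the commitment loss.

```python
import random

def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(codebooks, vec):
    """Quantize vec layer by layer, passing each layer's residual to the next."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry; the next layer quantizes what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codebooks, codes):
    """Sum the selected entry of every codebook to rebuild the approximation."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical toy sizes: 4 codebooks of 16 entries in 3 dimensions.
random.seed(0)
K, BINS, DIM = 4, 16, 3
codebooks = [
    [[random.gauss(0.0, 0.5 ** k) for _ in range(DIM)] for _ in range(BINS)]
    for k in range(K)
]

vec = [0.3, -1.2, 0.7]
codes = rvq_encode(codebooks, vec)     # one index per codebook
approx = rvq_decode(codebooks, codes)  # sum of the selected entries
```

Each input vector is thus represented by K indices, which corresponds to the `[B, K, T']` code layout used by the real quantizer.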

The encoder reverses the provided ratios internally (so the decoder ratios [8, 5, 4, 2] become encoder downsampling ratios [2, 4, 5, 8]), which keeps the encoder and decoder symmetric. Each encoder stage doubles the channel count, starting from n_filters; each decoder stage halves it, starting from n_filters * 2^len(ratios) and ending at n_filters.
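The ratio reversal and channel progression can be worked through with a short sketch. The function name `seanet_layout` is hypothetical and the code only reproduces the arithmetic described above; it does not build any layers.

```python
from math import prod

def seanet_layout(ratios, n_filters=32):
    """Compute per-stage strides and channel widths for a given ratio list."""
    # The encoder applies the ratios in reverse order.
    encoder_strides = list(reversed(ratios))
    # Channel width before stage 0, then after each doubling stage.
    encoder_channels = [n_filters * 2 ** i for i in range(len(ratios) + 1)]
    # The decoder walks the same widths in the opposite direction.
    decoder_channels = list(reversed(encoder_channels))
    # Total downsampling factor (input samples per latent frame).
    hop_length = prod(ratios)
    return encoder_strides, encoder_channels, decoder_channels, hop_length

strides, enc_ch, dec_ch, hop = seanet_layout([8, 5, 4, 2], n_filters=32)
print(strides)  # [2, 4, 5, 8]
print(enc_ch)   # [32, 64, 128, 256, 512]
print(dec_ch)   # [512, 256, 128, 64, 32]
print(hop)      # 320
```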

Usage

Import when building or inspecting an EnCodec model:

from audiocraft.modules.seanet import SEANetEncoder, SEANetDecoder
from audiocraft.quantization.vq import ResidualVectorQuantizer

These components are typically instantiated by models.builders.get_compression_model() from a Hydra config, but can also be used directly for custom architectures.

Code Reference

Source Location

Repository: facebookresearch/audiocraft
File: audiocraft/modules/seanet.py (lines 63--153 for encoder, lines 156--258 for decoder)
File: audiocraft/quantization/vq.py (lines 16--115 for ResidualVectorQuantizer)
File: audiocraft/quantization/core_vq.py (lines 351--404 for ResidualVectorQuantization, lines 87--219 for EuclideanCodebook)

Signature

class SEANetEncoder(nn.Module):
    def __init__(
        self,
        channels: int = 1,
        dimension: int = 128,
        n_filters: int = 32,
        n_residual_layers: int = 3,
        ratios: List[int] = [8, 5, 4, 2],
        activation: str = 'ELU',
        activation_params: dict = {'alpha': 1.0},
        norm: str = 'none',
        norm_params: Dict[str, Any] = {},
        kernel_size: int = 7,
        last_kernel_size: int = 7,
        residual_kernel_size: int = 3,
        dilation_base: int = 2,
        causal: bool = False,
        pad_mode: str = 'reflect',
        true_skip: bool = True,
        compress: int = 2,
        lstm: int = 0,
        disable_norm_outer_blocks: int = 0,
    ):
        ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, T] -> output: [B, dimension, T']
        ...


class SEANetDecoder(nn.Module):
    def __init__(
        self,
        channels: int = 1,
        dimension: int = 128,
        n_filters: int = 32,
        n_residual_layers: int = 3,
        ratios: List[int] = [8, 5, 4, 2],
        activation: str = 'ELU',
        activation_params: dict = {'alpha': 1.0},
        final_activation: Optional[str] = None,
        final_activation_params: Optional[dict] = None,
        norm: str = 'none',
        norm_params: Dict[str, Any] = {},
        kernel_size: int = 7,
        last_kernel_size: int = 7,
        residual_kernel_size: int = 3,
        dilation_base: int = 2,
        causal: bool = False,
        pad_mode: str = 'reflect',
        true_skip: bool = True,
        compress: int = 2,
        lstm: int = 0,
        disable_norm_outer_blocks: int = 0,
        trim_right_ratio: float = 1.0,
    ):
        ...

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: [B, dimension, T'] -> output: [B, channels, T]
        ...


class ResidualVectorQuantizer(BaseQuantizer):
    def __init__(
        self,
        dimension: int = 256,
        n_q: int = 8,
        q_dropout: bool = False,
        bins: int = 1024,
        decay: float = 0.99,
        kmeans_init: bool = True,
        kmeans_iters: int = 10,
        threshold_ema_dead_code: float = 2.,
        orthogonal_reg_weight: float = 0.0,
        orthogonal_reg_active_codes_only: bool = False,
        orthogonal_reg_max_codes: Optional[int] = None,
    ):
        ...

    def forward(self, x: torch.Tensor, frame_rate: int) -> QuantizedResult:
        ...

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        ...

    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        ...

Import

from audiocraft.modules.seanet import SEANetEncoder, SEANetDecoder
from audiocraft.quantization.vq import ResidualVectorQuantizer

I/O Contract

Inputs

**Input Contract**
Name	Type	Description
`x` (encoder)	`torch.Tensor [B, C, T]`	Raw audio waveform. `B` = batch size, `C` = audio channels (typically 1 for mono), `T` = number of samples.
`z` (decoder)	`torch.Tensor [B, D, T']`	Quantized latent representation. `D` = `dimension`, `T'` = `T / prod(ratios)`.
`x` (RVQ forward)	`torch.Tensor [B, D, T']`	Continuous encoder output to be quantized.
`frame_rate` (RVQ forward)	`int`	Token frame rate in Hz, used for bandwidth calculation.
`codes` (RVQ decode)	`torch.Tensor [B, K, T']`	Discrete codes from `K` codebooks.

Outputs

**Output Contract**
Name	Type	Description
Encoder output	`torch.Tensor [B, D, T']`	Continuous latent representation. `T' = T / prod(ratios)`; with default ratios `[8,5,4,2]`, stride = 320, so 32kHz audio yields 100Hz tokens.
Decoder output	`torch.Tensor [B, C, T]`	Reconstructed audio waveform at the original sample rate and channel count.
RVQ forward	`QuantizedResult`	Dataclass containing: `x` (quantized tensor `[B, D, T']`), `codes` (discrete indices `[B, K, T']`), `bandwidth` (tensor, kbps), `penalty` (commitment loss).
RVQ encode	`torch.Tensor [B, K, T']`	Discrete codebook indices for all `K` quantizers.
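The frame-rate and bandwidth figures in the tables follow from simple arithmetic, sketched below. The helper names are hypothetical, and the bandwidth formula assumes each code costs log2(bins) bits per frame per codebook, which is the usual accounting for an RVQ bottleneck.

```python
import math

def frame_rate_hz(sample_rate, ratios):
    """Latent frames per second: the sample rate divided by the total stride."""
    return sample_rate / math.prod(ratios)

def bandwidth_kbps(n_q, bins, frame_rate):
    """Bitrate in kbps, assuming log2(bins) bits per code and n_q codes per frame."""
    return n_q * math.log2(bins) * frame_rate / 1000

fr = frame_rate_hz(32000, [8, 5, 4, 2])         # 32000 / 320 = 100.0 Hz
bw = bandwidth_kbps(n_q=8, bins=1024, frame_rate=fr)  # 8 * 10 * 100 / 1000 = 8.0 kbps
print(fr, bw)
```

With the default ratios and 1024-entry codebooks, each of the 8 codebooks contributes 1 kbps at a 100 Hz frame rate.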

Usage Examples

Example 1: Encoding Audio to Discrete Tokens

Encoding raw audio through the SEANet encoder and RVQ to produce discrete codes.

import torch
from audiocraft.modules.seanet import SEANetEncoder
from audiocraft.quantization.vq import ResidualVectorQuantizer

encoder = SEANetEncoder(channels=1, dimension=128, n_filters=32, ratios=[8, 5, 4, 2])
quantizer = ResidualVectorQuantizer(dimension=128, n_q=8, bins=1024)

# Raw mono audio at 32kHz, 1 second
audio = torch.randn(1, 1, 32000)

# Encode to continuous latent: [1, 128, 100]
latent = encoder(audio)

# Quantize: produces codes [1, 8, 100] at 100Hz frame rate
qres = quantizer(latent, frame_rate=100)
codes = qres.codes          # [1, 8, 100] -- 8 codebooks, 100 frames
bandwidth = qres.bandwidth  # bandwidth in kbps

Example 2: Decoding Tokens Back to Audio

Reconstructing audio from discrete codes through the RVQ decoder and SEANet decoder.

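import torch
from audiocraft.modules.seanet import SEANetDecoder
from audiocraft.quantization.vq import ResidualVectorQuantizer

decoder = SEANetDecoder(channels=1, dimension=128, n_filters=32, ratios=[8, 5, 4, 2])
# In practice, reuse the quantizer that produced the codes (e.g. from Example 1):
# a freshly constructed one holds different, untrained codebooks.
quantizer = ResidualVectorQuantizer(dimension=128, n_q=8, bins=1024)

# Placeholder codes so the snippet runs standalone: [B, K, T'] indices in [0, bins)
codes = torch.randint(0, 1024, (1, 8, 100))

# Decode discrete codes back to continuous latent
quantized_latent = quantizer.decode(codes)   # [1, 128, 100]

# Decode latent to waveform
reconstructed_audio = decoder(quantized_latent)  # [1, 1, 32000]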
