Implementation:Facebookresearch Audiocraft SEANet and RVQ

Metadata
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete implementation of the SEANet encoder-decoder and Residual Vector Quantizer within Audiocraft. The SEANetEncoder and SEANetDecoder classes provide the convolutional backbone for audio compression, while ResidualVectorQuantizer provides the multi-level discrete bottleneck. Together these form the core components of the EnCodec model.

Description

The SEANet encoder is a stack of residual blocks followed by strided convolutions that progressively downsample the input waveform. The decoder mirrors this structure using transposed convolutions for upsampling. The RVQ module wraps an inner ResidualVectorQuantization (from core_vq.py) that applies K layers of VectorQuantization, each with its own EuclideanCodebook.
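As a rough illustration of the residual scheme described above, here is a minimal pure-Python sketch. It is not Audiocraft's code: the function names (nearest, rvq_encode, rvq_decode) and the toy codebook values are hypothetical. The real EuclideanCodebook also handles k-means initialization and EMA codebook updates, which are left out here. Each layer picks the codebook entry nearest to the current residual and passes what remains to the next layer.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], v: Vector) -> int:
    """Return the index of the entry with the smallest squared Euclidean distance to v."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(codebooks: Sequence[Sequence[Vector]], v: Vector) -> List[int]:
    """Quantize v with K codebooks; layer k quantizes the residual left by layers < k."""
    residual = list(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        # Subtract the chosen entry so the next layer only sees the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codebooks: Sequence[Sequence[Vector]], codes: Sequence[int]) -> List[float]:
    """Reconstruct the vector as the sum of the selected entries from every layer."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (hypothetical values): K = 2 layers, 4 entries each, dimension 2.
codebooks = [
    [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)],      # coarse layer
    [(0.25, 0.0), (0.0, 0.25), (-0.25, 0.0), (0.0, -0.25)],  # fine layer
]
x = (0.9, 0.2)
codes = rvq_encode(codebooks, x)      # one index per layer
x_hat = rvq_decode(codebooks, codes)  # sum of the selected entries
```

In this toy case the second layer reduces the squared reconstruction error from 0.05 to 0.0125, which is the motivation for stacking several quantizers.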

The encoder reverses the provided ratios internally, so ratios given in decoder order [8, 5, 4, 2] become encoder downsampling strides [2, 4, 5, 8]; this keeps the encoder and decoder symmetric. Each encoder stage doubles the channel count, starting from n_filters. Each decoder stage halves it, starting from n_filters * 2^len(ratios) and ending at n_filters.
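The stride and channel bookkeeping can be sketched in plain Python. This is only an illustration under the layout described above (one strided convolution per ratio, with the channel count doubling in the encoder and halving in the decoder); the helper names encoder_stages and decoder_stages are hypothetical and do not exist in Audiocraft.

```python
from math import prod
from typing import List, Sequence, Tuple

Stage = Tuple[int, int, int]  # (stride, in_channels, out_channels)


def encoder_stages(n_filters: int, ratios: Sequence[int]) -> List[Stage]:
    """Downsampling stages: ratios are applied in reversed order, channels double."""
    stages = []
    mult = 1
    for stride in reversed(ratios):
        stages.append((stride, mult * n_filters, mult * n_filters * 2))
        mult *= 2
    return stages


def decoder_stages(n_filters: int, ratios: Sequence[int]) -> List[Stage]:
    """Upsampling stages: ratios are applied as given, channels halve."""
    stages = []
    mult = 2 ** len(ratios)  # assumed starting width: n_filters * 2^len(ratios)
    for stride in ratios:
        stages.append((stride, mult * n_filters, mult * n_filters // 2))
        mult //= 2
    return stages


ratios = (8, 5, 4, 2)
enc = encoder_stages(32, ratios)
dec = decoder_stages(32, ratios)
hop_length = prod(ratios)  # total downsampling factor of the encoder

for stride, c_in, c_out in enc:
    print(f"encoder: stride {stride}, {c_in} -> {c_out} channels")
for stride, c_in, c_out in dec:
    print(f"decoder: stride {stride}, {c_in} -> {c_out} channels")
print("hop length:", hop_length)
```

With the default arguments the encoder widens from 32 to 512 channels while downsampling by a total factor of 320, and the decoder walks the same path in reverse.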

Usage

Import when building or inspecting an EnCodec model:

from audiocraft.modules.seanet import SEANetEncoder, SEANetDecoder
from audiocraft.quantization.vq import ResidualVectorQuantizer

These components are typically instantiated by models.builders.get_compression_model() from a Hydra config, but can also be used directly for custom architectures.

Code Reference

Source Location

  • Repository: facebookresearch/audiocraft
  • File: audiocraft/modules/seanet.py (lines 63-153 for encoder, lines 156-258 for decoder)
  • File: audiocraft/quantization/vq.py (lines 16-115 for ResidualVectorQuantizer)
  • File: audiocraft/quantization/core_vq.py (lines 351-404 for ResidualVectorQuantization, lines 87-219 for EuclideanCodebook)

Signature

class SEANetEncoder(nn.Module):
    def __init__(
        self,
        channels: int = 1,
        dimension: int = 128,
        n_filters: int = 32,
        n_residual_layers: int = 3,
        ratios: List[int] = [8, 5, 4, 2],
        activation: str = 'ELU',
        activation_params: dict = {'alpha': 1.0},
        norm: str = 'none',
        norm_params: Dict[str, Any] = {},
        kernel_size: int = 7,
        last_kernel_size: int = 7,
        residual_kernel_size: int = 3,
        dilation_base: int = 2,
        causal: bool = False,
        pad_mode: str = 'reflect',
        true_skip: bool = True,
        compress: int = 2,
        lstm: int = 0,
        disable_norm_outer_blocks: int = 0,
    ):
        ...

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, T] -> output: [B, dimension, T']
        ...


class SEANetDecoder(nn.Module):
    def __init__(
        self,
        channels: int = 1,
        dimension: int = 128,
        n_filters: int = 32,
        n_residual_layers: int = 3,
        ratios: List[int] = [8, 5, 4, 2],
        activation: str = 'ELU',
        activation_params: dict = {'alpha': 1.0},
        final_activation: Optional[str] = None,
        final_activation_params: Optional[dict] = None,
        norm: str = 'none',
        norm_params: Dict[str, Any] = {},
        kernel_size: int = 7,
        last_kernel_size: int = 7,
        residual_kernel_size: int = 3,
        dilation_base: int = 2,
        causal: bool = False,
        pad_mode: str = 'reflect',
        true_skip: bool = True,
        compress: int = 2,
        lstm: int = 0,
        disable_norm_outer_blocks: int = 0,
        trim_right_ratio: float = 1.0,
    ):
        ...

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: [B, dimension, T'] -> output: [B, channels, T]
        ...


class ResidualVectorQuantizer(BaseQuantizer):
    def __init__(
        self,
        dimension: int = 256,
        n_q: int = 8,
        q_dropout: bool = False,
        bins: int = 1024,
        decay: float = 0.99,
        kmeans_init: bool = True,
        kmeans_iters: int = 10,
        threshold_ema_dead_code: float = 2.,
        orthogonal_reg_weight: float = 0.0,
        orthogonal_reg_active_codes_only: bool = False,
        orthogonal_reg_max_codes: Optional[int] = None,
    ):
        ...

    def forward(self, x: torch.Tensor, frame_rate: int) -> QuantizedResult:
        ...

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        ...

    def decode(self, codes: torch.Tensor) -> torch.Tensor:
        ...

Import

from audiocraft.modules.seanet import SEANetEncoder, SEANetDecoder
from audiocraft.quantization.vq import ResidualVectorQuantizer

I/O Contract

Inputs

Input Contract

| Name | Type | Description |
|---|---|---|
| x (encoder) | torch.Tensor [B, C, T] | Raw audio waveform. B = batch size, C = audio channels (typically 1 for mono), T = number of samples. |
| z (decoder) | torch.Tensor [B, D, T'] | Quantized latent representation. D = dimension, T' = T / prod(ratios). |
| x (RVQ forward) | torch.Tensor [B, D, T'] | Continuous encoder output to be quantized. |
| frame_rate (RVQ forward) | int | Token frame rate in Hz, used for bandwidth calculation. |
| codes (RVQ decode) | torch.Tensor [B, K, T'] | Discrete codes from K codebooks. |

Outputs

Output Contract

| Name | Type | Description |
|---|---|---|
| Encoder output | torch.Tensor [B, D, T'] | Continuous latent representation. T' = T / prod(ratios); with the default ratios [8, 5, 4, 2] the total stride is 320, so 32 kHz audio yields 100 Hz tokens. |
| Decoder output | torch.Tensor [B, C, T] | Reconstructed audio waveform at the original sample rate and channel count. |
| RVQ forward | QuantizedResult | Dataclass whose fields include: x (quantized tensor [B, D, T']), codes (discrete indices [B, K, T']), bandwidth (tensor, kbps), penalty (commitment loss). |
| RVQ encode | torch.Tensor [B, K, T'] | Discrete codebook indices for all K quantizers. |
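The frame rate and bandwidth figures in the contract follow from simple arithmetic, sketched below. The helper names are hypothetical. The bandwidth formula (n_q codebooks, each costing log2(bins) bits per frame) is an assumption consistent with the description above, not a copy of the library code.

```python
import math
from typing import Sequence


def frame_rate_hz(sample_rate: int, ratios: Sequence[int]) -> float:
    """Latent frames per second: sample rate divided by the total encoder stride."""
    return sample_rate / math.prod(ratios)


def bandwidth_kbps(n_q: int, bins: int, frame_rate: float) -> float:
    """Assumed bitrate: n_q codes per frame, log2(bins) bits per code, in kbit/s."""
    return n_q * math.log2(bins) * frame_rate / 1000


rate = frame_rate_hz(32000, (8, 5, 4, 2))            # 32 kHz audio, default ratios
kbps = bandwidth_kbps(n_q=8, bins=1024, frame_rate=rate)
print(f"{rate:.0f} Hz frames, {kbps:.1f} kbps")
```

Under these assumptions, halving n_q halves the bitrate, which is why the number of active codebooks is the usual knob for trading quality against bandwidth.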

Usage Examples

Example 1: Encoding Audio to Discrete Tokens

Encoding raw audio through the SEANet encoder and RVQ to produce discrete codes.

import torch
from audiocraft.modules.seanet import SEANetEncoder
from audiocraft.quantization.vq import ResidualVectorQuantizer

encoder = SEANetEncoder(channels=1, dimension=128, n_filters=32, ratios=[8, 5, 4, 2])
quantizer = ResidualVectorQuantizer(dimension=128, n_q=8, bins=1024)

# Raw mono audio at 32kHz, 1 second
audio = torch.randn(1, 1, 32000)

# Encode to continuous latent: [1, 128, 100]
latent = encoder(audio)

# Quantize: produces codes [1, 8, 100] at 100Hz frame rate
qres = quantizer(latent, frame_rate=100)
codes = qres.codes          # [1, 8, 100] -- 8 codebooks, 100 frames
bandwidth = qres.bandwidth  # bandwidth in kbps

Example 2: Decoding Tokens Back to Audio

Reconstructing audio from discrete codes through the RVQ decoder and SEANet decoder.

import torch
from audiocraft.modules.seanet import SEANetDecoder

# Reuse `quantizer` and `codes` from Example 1: decoding must use the same
# codebooks that produced the codes, so do not create a fresh quantizer here.
decoder = SEANetDecoder(channels=1, dimension=128, n_filters=32, ratios=[8, 5, 4, 2])

with torch.no_grad():
    # Decode discrete codes back to continuous latent
    quantized_latent = quantizer.decode(codes)   # [1, 128, 100]

    # Decode latent to waveform
    reconstructed_audio = decoder(quantized_latent)  # [1, 1, 32000]
