Principle:Zai org CogVideo Finite Scalar Quantization

Knowledge Sources	Finite Scalar Quantization: VQ-VAE Made Simple
Domains	Quantization, Representation_Learning, Autoencoding
Last Updated	2026-02-10 00:00 GMT

Overview

Finite Scalar Quantization (FSQ) is a quantization technique that discretizes each dimension of a latent vector independently into a fixed number of levels, creating an implicit codebook without learned embedding vectors.

Description

Traditional vector quantization (VQ) maintains an explicit codebook of embedding vectors and maps each encoder output to the nearest codebook entry. This approach suffers from codebook collapse (where many entries go unused), requires auxiliary commitment losses, and needs careful codebook initialization and update strategies.

FSQ simplifies this process by treating each dimension of the latent vector independently. Each dimension is bounded to a range determined by its number of quantization levels, then rounded to the nearest integer. The implicit codebook is the Cartesian product of the per-dimension value sets, so the total number of discrete codes is the product of the per-dimension level counts. For example, with levels [8, 6, 5], the implicit codebook contains 8 * 6 * 5 = 240 unique entries.
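The counting argument above can be checked by enumerating the implicit codebook directly. The following is a minimal sketch, not the CogVideo implementation: the helper name implicit_codebook is hypothetical, and each dimension is labeled with the level indices 0..L-1 purely for illustration.

```python
from itertools import product
from math import prod

def implicit_codebook(levels):
    """Enumerate every code as a tuple of per-dimension level indices (0..L-1).

    Hypothetical helper for illustration; FSQ never materializes this list.
    """
    return list(product(*(range(L) for L in levels)))

levels = [8, 6, 5]
codebook = implicit_codebook(levels)

# The number of codes equals the product of the per-dimension level counts.
print(len(codebook), prod(levels))
```

Nothing in this list is learned or stored by the model; the codebook exists only as the set of value combinations that rounding can produce.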

This approach avoids codebook collapse by construction: there are no learned codebook entries that can fall out of use, and every combination of per-dimension quantized values is reachable by the encoder. No auxiliary losses or codebook update mechanisms are needed, making FSQ significantly simpler to implement and train.

Usage

Use Finite Scalar Quantization when building discrete latent representations in autoencoders where training stability and simplicity are priorities. It is especially appropriate when codebook utilization problems arise with standard VQ approaches, or when the complexity of commitment losses and EMA codebook updates is undesirable.

Theoretical Basis

Bounding Function

The bounding operation maps continuous values to a finite range using the hyperbolic tangent function. For a dimension with L levels, and with eps a small positive constant, the half-width is computed as:

half_l = (L - 1) * (1 + eps) / 2

For dimensions with an even number of levels, an offset of 0.5 is applied (dimensions with an odd number of levels use an offset of 0) so that rounding the bounded value yields exactly L distinct integers. The bounding function is:

bound(z) = tanh(z + shift) * half_l - offset

where shift = atanh(offset / half_l). The shift compensates for the offset, so that bound(0) = 0.
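The formulas above can be sketched in plain Python to see which integers each dimension can take. This is an illustrative sketch rather than the library code: the function names bound and quantize, the default eps = 1e-3, and the requirement that every level count is at least 2 are assumptions made here.

```python
import math

def bound(z, levels, eps=1e-3):
    """Bound each entry of z using the formulas above.

    eps=1e-3 is an assumed default; every level count is assumed to be >= 2.
    """
    out = []
    for z_i, L in zip(z, levels):
        half_l = (L - 1) * (1 + eps) / 2
        offset = 0.5 if L % 2 == 0 else 0.0   # 0.5 only for even level counts
        shift = math.atanh(offset / half_l)   # makes bound(0) equal 0
        out.append(math.tanh(z_i + shift) * half_l - offset)
    return out

def quantize(z, levels):
    """Bound, then round each entry to the nearest integer."""
    return [round(b) for b in bound(z, levels)]

levels = [8, 6, 5]
print(quantize([100.0, 100.0, 100.0], levels))     # largest reachable codes
print(quantize([-100.0, -100.0, -100.0], levels))  # smallest reachable codes
```

With these formulas a dimension with 8 levels rounds to the integers -4..3 and a dimension with 5 levels to -2..2, so each dimension yields exactly L values.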

Straight-Through Estimator

Because rounding is piecewise constant, its gradient is zero almost everywhere and carries no training signal. FSQ therefore uses the straight-through estimator (STE): during the forward pass the value is rounded, but during the backward pass gradients flow through as if rounding had not occurred:

round_ste(z) = z + (round(z) - z).detach()

This allows gradient-based optimization of the encoder despite the discrete quantization step.
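The behavior of the estimator can be checked with a small PyTorch sketch. It assumes PyTorch is installed, and the input values are arbitrary examples; the function body is the expression given above.

```python
import torch

def round_ste(z: torch.Tensor) -> torch.Tensor:
    """Round in the forward pass; pass gradients through unchanged."""
    # The detached term contributes the rounding correction to the value
    # but is treated as a constant by autograd.
    return z + (torch.round(z) - z).detach()

z = torch.tensor([0.3, 1.7, -2.4], requires_grad=True)
y = round_ste(z)
y.sum().backward()

print(y)       # rounded values
print(z.grad)  # gradient of the sum with respect to z
```

The forward output matches torch.round(z) up to floating-point error, while the gradient with respect to z is 1 for every element, exactly as if the rounding step were the identity.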

Codebook Index Computation

Quantized codes are mapped to flat indices using a mixed-radix representation. Given levels [L_1, L_2, ..., L_d], a basis vector of length d is computed as the cumulative product [1, L_1, L_1*L_2, ..., L_1*...*L_(d-1)]. Each entry of the code is first shifted to a non-negative integer in [0, L_i - 1], giving code_shifted. The index for a code vector is the dot product of the shifted code with this basis:

index = sum(code_shifted * basis)

This provides a bijective mapping between quantized code vectors and integers in [0, product(levels)).
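The mapping and its inverse can be sketched in plain Python. This is a sketch under stated assumptions, not the CogVideo code: the function names are hypothetical, and the code is assumed to hold the rounded integers centered around zero, so that adding L // 2 to each entry produces the shifted digit in [0, L - 1].

```python
def make_basis(levels):
    """Cumulative-product basis [1, L_1, L_1*L_2, ...]."""
    basis, running = [], 1
    for L in levels:
        basis.append(running)
        running *= L
    return basis

def codes_to_index(code, levels):
    """Map a centered integer code to its flat index."""
    # Assumed shift: adding L // 2 moves each entry into 0..L-1.
    shifted = [c + L // 2 for c, L in zip(code, levels)]
    return sum(s * b for s, b in zip(shifted, make_basis(levels)))

def index_to_codes(index, levels):
    """Inverse mapping: recover the centered integer code from a flat index."""
    return [(index // b) % L - L // 2
            for b, L in zip(make_basis(levels), levels)]

levels = [8, 6, 5]
print(codes_to_index([-4, -3, -2], levels))  # smallest code
print(codes_to_index([3, 2, 2], levels))     # largest code
print(index_to_codes(239, levels))
```

Because each shifted entry is a valid digit for its radix, every index in [0, 240) decodes to exactly one code, which is the bijection described above.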

Related Pages

Implementation:Zai_org_CogVideo_FSQ
