Principle: Alibaba MNN Weight Quantization
| Field | Value |
|---|---|
| Principle Name | Weight_Quantization |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Reducing model size by quantizing weight parameters to lower bit-widths |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Weight quantization is the most widely used post-training compression technique in MNN. It reduces model size by representing floating-point weight parameters with fewer bits (2-8 bits) while preserving the model's inference behavior. Unlike full-graph quantization, weight quantization only compresses the weight storage -- activations remain in floating-point during inference unless dynamic quantization is separately enabled.
This approach is attractive because it requires no calibration data, operates as a single-command transformation, and achieves roughly 75% (8-bit) to 93.75% (2-bit) model size reduction depending on the target bit-width.
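As a concrete illustration of the single-command workflow, an invocation might look like the sketch below. The `--weightQuantBits` flag, the ONNX input format, and the file names are assumptions not stated in this document; `--weightQuantAsymmetric` and `--weightQuantBlock` are the options described later in this section.

```shell
# Hypothetical example -- verify flag names against your MNNConvert build.
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_int4.mnn \
    --weightQuantBits 4 --weightQuantAsymmetric --weightQuantBlock 64
```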
Theoretical Foundation
Uniform Quantization
The core operation maps a continuous floating-point weight value to a discrete integer representation within a fixed range. For a given bit-width b, the quantization formula is:
quantized = round(clip(w / scale + zero_point, 0, 2^b - 1))
Where:
- w is the original floating-point weight value
- scale is the quantization step size, computed as (max_w - min_w) / (2^b - 1)
- zero_point is the integer value that maps to floating-point zero
- b is the target bit-width (2, 4, 8, etc.)
Dequantization reconstructs an approximation of the original weight:
w_approx = (quantized - zero_point) * scale
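The quantize/dequantize round trip above can be sketched in NumPy. The helper names and the example tensor are illustrative, not part of MNN; the parameters here follow the asymmetric min/max convention defined in the formulas.

```python
import numpy as np

def quantize(w, bits, scale, zero_point):
    """Map float weights to integers in [0, 2^bits - 1]."""
    qmax = 2 ** bits - 1
    return np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    """Reconstruct an approximation of the original weights."""
    return (q - zero_point) * scale

# Per-tensor parameters from min/max statistics (asymmetric convention).
w = np.array([-1.0, -0.25, 0.0, 0.5, 1.5], dtype=np.float32)
bits = 8
scale = (w.max() - w.min()) / (2 ** bits - 1)
zero_point = round(-w.min() / scale)

q = quantize(w, bits, scale, zero_point)
w_approx = dequantize(q, scale, zero_point)
```

Because rounding moves each value by at most half a step, the reconstruction error per weight is bounded by scale / 2.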
Symmetric vs. Asymmetric Quantization
MNN supports two quantization schemes:
- Symmetric quantization (default) -- The zero point is fixed at 0, and the scale is determined by the maximum absolute value of the weight tensor: scale = max(|w|) / (2^(b-1) - 1). This is simpler and compatible with older MNN versions.
- Asymmetric quantization (--weightQuantAsymmetric) -- Both scale and zero point are computed to minimize the quantization range. This can improve accuracy for weight distributions that are not centered around zero, but requires a newer MNN runtime.
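The difference between the two schemes can be seen by computing their parameters for an off-center weight distribution; this is a standalone sketch, not MNN code, using the scale formulas given above.

```python
import numpy as np

def symmetric_params(w, bits):
    """zero_point fixed at 0; range set by the largest magnitude."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return scale, 0

def asymmetric_params(w, bits):
    """scale and zero_point chosen to cover exactly [min, max]."""
    scale = (w.max() - w.min()) / (2 ** bits - 1)
    zero_point = round(-w.min() / scale)
    return scale, zero_point

# Weights centered at 0.8: the symmetric grid wastes its negative half.
w = np.random.default_rng(0).normal(loc=0.8, scale=0.1, size=1024).astype(np.float32)
s_sym, _ = symmetric_params(w, 4)
s_asym, _ = asymmetric_params(w, 4)
# s_asym < s_sym: the asymmetric grid has a finer step for this tensor.
```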
Block-Wise Quantization
Rather than computing a single scale and zero point per output channel (channel-wise quantization), block-wise quantization divides each weight channel into fixed-size blocks and computes separate quantization parameters for each block. This increases the model size slightly (due to additional scale/zero-point storage) but significantly improves accuracy by allowing finer-grained representation.
- Channel-wise (--weightQuantBlock -1, default) -- One scale per output channel.
- Block-wise (--weightQuantBlock 32-256) -- One scale per block of weights. Smaller blocks yield higher accuracy but larger overhead. Recommended range: 32 to 128.
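The accuracy benefit of finer blocks can be demonstrated on a single channel; the function below is an illustrative sketch (the names and block layout are assumptions, not MNN internals), and it assumes the channel length divides evenly into blocks.

```python
import numpy as np

def blockwise_quantize(channel, bits, block_size):
    """Quantize one weight channel with separate (scale, zero_point) per block."""
    qmax = 2 ** bits - 1
    blocks = channel.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    maxs = blocks.max(axis=1, keepdims=True)
    scales = (maxs - mins) / qmax
    zero_points = np.round(-mins / scales)
    q = np.clip(np.round(blocks / scales + zero_points), 0, qmax)
    # Dequantize to measure the mean reconstruction error.
    approx = (q - zero_points) * scales
    mean_err = np.abs(approx - blocks).mean()
    return q.astype(np.int32), scales, zero_points, mean_err

channel = np.random.default_rng(1).normal(size=256).astype(np.float32)

# Finer blocks: more scale/zero-point overhead, smaller reconstruction error.
_, _, _, err_block32 = blockwise_quantize(channel, 4, 32)
_, _, _, err_channel = blockwise_quantize(channel, 4, 256)  # one block = channel-wise
```

Each 32-element block adapts its scale to the local value range, so its quantization step is typically much smaller than the single step derived from the whole channel's min/max.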
HQQ (Half-Quadratic Quantization)
HQQ is an advanced quantization method that optimizes the quantization grid to minimize the reconstruction error. Instead of using simple min/max statistics, HQQ formulates quantization as an optimization problem and iteratively refines the scale and zero-point parameters. This approach:
- Increases quantization time compared to standard methods
- Generally improves accuracy, particularly at lower bit-widths (4-bit and below)
- Requires asymmetric quantization (automatically enabled when --hqq is set)
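The general idea of optimization-based quantization can be illustrated with a toy alternating scheme: re-round with the current parameters, then solve for the zero point that minimizes the reconstruction error in closed form. This is only a simplified sketch of "iteratively refining the parameters" and is not the actual HQQ update rules, which use half-quadratic splitting.

```python
import numpy as np

def mse(w, scale, zp, qmax):
    """Mean squared reconstruction error for given quantization parameters."""
    q = np.clip(np.round(w / scale + zp), 0, qmax)
    return np.mean(((q - zp) * scale - w) ** 2)

def refine_zero_point(w, bits, iters=20):
    """Toy alternating refinement (illustrative only, not HQQ itself)."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    zp = -w.min() / scale                      # min/max starting point
    for _ in range(iters):
        q = np.clip(np.round(w / scale + zp), 0, qmax)
        # Closed-form zero point minimizing ||(q - zp) * scale - w||^2:
        zp = np.mean(q - w / scale)
    return scale, zp

w = np.random.default_rng(2).normal(size=512).astype(np.float32)
bits = 3
qmax = 2 ** bits - 1
scale0 = (w.max() - w.min()) / qmax
zp0 = -w.min() / scale0
scale_r, zp_r = refine_zero_point(w, bits)

err0 = mse(w, scale0, zp0, qmax)   # error with plain min/max statistics
err_r = mse(w, scale_r, zp_r, qmax)  # error after refinement (never larger)
```

Each half-step (rounding, then the least-squares zero-point update) can only decrease the objective, so the refined error is never worse than the min/max starting point.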
Bit-Width Selection
The choice of bit-width determines the compression-accuracy trade-off:
- 8-bit -- Safe for virtually all models. ~75% size reduction. Minimal accuracy degradation.
- 4-bit -- Effective for large models (LLMs, large CNNs). ~87.5% size reduction. Noticeable accuracy loss for small models; block-wise quantization and HQQ strongly recommended.
- 2-bit -- Only suitable for very large models with extreme parameter redundancy. ~93.75% size reduction. Significant accuracy loss.
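The size-reduction figures above follow directly from the ratio of the target bit-width to 32-bit floats, ignoring the small scale/zero-point overhead; a quick check:

```python
def size_reduction(bits, float_bits=32):
    """Fraction of weight storage saved by b-bit integers vs. float32."""
    return 1 - bits / float_bits

for b in (8, 4, 2):
    print(f"{b}-bit: {size_reduction(b):.2%} smaller")
# 8-bit -> 75.00%, 4-bit -> 87.50%, 2-bit -> 93.75%
```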
Relationship to Other Principles
- Compression_Tool_Setup -- The MNNConvert tool that implements weight quantization must first be built.
- Compression_Strategy_Selection -- Weight quantization is one option within the broader strategy decision framework.
- Dynamic_Quantization -- Enables runtime speed improvements for weight-quantized models.
- Compression_Validation -- Validates that the quantized model meets accuracy requirements.