Principle: Alibaba MNN Weight Quantization
| Field | Value |
|---|---|
| Principle Name | Weight_Quantization |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Reducing model size by quantizing weight parameters to lower bit-widths |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Weight quantization is the most widely used post-training compression technique in MNN. It reduces model size by representing floating-point weight parameters with fewer bits (2-8 bits) while preserving the model's inference behavior. Unlike full-graph quantization, weight quantization only compresses the weight storage -- activations remain in floating-point during inference unless dynamic quantization is separately enabled.
This approach is attractive because it requires no calibration data, operates as a single-command transformation, and achieves roughly 75% (8-bit) to 93.75% (2-bit) model size reduction depending on the target bit-width.
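As a concrete illustration of the single-command workflow, an invocation might look like the sketch below. The `--weightQuantBits` flag, the ONNX input format, and the file names are assumptions not stated in this document; `--weightQuantAsymmetric` and `--weightQuantBlock` are the options described later in this section.

```shell
# Hypothetical example -- verify flag names against your MNNConvert build.
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_int4.mnn \
    --weightQuantBits 4 --weightQuantAsymmetric --weightQuantBlock 64
```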
Theoretical Foundation
Uniform Quantization
The core operation maps a continuous floating-point weight value to a discrete integer representation within a fixed range. For a given bit-width b, the quantization formula is:
quantized = round(clip(w / scale + zero_point, 0, 2^b - 1))
Where:
- w is the original floating-point weight value
- scale is the quantization step size, computed as (max_w - min_w) / (2^b - 1)
- zero_point is the integer value that maps to floating-point zero
- b is the target bit-width (2, 4, 8, etc.)
Dequantization reconstructs an approximation of the original weight:
w_approx = (quantized - zero_point) * scale
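The quantize/dequantize round trip above can be sketched in NumPy. The helper names and the example tensor are illustrative, not part of MNN; the parameters here follow the asymmetric min/max convention defined in the formulas.

```python
import numpy as np

def quantize(w, bits, scale, zero_point):
    """Map float weights to integers in [0, 2^bits - 1]."""
    qmax = 2 ** bits - 1
    return np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    """Reconstruct an approximation of the original weights."""
    return (q - zero_point) * scale

# Per-tensor parameters from min/max statistics (asymmetric convention).
w = np.array([-1.0, -0.25, 0.0, 0.5, 1.5], dtype=np.float32)
bits = 8
scale = (w.max() - w.min()) / (2 ** bits - 1)
zero_point = round(-w.min() / scale)

q = quantize(w, bits, scale, zero_point)
w_approx = dequantize(q, scale, zero_point)
```

Because rounding moves each value by at most half a step, the reconstruction error per weight is bounded by scale / 2.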
Symmetric vs. Asymmetric Quantization
MNN supports two quantization schemes:
- Symmetric quantization (default) -- The zero point is fixed at 0, and the scale is determined by the maximum absolute value of the weight tensor: scale = max(|w|) / (2^(b-1) - 1). This is simpler and compatible with older MNN versions.
- Asymmetric quantization (--weightQuantAsymmetric) -- Both scale and zero point are computed to minimize the quantization range. This can improve accuracy for weight distributions that are not centered around zero, but requires a newer MNN runtime.
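The difference between the two schemes can be seen by computing their parameters for an off-center weight distribution; this is a standalone sketch, not MNN code, using the scale formulas given above.

```python
import numpy as np

def symmetric_params(w, bits):
    """zero_point fixed at 0; range set by the largest magnitude."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return scale, 0

def asymmetric_params(w, bits):
    """scale and zero_point chosen to cover exactly [min, max]."""
    scale = (w.max() - w.min()) / (2 ** bits - 1)
    zero_point = round(-w.min() / scale)
    return scale, zero_point

# Weights centered at 0.8: the symmetric grid wastes its negative half.
w = np.random.default_rng(0).normal(loc=0.8, scale=0.1, size=1024).astype(np.float32)
s_sym, _ = symmetric_params(w, 4)
s_asym, _ = asymmetric_params(w, 4)
# s_asym < s_sym: the asymmetric grid has a finer step for this tensor.
```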
Block-Wise Quantization
Rather than computing a single scale and zero point per output channel (channel-wise quantization), block-wise quantization divides each weight channel into fixed-size blocks and computes separate quantization parameters for each block. This increases the model size slightly (due to additional scale/zero-point storage) but significantly improves accuracy by allowing finer-grained representation.
- Channel-wise (--weightQuantBlock -1, default) -- One scale per output channel.
- Block-wise (--weightQuantBlock 32-256) -- One scale per block of weights. Smaller blocks yield higher accuracy but larger overhead. Recommended range: 32 to 128.
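The accuracy benefit of finer blocks can be demonstrated on a single channel; the function below is an illustrative sketch (the names and block layout are assumptions, not MNN internals), and it assumes the channel length divides evenly into blocks.

```python
import numpy as np

def blockwise_quantize(channel, bits, block_size):
    """Quantize one weight channel with separate (scale, zero_point) per block."""
    qmax = 2 ** bits - 1
    blocks = channel.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    maxs = blocks.max(axis=1, keepdims=True)
    scales = (maxs - mins) / qmax
    zero_points = np.round(-mins / scales)
    q = np.clip(np.round(blocks / scales + zero_points), 0, qmax)
    # Dequantize to measure the mean reconstruction error.
    approx = (q - zero_points) * scales
    mean_err = np.abs(approx - blocks).mean()
    return q.astype(np.int32), scales, zero_points, mean_err

channel = np.random.default_rng(1).normal(size=256).astype(np.float32)

# Finer blocks: more scale/zero-point overhead, smaller reconstruction error.
_, _, _, err_block32 = blockwise_quantize(channel, 4, 32)
_, _, _, err_channel = blockwise_quantize(channel, 4, 256)  # one block = channel-wise
```

Each 32-element block adapts its scale to the local value range, so its quantization step is typically much smaller than the single step derived from the whole channel's min/max.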
HQQ (Half-Quadratic Quantization)
HQQ is an advanced quantization method that optimizes the quantization grid to minimize the reconstruction error. Instead of using simple min/max statistics, HQQ formulates quantization as an optimization problem and iteratively refines the scale and zero-point parameters. This approach:
- Increases quantization time compared to standard methods
- Generally improves accuracy, particularly at lower bit-widths (4-bit and below)
- Requires asymmetric quantization (automatically enabled when --hqq is set)
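The general idea of optimization-based quantization can be illustrated with a toy alternating scheme: re-round with the current parameters, then solve for the zero point that minimizes the reconstruction error in closed form. This is only a simplified sketch of "iteratively refining the parameters" and is not the actual HQQ update rules, which use half-quadratic splitting.

```python
import numpy as np

def mse(w, scale, zp, qmax):
    """Mean squared reconstruction error for given quantization parameters."""
    q = np.clip(np.round(w / scale + zp), 0, qmax)
    return np.mean(((q - zp) * scale - w) ** 2)

def refine_zero_point(w, bits, iters=20):
    """Toy alternating refinement (illustrative only, not HQQ itself)."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    zp = -w.min() / scale                      # min/max starting point
    for _ in range(iters):
        q = np.clip(np.round(w / scale + zp), 0, qmax)
        # Closed-form zero point minimizing ||(q - zp) * scale - w||^2:
        zp = np.mean(q - w / scale)
    return scale, zp

w = np.random.default_rng(2).normal(size=512).astype(np.float32)
bits = 3
qmax = 2 ** bits - 1
scale0 = (w.max() - w.min()) / qmax
zp0 = -w.min() / scale0
scale_r, zp_r = refine_zero_point(w, bits)

err0 = mse(w, scale0, zp0, qmax)   # error with plain min/max statistics
err_r = mse(w, scale_r, zp_r, qmax)  # error after refinement (never larger)
```

Each half-step (rounding, then the least-squares zero-point update) can only decrease the objective, so the refined error is never worse than the min/max starting point.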
Bit-Width Selection
The choice of bit-width determines the compression-accuracy trade-off:
- 8-bit -- Safe for virtually all models. ~75% size reduction. Minimal accuracy degradation.
- 4-bit -- Effective for large models (LLMs, large CNNs). ~87.5% size reduction. Noticeable accuracy loss for small models; block-wise quantization and HQQ strongly recommended.
- 2-bit -- Only suitable for very large models with extreme parameter redundancy. ~93.75% size reduction. Significant accuracy loss.
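The size-reduction figures above follow directly from the ratio of the target bit-width to 32-bit floats, ignoring the small scale/zero-point overhead; a quick check:

```python
def size_reduction(bits, float_bits=32):
    """Fraction of weight storage saved by b-bit integers vs. float32."""
    return 1 - bits / float_bits

for b in (8, 4, 2):
    print(f"{b}-bit: {size_reduction(b):.2%} smaller")
# 8-bit -> 75.00%, 4-bit -> 87.50%, 2-bit -> 93.75%
```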
Relationship to Other Principles
- Compression_Tool_Setup -- The MNNConvert tool that implements weight quantization must first be built.
- Compression_Strategy_Selection -- Weight quantization is one option within the broader strategy decision framework.
- Dynamic_Quantization -- Enables runtime speed improvements for weight-quantized models.
- Compression_Validation -- Validates that the quantized model meets accuracy requirements.