
Principle:Alibaba MNN Weight Quantization

From Leeroopedia


Principle Name: Weight_Quantization
Topic: Model_Compression
Workflow: Model_Compression
Description: Reducing model size by quantizing weight parameters to lower bit-widths
Last Updated: 2026-02-10 14:00 GMT

Overview

Weight quantization is the most widely used post-training compression technique in MNN. It reduces model size by representing floating-point weight parameters with fewer bits (2-8 bits) while preserving the model's inference behavior. Unlike full-graph quantization, weight quantization only compresses the weight storage -- activations remain in floating-point during inference unless dynamic quantization is separately enabled.

This approach is attractive because it requires no calibration data, operates as a single-command transformation, and achieves 75-87% model size reduction depending on the target bit-width.
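As a single-command transformation, an invocation might look like the sketch below. The `--weightQuantAsymmetric`, `--weightQuantBlock`, and `--hqq` options are described later on this page; the framework, input/output, and `--weightQuantBits` flags follow MNNConvert's usual conventions but should be verified against your MNN build:

```shell
# Hypothetical invocation: convert an ONNX model with 8-bit weight
# quantization. Verify flag names against your MNNConvert version.
./MNNConvert -f ONNX \
    --modelFile model.onnx \
    --MNNModel model_q8.mnn \
    --weightQuantBits 8 \
    --weightQuantAsymmetric \
    --weightQuantBlock 64
```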

Theoretical Foundation

Uniform Quantization

The core operation maps a continuous floating-point weight value to a discrete integer representation within a fixed range. For a given bit-width b, the quantization formula is:

quantized = round(clip(w / scale + zero_point, 0, 2^b - 1))

Where:

  • w is the original floating-point weight value
  • scale is the quantization step size, computed as (max_w - min_w) / (2^b - 1)
  • zero_point is the integer value that maps to floating-point zero
  • b is the target bit-width (2, 4, 8, etc.)

Dequantization reconstructs an approximation of the original weight:

w_approx = (quantized - zero_point) * scale
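The quantize/dequantize round trip above can be sketched in a few lines of NumPy. This is a minimal illustration of the formulas, not MNN's implementation; the function names are invented for the example:

```python
import numpy as np

def quantize(w, bits):
    """Asymmetric uniform quantization of a weight tensor."""
    qmax = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax           # quantization step size
    zero_point = round(-w_min / scale)       # integer that maps to float 0.0
    q = np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.int32)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Reconstruct a floating-point approximation of the weights."""
    return (q - zero_point) * scale

w = np.array([-0.9, -0.2, 0.0, 0.4, 1.1], dtype=np.float32)
q, scale, zp = quantize(w, bits=8)
w_approx = dequantize(q, scale, zp)
# per-element reconstruction error stays on the order of the scale
```

Note that the reconstruction is lossy: each weight is snapped to the nearest point of a grid with step `scale`, which is why smaller scales (finer grids) mean lower error.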

Symmetric vs. Asymmetric Quantization

MNN supports two quantization schemes:

  • Symmetric quantization (default) -- The zero point is fixed at 0, and the scale is determined by the maximum absolute value of the weight tensor: scale = max(|w|) / (2^(b-1) - 1). This is simpler and compatible with older MNN versions.
  • Asymmetric quantization (--weightQuantAsymmetric) -- Both scale and zero point are computed to minimize the quantization range. This can improve accuracy for weight distributions that are not centered around zero, but requires a newer MNN runtime.
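The difference between the two schemes shows up directly in the computed parameters. A minimal sketch (the helper names are invented for the example) for a weight tensor that is not centered around zero:

```python
import numpy as np

def symmetric_params(w, bits):
    # Zero point fixed at 0; scale from the maximum absolute weight.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return scale, 0

def asymmetric_params(w, bits):
    # Scale and zero point fitted to the actual [min, max] range.
    scale = (w.max() - w.min()) / (2 ** bits - 1)
    zero_point = int(round(-w.min() / scale))
    return scale, zero_point

# Weights skewed toward positive values:
w = np.array([-0.2, 0.3, 0.8, 1.4], dtype=np.float32)
s_sym, _ = symmetric_params(w, 8)     # 1.4 / 127  (coarser grid)
s_asym, zp = asymmetric_params(w, 8)  # 1.6 / 255  (finer grid)
```

Because the asymmetric scheme spends its full integer range on the actual span of the weights, its step size is smaller here, which is exactly the accuracy advantage described above for off-center distributions.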

Block-Wise Quantization

Rather than computing a single scale and zero point per output channel (channel-wise quantization), block-wise quantization divides each weight channel into fixed-size blocks and computes separate quantization parameters for each block. This increases the model size slightly (due to additional scale/zero-point storage) but significantly improves accuracy by allowing finer-grained representation.

  • Channel-wise (--weightQuantBlock -1, default) -- One scale per output channel.
  • Block-wise (--weightQuantBlock 32-256) -- One scale per block of weights. Smaller blocks yield higher accuracy but larger overhead. Recommended range: 32 to 128.
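The accuracy benefit comes from each block getting a scale fitted to its own magnitude range. A minimal sketch (symmetric 4-bit, invented helper name, not MNN's internal layout):

```python
import numpy as np

def blockwise_scales(channel, block_size):
    """One symmetric 4-bit scale per block of `block_size` weights."""
    blocks = channel.reshape(-1, block_size)
    return np.abs(blocks).max(axis=1) / 7   # 2^(4-1) - 1 = 7 levels per side

rng = np.random.default_rng(0)
channel = rng.normal(size=128).astype(np.float32)   # one output channel

per_channel = np.abs(channel).max() / 7             # single channel-wise scale
per_block = blockwise_scales(channel, block_size=32)  # four per-block scales
```

Every block scale is at most the channel-wise scale, so blocks whose weights are small get a finer grid and lower reconstruction error, at the storage cost of one scale (and zero point, when asymmetric) per block.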

HQQ (Half-Quadratic Quantization)

HQQ is an advanced quantization method that optimizes the quantization grid to minimize the reconstruction error. Instead of using simple min/max statistics, HQQ formulates quantization as an optimization problem and iteratively refines the scale and zero-point parameters. This approach:

  • Increases quantization time compared to standard methods
  • Generally improves accuracy, particularly at lower bit-widths (4-bit and below)
  • Requires asymmetric quantization (automatically enabled when --hqq is set)
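The flavor of this optimization can be illustrated with a toy alternating refinement: hold the quantized values fixed and solve for the zero point in closed form, then re-quantize, repeating until the reconstruction error settles. This is a deliberately simplified sketch using a plain L2 objective; the actual HQQ method optimizes a robust l_p objective with a half-quadratic solver, and the function name is invented for the example:

```python
import numpy as np

def hqq_refine_zero(w, bits, iters=20):
    """Toy sketch: refine the zero point to reduce L2 reconstruction
    error, instead of fixing it from min/max statistics alone."""
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()) / qmax
    z = -w.min() / scale                    # min/max starting point
    for _ in range(iters):
        q = np.clip(np.round(w / scale + z), 0, qmax)
        # Closed-form L2-optimal zero point given the current assignment q.
        z = np.mean(q - w / scale)
    return q.astype(np.int32), scale, z

w = np.random.default_rng(1).normal(size=256).astype(np.float32)
q, scale, z = hqq_refine_zero(w, bits=4)
w_hat = (q - z) * scale   # reconstruction with the refined zero point
```

Each alternation step can only lower the L2 error (both sub-steps are exact minimizers), which mirrors why HQQ-style refinement pays off most at 4-bit and below, where the grid is coarse and the starting min/max parameters are far from optimal.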

Bit-Width Selection

The choice of bit-width determines the compression-accuracy trade-off:

  • 8-bit -- Safe for virtually all models. ~75% size reduction. Minimal accuracy degradation.
  • 4-bit -- Effective for large models (LLMs, large CNNs). ~87.5% size reduction. Noticeable accuracy loss for small models; block-wise quantization and HQQ strongly recommended.
  • 2-bit -- Only suitable for very large models with extreme parameter redundancy. ~93.75% size reduction. Significant accuracy loss.
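The size-reduction figures above follow directly from the ratio of the target bit-width to 32-bit floating point, before the small per-channel or per-block scale overhead:

```python
def size_reduction(bits, fp_bits=32):
    """Raw reduction from storing weights at `bits` instead of `fp_bits`."""
    return 1 - bits / fp_bits

for b in (8, 4, 2):
    print(f"{b}-bit: {size_reduction(b):.2%} smaller")
# 8-bit: 75.00%, 4-bit: 87.50%, 2-bit: 93.75%
# (ignores the overhead of storing scales and zero points)
```

With block-wise quantization the overhead grows as blocks shrink, which is why the recommended block sizes above (32 to 128) trade a few extra percent of storage for the accuracy gain.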

Relationship to Other Principles

  • Compression_Tool_Setup -- The MNNConvert tool that implements weight quantization must first be built.
  • Compression_Strategy_Selection -- Weight quantization is one option within the broader strategy decision framework.
  • Dynamic_Quantization -- Enables runtime speed improvements for weight-quantized models.
  • Compression_Validation -- Validates that the quantized model meets accuracy requirements.
