Heuristic: Alibaba MNN Weight Quantization Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization, Model_Compression |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Decision framework for selecting the right weight quantization approach in MNN, balancing model size, accuracy, and inference speed.
Description
MNN offers multiple quantization paths: weight-only quantization (2-8 bit), FP16 compression, dynamic quantization with MNN_LOW_MEMORY, and offline full-graph INT8 quantization. The choice depends on whether you need size reduction, speed improvement, or both. Weight-only quantization compresses storage but does not accelerate inference unless combined with dynamic quantization (Memory_Low mode). FP16 offers a simpler 50% size reduction with minimal accuracy impact. Full-graph INT8 quantization requires calibration data but provides both size and speed benefits.
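The size figures above can be sanity-checked with simple arithmetic. The sketch below (the 100M-parameter count is hypothetical, and per-channel/per-block scale metadata is ignored) shows why FP16 halves the file and 8-bit weights cut it by roughly 75%:

```python
# Rough storage estimate for a weight tensor at each bit width.
# Illustrative arithmetic only; real .mnn files add headers and
# quantization scale metadata that is not counted here.

def weight_bytes(num_params: int, bits: int) -> int:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits // 8

params = 100_000_000              # hypothetical 100M-parameter model
fp32 = weight_bytes(params, 32)  # baseline
fp16 = weight_bytes(params, 16)  # 50% of baseline
int8 = weight_bytes(params, 8)   # 25% of baseline (75% smaller)
int4 = weight_bytes(params, 4)   # 12.5% of baseline

print(fp16 / fp32, int8 / fp32, int4 / fp32)  # 0.5 0.25 0.125
```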
Usage
Use this heuristic when deploying models to resource-constrained devices or when model size exceeds available memory/storage. Specifically apply when:
- You need to reduce model file size for mobile or edge deployment.
- You need to reduce runtime memory consumption on devices with limited RAM.
- You want to accelerate inference on quantized models without full calibration.
- You must decide between weight-only, dynamic, and full-graph quantization approaches.
The Insight (Rule of Thumb)
- Action 1: For size reduction only, use `--weightQuantBits 8` (75% smaller, no speed change).
  - The quantized model still decompresses weights to float32 at runtime, so inference speed is unchanged.
- Action 2: For size + speed, use `--weightQuantBits 4` with `MNN_LOW_MEMORY=ON` and `Memory_Low` at runtime.
  - This enables dynamic quantization where int8 GEMM kernels operate directly on quantized weights.
- Action 3: For maximum accuracy with quantization, use the `--hqq` flag and `--weightQuantBlock 128`.
  - HQQ (Half-Quadratic Quantization) optimizes the quantization grid iteratively to minimize reconstruction error.
- Action 4: Use a block size of 32-128 (smaller = higher precision, slightly larger model).
  - Block-wise quantization computes a separate scale/zero-point per block rather than per channel, giving finer granularity.
- Trade-off: HQQ increases quantization time but improves accuracy; dynamic quantization adds compute overhead from runtime dequantization but saves significant memory.
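The block-wise scheme the actions above rely on can be sketched in a few lines. This is not MNN's kernel code; the block size of 32, symmetric scaling, and the int4 range [-8, 7] are assumptions chosen for illustration:

```python
import numpy as np

# Sketch of block-wise symmetric weight quantization: each block of
# 32 consecutive values gets its own scale (abs-max / 7), and codes
# are clipped to the signed 4-bit range [-8, 7].

def quantize_blockwise(w: np.ndarray, block: int = 32, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1                 # 7 for int4
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                  # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_blockwise(w)
err = np.abs(w - dequantize(q, s)).max()       # worst-case rounding error
```

Because each block's scale is its own abs-max divided by 7, the worst-case per-element error is bounded by half a quantization step.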
Reasoning
Weight quantization reduces memory footprint by representing float32 weights in lower bit-widths (2-8 bits). Without dynamic quantization, the runtime decompresses quantized weights back to float32 before computation, so only storage size is reduced. Dynamic quantization (Memory_Low mode) changes this by performing int8 computation at runtime rather than decompressing to float32, yielding both memory and speed benefits.
HQQ (Half-Quadratic Quantization) formulates the quantization parameter search as an optimization problem, directly minimizing the reconstruction error rather than relying on simple min/max statistics. This is particularly beneficial at lower bit-widths (4-bit and below) where quantization error is more pronounced.
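A toy example of the idea: instead of taking the scale straight from abs-max statistics, pick the scale that minimizes reconstruction error. HQQ proper solves this with a half-quadratic optimizer (and also adjusts the zero-point); the brute-force grid search below is only a simplified illustration of the objective:

```python
import numpy as np

def quant_error(w: np.ndarray, scale: float, qmax: int = 7) -> float:
    """Mean squared reconstruction error for int4 at a given scale."""
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return float(np.mean((w - q * scale) ** 2))

rng = np.random.default_rng(1)
w = rng.standard_normal(256)
w[0] = 10.0  # one outlier stretches the naive min/max grid

naive_scale = np.abs(w).max() / 7   # scale from abs-max statistics
# Error-minimizing search over shrunken scales: sacrificing the
# outlier can lower the error on the other 255 values.
candidates = naive_scale * np.linspace(0.2, 1.0, 81)
best_scale = min(candidates, key=lambda s: quant_error(w, s))
```

Since the candidate set includes the naive scale itself, the optimized choice can never do worse, which mirrors why error-driven parameter search helps most at low bit-widths.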
Block-wise quantization (vs channel-wise) divides each weight channel into fixed-size blocks with independent quantization parameters. A block size of 128 provides a good balance between accuracy and overhead; block size 32 gives the finest granularity but increases metadata storage.
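The metadata cost is easy to estimate. Assuming one fp16 scale per group for a hypothetical 4096x4096 int4 layer (real MNN storage layouts may differ), block 32 spends about 12.5% extra storage on scales versus about 3% at block 128:

```python
# Scale-metadata overhead relative to int4 weight data for a single
# hypothetical 4096x4096 layer, with one fp16 scale per group.

rows, cols = 4096, 4096
weight_bits = 4
scale_bytes = 2  # fp16 scale per group

def overhead(group_size: int) -> float:
    """Ratio of scale metadata to packed int4 weight data."""
    meta = (rows * cols // group_size) * scale_bytes
    data = rows * cols * weight_bits // 8
    return meta / data

print(overhead(cols))  # channel-wise: one scale per 4096-wide row
print(overhead(128))   # block 128: ~3.1% overhead
print(overhead(32))    # block 32: finest grain, 12.5% overhead
```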
Code evidence from CMakeLists.txt:79:
option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)
Code evidence from cli.cpp quantization options:
--weightQuantBits Weight quantization bit count (2-8)
--weightQuantBlock Block size for block-wise quantization (-1 for channel-wise)
--weightQuantAsymmetric Use asymmetric quantization
--hqq Use Half-Quadratic Quantization for better accuracy
Code evidence from docs/tools/compress.md dynamic quantization section:
When MNN_LOW_MEMORY=ON and Memory_Low is set at runtime, weight-quantized
models perform int8 computation directly instead of decompressing to float32,
providing both memory savings and inference acceleration.