Heuristic: Alibaba MNN Weight Quantization Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Quantization, Model_Compression |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Decision framework for selecting the right weight quantization approach in MNN, balancing model size, accuracy, and inference speed.
Description
MNN offers multiple quantization paths: weight-only quantization (2-8 bit), FP16 compression, dynamic quantization with MNN_LOW_MEMORY, and offline full-graph INT8 quantization. The choice depends on whether you need size reduction, speed improvement, or both. Weight-only quantization compresses storage but does not accelerate inference unless combined with dynamic quantization (Memory_Low mode). FP16 offers a simpler 50% size reduction with minimal accuracy impact. Full-graph INT8 quantization requires calibration data but provides both size and speed benefits.
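The size figures above can be sanity-checked with simple arithmetic. The sketch below (the 100M-parameter count is hypothetical, and per-channel/per-block scale metadata is ignored) shows why FP16 halves the file and 8-bit weights cut it by roughly 75%:

```python
# Rough storage estimate for a weight tensor at each bit width.
# Illustrative arithmetic only; real .mnn files add headers and
# quantization scale metadata that is not counted here.

def weight_bytes(num_params: int, bits: int) -> int:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits // 8

params = 100_000_000              # hypothetical 100M-parameter model
fp32 = weight_bytes(params, 32)  # baseline
fp16 = weight_bytes(params, 16)  # 50% of baseline
int8 = weight_bytes(params, 8)   # 25% of baseline (75% smaller)
int4 = weight_bytes(params, 4)   # 12.5% of baseline

print(fp16 / fp32, int8 / fp32, int4 / fp32)  # 0.5 0.25 0.125
```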
Usage
Use this heuristic when deploying models to resource-constrained devices or when model size exceeds available memory/storage. Specifically apply when:
- You need to reduce model file size for mobile or edge deployment.
- You need to reduce runtime memory consumption on devices with limited RAM.
- You want to accelerate inference on quantized models without full calibration.
- You must decide between weight-only, dynamic, and full-graph quantization approaches.
The Insight (Rule of Thumb)
- Action 1: For size reduction only, use `--weightQuantBits 8` (75% smaller, no speed change).
  - The quantized model still decompresses weights to float32 at runtime, so inference speed is unchanged.
- Action 2: For size + speed, use `--weightQuantBits 4` with `MNN_LOW_MEMORY=ON` and `Memory_Low` at runtime.
  - This enables dynamic quantization where int8 GEMM kernels operate directly on quantized weights.
- Action 3: For maximum accuracy with quantization, use the `--hqq` flag and `--weightQuantBlock 128`.
  - HQQ (Half-Quadratic Quantization) optimizes the quantization grid iteratively to minimize reconstruction error.
- Action 4: Use a block size of 32-128 (smaller = higher precision, slightly larger model).
  - Block-wise quantization computes a separate scale/zero-point per block rather than per channel, giving finer granularity.
- Trade-off: HQQ increases quantization time but improves accuracy; dynamic quantization adds compute overhead from runtime dequantization but saves significant memory.
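The block-wise scheme the actions above rely on can be sketched in a few lines. This is not MNN's kernel code; the block size of 32, symmetric scaling, and the int4 range [-8, 7] are assumptions chosen for illustration:

```python
import numpy as np

# Sketch of block-wise symmetric weight quantization: each block of
# 32 consecutive values gets its own scale (abs-max / 7), and codes
# are clipped to the signed 4-bit range [-8, 7].

def quantize_blockwise(w: np.ndarray, block: int = 32, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1                 # 7 for int4
    blocks = w.reshape(-1, block)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                  # guard all-zero blocks
    q = np.clip(np.round(blocks / scales), -qmax - 1, qmax)
    return q.astype(np.int8), scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, s = quantize_blockwise(w)
err = np.abs(w - dequantize(q, s)).max()       # worst-case rounding error
```

Because each block's scale is its own abs-max divided by 7, the worst-case per-element error is bounded by half a quantization step.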
Reasoning
Weight quantization reduces memory footprint by representing float32 weights in lower bit-widths (2-8 bits). Without dynamic quantization, the runtime decompresses quantized weights back to float32 before computation, so only storage size is reduced. Dynamic quantization (Memory_Low mode) changes this by performing int8 computation at runtime rather than decompressing to float32, yielding both memory and speed benefits.
HQQ (Half-Quadratic Quantization) formulates the quantization parameter search as an optimization problem, directly minimizing the reconstruction error rather than relying on simple min/max statistics. This is particularly beneficial at lower bit-widths (4-bit and below) where quantization error is more pronounced.
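A toy example of the idea: instead of taking the scale straight from abs-max statistics, pick the scale that minimizes reconstruction error. HQQ proper solves this with a half-quadratic optimizer (and also adjusts the zero-point); the brute-force grid search below is only a simplified illustration of the objective:

```python
import numpy as np

def quant_error(w: np.ndarray, scale: float, qmax: int = 7) -> float:
    """Mean squared reconstruction error for int4 at a given scale."""
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return float(np.mean((w - q * scale) ** 2))

rng = np.random.default_rng(1)
w = rng.standard_normal(256)
w[0] = 10.0  # one outlier stretches the naive min/max grid

naive_scale = np.abs(w).max() / 7   # scale from abs-max statistics
# Error-minimizing search over shrunken scales: sacrificing the
# outlier can lower the error on the other 255 values.
candidates = naive_scale * np.linspace(0.2, 1.0, 81)
best_scale = min(candidates, key=lambda s: quant_error(w, s))
```

Since the candidate set includes the naive scale itself, the optimized choice can never do worse, which mirrors why error-driven parameter search helps most at low bit-widths.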
Block-wise quantization (vs channel-wise) divides each weight channel into fixed-size blocks with independent quantization parameters. A block size of 128 provides a good balance between accuracy and overhead; block size 32 gives the finest granularity but increases metadata storage.
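The metadata cost is easy to estimate. Assuming one fp16 scale per group for a hypothetical 4096x4096 int4 layer (real MNN storage layouts may differ), block 32 spends about 12.5% extra storage on scales versus about 3% at block 128:

```python
# Scale-metadata overhead relative to int4 weight data for a single
# hypothetical 4096x4096 layer, with one fp16 scale per group.

rows, cols = 4096, 4096
weight_bits = 4
scale_bytes = 2  # fp16 scale per group

def overhead(group_size: int) -> float:
    """Ratio of scale metadata to packed int4 weight data."""
    meta = (rows * cols // group_size) * scale_bytes
    data = rows * cols * weight_bits // 8
    return meta / data

print(overhead(cols))  # channel-wise: one scale per 4096-wide row
print(overhead(128))   # block 128: ~3.1% overhead
print(overhead(32))    # block 32: finest grain, 12.5% overhead
```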
Code evidence from CMakeLists.txt:79:
option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)
Code evidence from cli.cpp quantization options:
--weightQuantBits Weight quantization bit count (2-8)
--weightQuantBlock Block size for block-wise quantization (-1 for channel-wise)
--weightQuantAsymmetric Use asymmetric quantization
--hqq Use Half-Quadratic Quantization for better accuracy
Code evidence from docs/tools/compress.md dynamic quantization section:
When MNN_LOW_MEMORY=ON and Memory_Low is set at runtime, weight-quantized
models perform int8 computation directly instead of decompressing to float32,
providing both memory savings and inference acceleration.