
Heuristic:Alibaba MNN Weight Quantization Strategy

From Leeroopedia



Knowledge Sources
Domains Optimization, Quantization, Model_Compression
Last Updated 2026-02-10 14:00 GMT

Overview

Decision framework for selecting the right weight quantization approach in MNN, balancing model size, accuracy, and inference speed.

Description

MNN offers multiple quantization paths: weight-only quantization (2-8 bit), FP16 compression, dynamic quantization with MNN_LOW_MEMORY, and offline full-graph INT8 quantization. The choice depends on whether you need size reduction, speed improvement, or both. Weight-only quantization compresses storage but does not accelerate inference unless combined with dynamic quantization (Memory_Low mode). FP16 offers a simpler 50% size reduction with minimal accuracy impact. Full-graph INT8 quantization requires calibration data but provides both size and speed benefits.
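
The size trade-offs above can be sketched with back-of-envelope arithmetic. This is an illustrative calculation only (the 100M-parameter count is an assumption, and per-channel/per-block scale metadata is ignored, so real quantized files are slightly larger):

```python
# Approximate on-disk size of a weight tensor under each compression path.
def model_size_mb(num_params: int, bits_per_weight: float) -> float:
    return num_params * bits_per_weight / 8 / 1e6

params = 100_000_000  # hypothetical 100M-parameter model
sizes = {
    "fp32 baseline": model_size_mb(params, 32),
    "fp16":          model_size_mb(params, 16),  # 50% smaller
    "int8 weights":  model_size_mb(params, 8),   # 75% smaller
    "int4 weights":  model_size_mb(params, 4),   # 87.5% smaller
}
for name, mb in sizes.items():
    print(f"{name:14s} {mb:6.0f} MB")
```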

Usage

Use this heuristic when deploying models to resource-constrained devices or when model size exceeds available memory/storage. Specifically apply when:

  • You need to reduce model file size for mobile or edge deployment.
  • You need to reduce runtime memory consumption on devices with limited RAM.
  • You want to accelerate inference on quantized models without full calibration.
  • You must decide between weight-only, dynamic, and full-graph quantization approaches.

The Insight (Rule of Thumb)

  • Action 1: For size reduction only: Use --weightQuantBits 8 (75% smaller, no speed change).
    • The quantized model still decompresses weights to float32 at runtime, so inference speed is unchanged.
  • Action 2: For size + speed: Use --weightQuantBits 4 with MNN_LOW_MEMORY=ON and Memory_Low at runtime.
    • This enables dynamic quantization where int8 GEMM kernels operate directly on quantized weights.
  • Action 3: For maximum accuracy with quantization: Use --hqq flag and --weightQuantBlock 128.
    • HQQ (Half-Quadratic Quantization) optimizes the quantization grid iteratively to minimize reconstruction error.
  • Action 4: Choose a block size of 32-128 (smaller blocks give higher precision at the cost of a slightly larger model).
    • Block-wise quantization computes separate scale/zero-point per block rather than per channel, giving finer granularity.
  • Trade-off: HQQ increases quantization time but improves accuracy; dynamic quantization adds compute overhead from runtime dequantization but saves significant memory.
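
To make Action 1 concrete, here is a minimal Python sketch of a symmetric 8-bit weight round trip (the weight values are made up, and MNN's actual converter supports asymmetric and block-wise variants; this only illustrates why storage shrinks while compute stays float):

```python
# Simplified symmetric, per-tensor 8-bit weight quantization.
weights = [0.92, -0.41, 0.03, -1.27, 0.68]

scale = max(abs(w) for w in weights) / 127      # one scale for the tensor
q = [round(w / scale) for w in weights]         # stored as int8 on disk: 1 byte each
dequant = [qi * scale for qi in q]              # decompressed back to float at runtime

# Without dynamic quantization, inference runs on `dequant` in float32,
# so only the stored size changed, not the math.
max_err = max(abs(w - d) for w, d in zip(weights, dequant))
print(q)
print(max_err)  # bounded by scale / 2
```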

Reasoning

Weight quantization reduces memory footprint by representing float32 weights in lower bit-widths (2-8 bits). Without dynamic quantization, the runtime decompresses quantized weights back to float32 before computation, so only storage size is reduced. Dynamic quantization (Memory_Low mode) changes this by performing int8 computation at runtime rather than decompressing to float32, yielding both memory and speed benefits.
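
The dynamic-quantization idea can be sketched in a few lines: the dot product accumulates over integer codes, and the float scales are applied once after accumulation, so the float32 weight tensor is never materialized. This is a simplified symmetric per-tensor illustration with made-up values, not MNN's actual int8 GEMM kernel:

```python
# Quantize a vector to int8 codes plus one float scale.
def quantize(xs):
    scale = max(abs(x) for x in xs) / 127
    return [round(x / scale) for x in xs], scale

w = [0.5, -0.25, 0.125, 1.0]   # weights: stay quantized in memory
a = [1.0, 2.0, -1.0, 0.5]      # activations: quantized on the fly

qw, sw = quantize(w)
qa, sa = quantize(a)

acc = sum(wi * ai for wi, ai in zip(qw, qa))  # pure integer accumulation
approx = acc * sw * sa                        # one float rescale at the end

exact = sum(wi * ai for wi, ai in zip(w, a))
print(approx, exact)  # close, without ever dequantizing the weights
```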

HQQ (Half-Quadratic Quantization) formulates the quantization parameter search as an optimization problem, directly minimizing the reconstruction error rather than relying on simple min/max statistics. This is particularly beneficial at lower bit-widths (4-bit and below) where quantization error is more pronounced.
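
The principle can be illustrated with a toy scale search. Real HQQ solves a half-quadratic optimization over scale and zero-point; the grid search below is only a sketch of the underlying idea (fit the scale to reconstruction error rather than to min/max statistics), using made-up weights and an extreme 2-bit grid (codes in {-1, 0, 1}) so the effect is visible:

```python
w = [0.3, 0.4, 0.5, 1.0]   # toy weight values
LEVELS = 1                 # 2-bit symmetric: codes in {-1, 0, 1}

def recon_error(scale):
    q = [max(-LEVELS, min(LEVELS, round(x / scale))) for x in w]
    return sum((x - qi * scale) ** 2 for x, qi in zip(w, q))

# Naive scale from min/max statistics: exactly covers the largest weight.
minmax_scale = max(abs(x) for x in w) / LEVELS

# Optimized scale: search a grid for the minimum reconstruction error.
best_scale = min((minmax_scale * k / 100 for k in range(1, 101)),
                 key=recon_error)

print(recon_error(minmax_scale))  # error from the min/max scale
print(recon_error(best_scale))    # lower error from the fitted scale
```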

Block-wise quantization (vs channel-wise) divides each weight channel into fixed-size blocks with independent quantization parameters. A block size of 128 provides a good balance between accuracy and overhead; block size 32 gives the finest granularity but increases metadata storage.
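
The granularity trade-off can be sketched as follows, assuming a hypothetical weight channel whose magnitude varies along its length (simplified symmetric 4-bit scheme; in MNN the block size corresponds to --weightQuantBlock). The small-magnitude block keeps far more precision under its own scale, at the cost of storing one extra scale per block:

```python
def quant_error(values, scale):
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return sum((v - qi * scale) ** 2 for v, qi in zip(values, q))

# One weight channel: tiny weights in the first half, large in the second.
channel = [0.01, 0.02, -0.015, 0.01, 2.0, -1.5, 1.0, -2.0]
BLOCK = 4

# Channel-wise: a single scale must cover both tiny and large weights.
chan_err = quant_error(channel, max(abs(v) for v in channel) / 7)

# Block-wise: each block gets its own scale.
block_err = 0.0
scales = []
for i in range(0, len(channel), BLOCK):
    block = channel[i:i + BLOCK]
    s = max(abs(v) for v in block) / 7
    scales.append(s)  # extra metadata: one scale per block
    block_err += quant_error(block, s)

print(chan_err, block_err, len(scales))  # lower total error, 2 scales stored
```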

Code evidence from CMakeLists.txt:79:

option(MNN_LOW_MEMORY "Build MNN support low memory for weight quant model." OFF)

Code evidence from cli.cpp quantization options:

--weightQuantBits    Weight quantization bit count (2-8)
--weightQuantBlock   Block size for block-wise quantization (-1 for channel-wise)
--weightQuantAsymmetric  Use asymmetric quantization
--hqq               Use Half-Quadratic Quantization for better accuracy

Code evidence from docs/tools/compress.md dynamic quantization section:

When MNN_LOW_MEMORY=ON and Memory_Low is set at runtime, weight-quantized
models perform int8 computation directly instead of decompressing to float32,
providing both memory savings and inference acceleration.
