Principle: Alibaba MNN Dynamic Quantization
| Field | Value |
|---|---|
| Principle Name | Dynamic_Quantization |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Runtime dynamic weight dequantization for accelerated inference on quantized models |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
By default, weight quantization in MNN is a storage-only optimization: weights are stored in a compressed format but dequantized back to floating-point before computation. This reduces model size but does not improve inference speed. Dynamic quantization changes this behavior by performing weight dequantization during GEMM (General Matrix Multiplication) operations, allowing the compute kernels to operate directly on low-precision integer data.
This approach combines two mechanisms:
- Compile-time flag (`MNN_LOW_MEMORY`) -- Enables the compilation of specialized int8 GEMM kernels that can consume weight-quantized data directly.
- Runtime configuration (`BackendConfig.memory = Memory_Low`) -- Activates the low-memory inference path at runtime, instructing the scheduler to use the int8 compute kernels for applicable operations.
Theoretical Foundation
Trading Compute for Memory
The central insight behind dynamic quantization is that dequantization can be fused into the compute kernel rather than performed as a separate preprocessing step. In a standard weight-quantized inference:
- Load quantized weight from memory (int4/int8)
- Dequantize entire weight tensor to float32
- Perform float32 GEMM
With dynamic dequantization:
- Load quantized weight from memory (int4/int8)
- Dequantize weight blocks on-the-fly within the GEMM kernel
- Accumulate results in higher precision (int32 or float32)
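The two paths above can be contrasted in a small self-contained sketch (plain C++, not MNN's actual kernels): weights are symmetric per-block int8 with one float scale per block along K, and the fused path applies the scale once per K-block instead of materializing a float32 weight tensor.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Path A: dequantize the entire weight tensor up front, then run a
// float32 GEMM. The full float32 copy of the weights exists in memory.
std::vector<float> dequant_then_gemm(const std::vector<float>& x,      // M x K activations
                                     const std::vector<int8_t>& wq,    // K x N quantized weights
                                     const std::vector<float>& scales, // (K/block) x N per-block scales
                                     int M, int K, int N, int block) {
    std::vector<float> w(K * N); // transient full-precision weight tensor
    for (int k = 0; k < K; ++k)
        for (int n = 0; n < N; ++n)
            w[k * N + n] = wq[k * N + n] * scales[(k / block) * N + n];
    std::vector<float> y(M * N, 0.f);
    for (int m = 0; m < M; ++m)
        for (int k = 0; k < K; ++k)
            for (int n = 0; n < N; ++n)
                y[m * N + n] += x[m * K + k] * w[k * N + n];
    return y;
}

// Path B: fuse dequantization into the GEMM. Each K-block is accumulated
// against the raw int8 weights and the scale is applied once at the block
// boundary, so no float32 weight tensor is ever materialized.
std::vector<float> fused_dequant_gemm(const std::vector<float>& x,
                                      const std::vector<int8_t>& wq,
                                      const std::vector<float>& scales,
                                      int M, int K, int N, int block) {
    std::vector<float> y(M * N, 0.f);
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float acc = 0.f;
            for (int kb = 0; kb < K; kb += block) {
                float blockAcc = 0.f; // partial sum over one quantization block
                for (int k = kb; k < kb + block; ++k)
                    blockAcc += x[m * K + k] * wq[k * N + n];
                acc += blockAcc * scales[(kb / block) * N + n]; // dequant per block
            }
            y[m * N + n] = acc;
        }
    return y;
}
```

Both paths compute the same result (up to float rounding); the difference is purely in when dequantization happens and how much memory traffic it costs.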
This approach has several advantages:
- Reduced memory bandwidth -- Quantized weights occupy 4x (int8) to 8x (int4) less memory than float32, reducing data movement from memory to compute units. Memory bandwidth is often the bottleneck on mobile and edge devices.
- Lower peak memory -- The full float32 weight tensor never needs to exist in memory simultaneously.
- Potential compute acceleration -- On hardware with int8 multiply-accumulate instructions (e.g., ARM NEON dot-product, x86 VNNI), the integer arithmetic can be faster than float32.
Two-Level Configuration
MNN implements dynamic quantization through a two-level configuration:
- Build level -- The `MNN_LOW_MEMORY` CMake flag controls whether int8 weight-dequant GEMM kernels are compiled into the library. When disabled, the runtime low-memory path is unavailable regardless of runtime settings. A related flag, `MNN_CPU_WEIGHT_DEQUANT_GEMM`, specifically controls compilation of the CPU weight-dequantization GEMM kernels.
- Runtime level -- The `BackendConfig.memory` enum selects the memory optimization level. `Memory_Low` activates the dynamic dequantization path. `Memory_Normal` uses standard float inference. `Memory_High` may cache additional intermediate data for speed at the cost of memory.
The `BackendConfig.precision` setting operates independently: `Precision_Low` enables FP16 compute on capable hardware, while `Precision_Normal` uses full precision. The two axes (memory and precision) are orthogonal and can be combined freely.
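A minimal sketch of the runtime side, assuming MNN's standard C++ session API (`Interpreter`, `ScheduleConfig`, `BackendConfig`); the model path is a placeholder and error handling is omitted:

```cpp
#include <memory>
#include <MNN/Interpreter.hpp>

int main() {
    // Load a weight-quantized model ("model.mnn" is a placeholder path).
    std::shared_ptr<MNN::Interpreter> net(
        MNN::Interpreter::createFromFile("model.mnn"));

    // Select the low-memory (dynamic dequant) path; optionally pair it
    // with FP16 compute on the independent precision axis.
    MNN::BackendConfig backendConfig;
    backendConfig.memory    = MNN::BackendConfig::Memory_Low;
    backendConfig.precision = MNN::BackendConfig::Precision_Low;

    MNN::ScheduleConfig config;
    config.type          = MNN_FORWARD_CPU;
    config.backendConfig = &backendConfig;

    auto session = net->createSession(config);
    // ... fill inputs, net->runSession(session), read outputs ...
    return 0;
}
```

This is a configuration sketch: if the library was built without `MNN_LOW_MEMORY`, `Memory_Low` has no kernels to dispatch to and inference falls back to the standard float path.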
Applicable Operations
Dynamic dequantization primarily accelerates:
- Convolution (Conv2D, DepthwiseConv2D)
- MatMul (fully connected layers, attention layers)
- LSTM weight operations
These are the same operations whose weights are compressed by `--weightQuantBits` in the MNNConvert tool.
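For reference, the weight-quantized model is produced at conversion time. A typical MNNConvert invocation might look like the following (file names and bizCode are placeholders; check `MNNConvert --help` for the flags available in your build):

```shell
# Convert an ONNX model to MNN format with 8-bit weight quantization.
# The quantized weights are what the Memory_Low runtime path consumes.
./MNNConvert -f ONNX \
    --modelFile model.onnx \
    --MNNModel model_quant.mnn \
    --weightQuantBits 8 \
    --bizCode demo
```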
Relationship to Other Principles
- Weight_Quantization -- Dynamic quantization operates on models that have already been weight-quantized. The weight quantization step is a prerequisite.
- Compression_Strategy_Selection -- Dynamic quantization is the speed-acceleration path within the weight quantization strategy branch.
- Compression_Tool_Setup -- The MNN library must be built with `MNN_LOW_MEMORY=ON` for dynamic quantization to be available.
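On the build side, the two CMake flags discussed above are passed when configuring MNN; a sketch of an out-of-source build:

```shell
# Compile MNN with the int8 weight-dequant GEMM kernels included.
# MNN_CPU_WEIGHT_DEQUANT_GEMM additionally enables the CPU
# weight-dequantization GEMM kernels.
mkdir -p build && cd build
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_CPU_WEIGHT_DEQUANT_GEMM=ON
make -j8
```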