Principle: Alibaba MNN Compression Validation
| Field | Value |
|---|---|
| Principle Name | Compression_Validation |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Validating compressed model accuracy and finding optimal quantization parameters |
| Last Updated | 2026-02-10 14:00 GMT |
Overview
Model compression inevitably introduces quantization error. Compression validation is the process of measuring this error and determining whether the compressed model meets the deployment accuracy requirements. Beyond simple pass/fail validation, MNN provides an automated search mechanism that iteratively explores quantization configurations (bit-widths per layer, block sizes) to find the best compression-accuracy trade-off within a user-specified error budget.
This principle covers two complementary validation workflows:
- Automated weight quantization tuning (auto_quant.py) -- Searches over per-layer bit-widths and block sizes to maximize compression while keeping the relative error below a target rate.
- Offline INT8 quantization with calibration (quantized.out) -- Uses calibration data to compute activation scales via KL-divergence, ADMM, or EMA methods, producing a fully quantized INT8 model.
Theoretical Foundation
Error Measurement
MNN measures compression error as the maximum relative difference between the float model and compressed model outputs across a test dataset:
error_rate = max(diffMax / absMaxV)
Where:
- diffMax is the maximum absolute difference between the float and quantized model outputs for a given test input.
- absMaxV is the maximum absolute value of the float model output.
This relative error metric captures the worst-case degradation across all outputs and test samples. A typical target rate is 0.05 (5% maximum relative error).
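The metric above can be sketched in a few lines of NumPy. This is an illustrative helper, not MNN's implementation; the function name and the list-of-arrays calling convention are assumptions for the example.

```python
import numpy as np

def relative_error(float_outputs, quant_outputs):
    """Worst-case relative error across a set of test outputs.

    float_outputs / quant_outputs: lists of numpy arrays, one pair
    per test input (illustrative names, not an MNN API).
    """
    worst = 0.0
    for f, q in zip(float_outputs, quant_outputs):
        diff_max = np.max(np.abs(f - q))    # diffMax
        abs_max_v = np.max(np.abs(f))       # absMaxV
        worst = max(worst, diff_max / abs_max_v)
    return worst

# A configuration passes when relative_error(...) <= target_rate (e.g. 0.05).
```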
Automated Search Strategy
The auto_quant.py tool uses a multi-phase search to find optimal quantization parameters:
Phase 1: Bit-Width Selection -- Starting from 8-bit quantization with block size 64, the tool iterates over layers (sorted by parameter count) and attempts to reduce each layer to 4-bit. If the error exceeds the target rate after downgrading a layer, it rolls back to 8-bit for that layer. This identifies which layers are sensitive to aggressive quantization.
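The greedy rollback logic of Phase 1 can be sketched as follows. This is a simplified illustration of the strategy described above, not the actual auto_quant.py source; `eval_error` stands in for an assumed hook that quantizes the model with the given per-layer bits and measures the relative error.

```python
def phase1_bitwidths(layers, eval_error, target_rate):
    """Greedy per-layer bit-width selection (illustrative sketch).

    layers: layer names sorted by descending parameter count.
    eval_error: callable mapping {layer: bits} -> relative error of
    the resulting quantized model (assumed hook, not an MNN API).
    """
    bits = {name: 8 for name in layers}  # start everything at 8-bit
    for name in layers:
        bits[name] = 4                   # try aggressive 4-bit
        if eval_error(bits) > target_rate:
            bits[name] = 8               # roll back: layer is sensitive
    return bits
```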
Phase 2: Block Size Optimization -- The tool searches block sizes (256, 128, 64, 32) to find the smallest block size that keeps the error below the target. For mixed configurations, a binary search determines the optimal split point where larger layers use a coarser block size and smaller layers use a finer block size.
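The split-point search in Phase 2 can be expressed as a standard binary search. This sketch assumes (as the phase description implies) that error grows monotonically as more layers move to the coarser block size; the function names and the coarse/fine defaults are illustrative, not MNN's.

```python
def find_split_point(layers, eval_error, target_rate,
                     coarse=128, fine=32):
    """Binary-search the split index k: layers[:k] (the larger layers)
    use the coarse block size, layers[k:] use the fine one.
    Illustrative sketch under a monotonic-error assumption.
    """
    def error_at(k):
        blocks = {n: (coarse if i < k else fine)
                  for i, n in enumerate(layers)}
        return eval_error(blocks)

    lo, hi = 0, len(layers)        # k = 0: all fine; k = len: all coarse
    while lo < hi:                 # largest k still under the target
        mid = (lo + hi + 1) // 2
        if error_at(mid) <= target_rate:
            lo = mid
        else:
            hi = mid - 1
    return lo
```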
Phase 3: Skip Quantization for Sensitive Layers -- If the error still exceeds the target after phases 1 and 2, the tool iterates from the last layer to the first, setting individual layers' bits to 0 (skip quantization entirely) until the error drops below the target.
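Phase 3 reduces to a simple backward walk, sketched below. Again this is an illustration of the described strategy, not the MNN source; `eval_error` is the same assumed evaluation hook as above.

```python
def phase3_skip_layers(layers, bits, eval_error, target_rate):
    """Walk from the last layer to the first, disabling quantization
    (bits = 0) per layer until the error drops below the target.
    Illustrative sketch, not the auto_quant.py implementation.
    """
    for name in reversed(layers):
        if eval_error(bits) <= target_rate:
            break                  # target met, stop skipping layers
        bits[name] = 0             # keep this layer in float
    return bits
```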
Calibration-Based Quantization
For offline INT8 quantization, the quantized.out tool uses calibration data to compute activation quantization scales. Three methods are supported:
- KL-divergence -- Finds the threshold that minimizes the information loss (KL divergence) between the original floating-point distribution and the quantized distribution. The feature map range is divided into 2048 bins for histogram analysis. Requires 100-1000 calibration images.
- ADMM (Alternating Direction Method of Multipliers) -- Formulates scale computation as an optimization problem. Generally requires fewer calibration samples (one batch).
- EMA (Exponential Moving Average) -- Computes quantization parameters using exponential moving averages of activation statistics. Supports asymmetric quantization and may provide better accuracy. Batch size should match training conditions.
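The KL-divergence method can be sketched as a threshold search over a 2048-bin histogram, in the spirit of the description above. This is a simplified illustration, not MNN's exact implementation: the re-binning scheme, the epsilon handling, and the function names are assumptions for the example.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) over bins where p > 0 (q padded away from zero)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_scale(activations, num_bins=2048, num_levels=128):
    """Simplified KL-based threshold search (illustrative sketch).

    activations: 1-D float array of sampled feature-map values.
    Returns the quantization scale threshold / (num_levels - 1).
    """
    abs_vals = np.abs(activations)
    hist, edges = np.histogram(abs_vals, bins=num_bins,
                               range=(0, abs_vals.max()))
    hist = hist.astype(np.float64)

    best_i, best_kl = num_bins, np.inf
    for i in range(num_levels, num_bins + 1):
        # reference distribution: clip outliers into the last kept bin
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        p /= p.sum()

        # candidate distribution: re-bin to num_levels, expand back
        chunks = np.array_split(hist[:i], num_levels)
        q = np.concatenate([
            np.full(len(c), c.sum() / max((c > 0).sum(), 1)) * (c > 0)
            for c in chunks])
        q /= q.sum()

        kl = kl_divergence(p, np.where(q > 0, q, 1e-12))
        if kl < best_kl:
            best_kl, best_i = kl, i

    threshold = edges[best_i]      # right edge of the kept range
    return threshold / (num_levels - 1)
```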
Weight quantization for offline INT8 uses either:
- MAX_ABS (default) -- Symmetric quantization using the maximum absolute weight value.
- ADMM -- Optimization-based weight quantization for improved accuracy.
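The default MAX_ABS method amounts to symmetric quantization with a single scale per tensor. A minimal sketch, assuming per-tensor granularity (the function name is illustrative):

```python
import numpy as np

def max_abs_quantize(weights, num_bits=8):
    """Symmetric MAX_ABS weight quantization (illustrative sketch).

    Maps weights to signed integers in [-(2^(b-1)-1), 2^(b-1)-1]
    using one scale derived from the maximum absolute weight.
    """
    qmax = 2 ** (num_bits - 1) - 1           # 127 for int8
    scale = np.max(np.abs(weights)) / qmax   # assumes non-zero weights
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# Dequantize with q * scale; rounding error per weight is at most scale / 2.
```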
Relationship to Other Principles
- Weight_Quantization -- The auto_quant tool validates and optimizes the output of weight quantization.
- Compression_Strategy_Selection -- Validation results inform strategy selection decisions; auto_quant is the recommended path for accuracy-critical deployments.
- Compression_Tool_Setup -- Both MNNConvert (used by auto_quant) and quantized.out must be built.