# Principle: Alibaba MNN Compression Strategy Selection
| Field | Value |
|---|---|
| Principle Name | Compression_Strategy_Selection |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Selecting optimal model compression strategy based on deployment constraints |
| Last Updated | 2026-02-10 14:00 GMT |
## Overview
Choosing the right model compression strategy is a critical deployment decision that determines the trade-off between model size, inference speed, accuracy, and implementation complexity. MNN provides four distinct post-training compression approaches, each targeting different points on the compression-quality Pareto frontier. This principle establishes a decision framework for selecting the optimal strategy based on target device capabilities, accuracy requirements, and latency budget.
## Compression Approaches
MNN supports the following post-training compression strategies:
- Weight Quantization (2-8 bit) -- Compresses weight storage from FP32 to lower bit-widths. Reduces model size by 75-87% without requiring calibration data. By default, weights are dequantized back to float at inference time, so there is no speed improvement unless dynamic quantization is enabled.
- FP16 Compression -- Stores weights in half-precision floating-point format. Reduces model size by 50% with virtually no accuracy loss. This is independent of runtime FP16 acceleration (Precision_Low), which can be applied separately.
- Automatic Quantization Tuning (auto_quant.py, 4-8 bit) -- Automatically searches for the best per-layer quantization configuration (bit-width, block size) within a user-specified error budget. Requires a test dataset for accuracy validation. Produces the smallest model that meets the accuracy target.
- Offline Quantization (8-bit INT8) -- Full-graph INT8 quantization using calibration images (100-1000 samples). Both weights and activations are quantized, enabling integer arithmetic during inference for both size reduction and speed improvement.
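The storage-only behavior of plain weight quantization can be sketched in a few lines. This is an illustrative symmetric per-tensor scheme, not MNN's exact implementation: weights are stored as small integers plus a scale, then dequantized back to float before the (still floating-point) compute runs.

```python
# Illustrative symmetric weight quantization (8-bit: 75% storage saving vs FP32).
# Function names and scheme are a sketch, not MNN's API.

def quantize_weights(weights, bits=8):
    """Map float weights to signed integers in [-(2**(bits-1)-1), 2**(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs else 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize_weights(q, scale):
    """Without dynamic quantization, inference sees floats again -- size shrinks,
    arithmetic does not change, hence no speed-up by default."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.03, 1.27]
q, scale = quantize_weights(weights, bits=8)
restored = dequantize_weights(q, scale)
```

Each stored value now fits in one byte instead of four, which is where the 75% size reduction for 8-bit comes from; 4-bit and 2-bit extend the same idea with smaller integer ranges.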
## Trade-Off Analysis
### Size vs. Speed
Weight quantization and FP16 compression are storage-only optimizations by default. The model occupies less disk and memory, but inference still operates on floating-point values. To achieve actual speed improvement from weight quantization, the dynamic quantization runtime mode (MNN_LOW_MEMORY build + Memory_Low config) must be enabled.
Offline INT8 quantization is the only approach that provides both size reduction and inference acceleration out of the box, because both weights and activations use integer arithmetic.
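The speed distinction can be sketched as follows: under dynamic quantization the activation scale is computed per inference and the inner product accumulates in integers, with a single float rescale at the end. This is a plain-Python illustration of the arithmetic, not MNN's implementation (MNN enables it via the MNN_LOW_MEMORY build plus the Memory_Low config, as noted above).

```python
# Sketch of integer-arithmetic inference: quantize activations at runtime,
# accumulate in int, rescale once. Names are illustrative.

def quant(vals, bits=8):
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(v) for v in vals)
    scale = m / qmax if m else 1.0
    return [round(v / scale) for v in vals], scale

def int8_dot(x, w):
    xq, xs = quant(x)   # activation scale computed per batch at runtime
    wq, ws = quant(w)   # weight scale known offline
    acc = sum(a * b for a, b in zip(xq, wq))  # pure integer accumulate
    return acc * xs * ws                      # one float rescale at the end

x = [0.1, -0.2, 0.3]
w = [1.0, 0.5, -0.25]
approx = int8_dot(x, w)
exact = sum(a * b for a, b in zip(x, w))
```

The integer accumulate is what hardware can execute faster than FP32 multiply-adds; offline INT8 gets this for free because activation scales are fixed during calibration rather than recomputed at runtime.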
### Accuracy vs. Compression Ratio
The accuracy impact increases with the aggressiveness of compression:
- FP16 -- Near-zero accuracy loss; safe for virtually all models.
- Weight Quant 8-bit -- Minimal accuracy loss for most models; does not require calibration data.
- Weight Quant 4-bit -- Noticeable accuracy loss for sensitive models; block-wise quantization and HQQ can mitigate degradation.
- Weight Quant 2-bit -- Significant accuracy loss; only suitable for large models with high parameter redundancy (e.g., large language models).
- Offline INT8 -- Moderate accuracy loss controlled by calibration quality; KL-divergence calibration with 100-1000 images typically recovers most accuracy.
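Why block-wise quantization mitigates 4-bit degradation can be shown with a toy tensor: with a single per-tensor scale, one block of large weights forces small weights to quantize to zero, while per-block scales preserve them. The block size and scheme below are illustrative, not MNN's exact implementation.

```python
# Sketch: per-tensor vs block-wise 4-bit quantization on a tensor with outliers.

def quant_dequant(vals, bits):
    """Quantize then dequantize with one shared scale (round-trip error demo)."""
    qmax = 2 ** (bits - 1) - 1
    m = max(abs(v) for v in vals)
    scale = m / qmax if m else 1.0
    return [round(v / scale) * scale for v in vals]

def blockwise_quant_dequant(vals, bits, block=4):
    """Each block of `block` values gets its own scale, containing outliers."""
    out = []
    for i in range(0, len(vals), block):
        out.extend(quant_dequant(vals[i:i + block], bits))
    return out

w = [0.01, -0.02, 0.015, 0.03,   # small weights
     8.0, -7.5, 6.0, -5.0]       # outlier block dominates the per-tensor scale

per_tensor = quant_dequant(w, bits=4)          # small weights collapse to 0.0
per_block = blockwise_quant_dequant(w, bits=4) # small weights survive
```

With one scale of 8.0/7, every value below ~0.57 rounds to zero, destroying the first block entirely; a per-block scale keeps its round-trip error below 0.5% of range. HQQ attacks the same problem from a different angle, by optimizing the scale and zero-point against the reconstruction error instead of using the plain max-abs rule shown here.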
### Complexity vs. Quality
- Low complexity -- Weight quantization and FP16 are one-command operations with no data requirements.
- Medium complexity -- Offline INT8 requires preparing a calibration dataset and a JSON configuration file specifying preprocessing parameters.
- Higher complexity -- auto_quant.py requires a structured test directory and iteratively evaluates different quantization configurations, but it automates the search for optimal parameters.
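As a concrete reference point for the medium-complexity path, the offline INT8 configuration is a small JSON file describing calibration images and preprocessing. The key names below follow MNN's quantization tool documentation, but treat them as assumptions to check against your MNN version rather than a verified schema:

```python
# Sketch of an offline INT8 calibration config. Field names are assumed from
# MNN's quantization tool docs -- verify against your MNN release.
import json

config = {
    "format": "RGB",                         # input channel order
    "mean": [127.5, 127.5, 127.5],           # per-channel mean subtraction
    "normal": [0.00784, 0.00784, 0.00784],   # scale applied after mean
    "width": 224,
    "height": 224,
    "path": "calibration_images/",           # 100-1000 representative samples
    "used_image_num": 500,
    "feature_quantize_method": "KL",         # KL-divergence activation calibration
    "weight_quantize_method": "MAX_ABS",
}

with open("quant_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

The preprocessing fields must reproduce exactly what the model saw at training time; a mismatched mean/normal pair is a common cause of unexpectedly large INT8 accuracy loss.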
## Decision Framework
The following decision logic guides strategy selection:
- Size-only reduction, no data available -- Use weight quantization 8-bit. This is the simplest approach with reliable accuracy.
- Size reduction + inference speed -- Use offline INT8 quantization (if calibration data is available) or weight quantization + dynamic quantization (if no calibration data, but an MNN_LOW_MEMORY build is acceptable).
- Accuracy-critical deployment -- Use auto_quant.py to automatically find the best per-layer configuration that stays within the error budget.
- GPU deployment -- Use FP16 storage (--fp16) combined with the Precision_Low runtime setting for hardware-accelerated half-precision inference.
- Maximum compression with acceptable accuracy loss -- Combine weight quantization 4-bit with HQQ and block-wise quantization for aggressive size reduction with error mitigation.
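The decision logic above can be condensed into a single lookup, which is convenient for documenting the policy in build scripts. Purely illustrative: the function name and parameters are hypothetical, and the returned strategy strings simply mirror this document.

```python
# Hypothetical helper encoding this principle's decision framework.

def select_strategy(need_speed, has_calibration_data, accuracy_critical,
                    gpu_target, low_memory_build_ok=False):
    """Return the compression strategy this principle recommends."""
    if accuracy_critical:
        return "auto_quant.py (per-layer search within error budget)"
    if gpu_target:
        return "FP16 storage (--fp16) + Precision_Low runtime"
    if need_speed:
        if has_calibration_data:
            return "offline INT8 quantization"
        if low_memory_build_ok:
            return "weight quant + dynamic quantization (MNN_LOW_MEMORY)"
    return "weight quantization 8-bit"
```

Note the precedence: accuracy-critical deployments override the speed paths, matching the ordering of the bullets above, and the no-data, no-speed-requirement default falls through to simple 8-bit weight quantization.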
## Relationship to Other Principles
- Compression_Tool_Setup -- Tools must be built before any strategy can be applied.
- Weight_Quantization -- Implements the weight quantization branch of this decision framework.
- Dynamic_Quantization -- Implements the speed acceleration path for weight-quantized models.
- Compression_Validation -- Validates the chosen strategy and enables automated strategy search.