Implementation: Alibaba MNN Compression Decision Matrix
| Field | Value |
|---|---|
| Implementation Name | Compression_Decision_Matrix |
| Type | Pattern Doc |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Decision matrix for selecting the optimal MNN model compression strategy |
| Source File(s) | docs/tools/compress.md:L16-24 |
| Last Updated | 2026-02-10 14:00 GMT |
API Signature
N/A (decision step -- no API invocation)
Strategy Comparison Matrix
The following matrix summarizes the four post-training compression strategies available in MNN, derived from the official documentation at docs/tools/compress.md:
| Compression Type | Requires Data | Requires Training | Size Reduction | Inference Speedup | Complexity |
|---|---|---|---|---|---|
| Weight Quantization (2-8 bit) | No | No | 75%-87% | No (default); Yes (with dynamic quantization) | Low |
| FP16 Compression | No | No | 50% | No (storage only) | Low |
| Auto Quant Tuning (4-8 bit) | Yes (test dataset) | No | 75%-87% | No (default); Yes (with dynamic quantization) | Medium |
| Offline Quantization (8-bit INT8) | Yes (calibration images, 100-1000) | No | 75% | Yes | Medium |
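The size-reduction figures in the matrix follow directly from the bit-width ratio against FP32 weight storage. A quick sanity check of that arithmetic (illustrative only, not an MNN tool):

```python
def size_reduction(bits: int, baseline_bits: int = 32) -> float:
    """Fraction of weight storage saved when storing at `bits` per value
    instead of `baseline_bits` (FP32 by default)."""
    return 1 - bits / baseline_bits

print(f"INT8:  {size_reduction(8):.0%}")   # 75%   -- matches the 8-bit rows
print(f"4-bit: {size_reduction(4):.1%}")   # 87.5% -- upper end of the 75%-87% range
print(f"FP16:  {size_reduction(16):.0%}")  # 50%   -- matches FP16 compression
```

Note these ratios cover weight storage only; real on-disk savings vary slightly with model metadata and non-weight tensors.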
Decision Criteria
Scenario 1: Size-Only Reduction
Recommended: Weight Quantization 8-bit
```bash
./MNNConvert --modelFile float.mnn --MNNModel quant.mnn --weightQuantBits 8
```
- When to use: Model is too large for deployment, no calibration data available, no speed requirement.
- Trade-off: ~75% size reduction, minimal accuracy loss, no speed improvement.
Scenario 2: Size + Speed (with calibration data)
Recommended: Offline INT8 Quantization
```bash
./quantized.out float.mnn quant_int8.mnn config.json
```
- When to use: Need both smaller model and faster inference, have 100-1000 representative calibration images.
- Trade-off: ~75% size reduction, significant speed improvement, moderate accuracy impact mitigated by calibration.
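The config.json passed to quantized.out describes the calibration pipeline (image preprocessing and quantization methods). A representative configuration based on the fields documented for MNN's offline quantizer; all values here are illustrative and must be adapted to the target model's input spec:

```json
{
    "format": "RGB",
    "mean": [127.5, 127.5, 127.5],
    "normal": [0.00784314, 0.00784314, 0.00784314],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS"
}
```

The preprocessing fields (mean, normal, width, height) must reproduce exactly the normalization used at training time, or calibration statistics will be skewed.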
Scenario 3: Size + Speed (without calibration data)
Recommended: Weight Quantization + Dynamic Quantization
```bash
# Step 1: Weight-quantize the model
./MNNConvert --modelFile float.mnn --MNNModel quant.mnn --weightQuantBits 8
# Step 2: Build MNN with low-memory support
cmake .. -DMNN_LOW_MEMORY=ON && make -j8
# Step 3: Configure runtime for dynamic dequantization
# BackendConfig.memory = Memory_Low
```
- When to use: No calibration data available, but willing to rebuild MNN with MNN_LOW_MEMORY.
- Trade-off: ~75% size reduction, speed improvement via int8 GEMM kernels, no calibration data needed.
Scenario 4: Accuracy-Critical Deployment
Recommended: auto_quant.py
```bash
python auto_quant.py --model float.mnn --quant_model quant.mnn --test_dir mnntest --rate 0.05
```
- When to use: Quantization causes unacceptable accuracy degradation, need automated per-layer optimization.
- Trade-off: Longer compression time (iterative search), but guarantees accuracy within the specified error rate.
Scenario 5: GPU Deployment
Recommended: FP16 Storage + Precision_Low Runtime
```bash
./MNNConvert --modelFile float.mnn --MNNModel fp16.mnn --fp16
```
```cpp
BackendConfig backendConfig;
backendConfig.precision = BackendConfig::Precision_Low; // runtime FP16 acceleration
config.backendConfig = &backendConfig;
```
- When to use: Deploying on GPU hardware that supports half-precision arithmetic.
- Trade-off: 50% size reduction, potential speed improvement on GPU, near-zero accuracy loss.
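Taken together, the five scenarios reduce to a small decision procedure. The sketch below encodes the matrix as a hypothetical helper; the function name and flag names are illustrative and are not part of any MNN tooling:

```python
def select_strategy(need_speed: bool,
                    has_calibration_data: bool,
                    accuracy_critical: bool,
                    gpu_target: bool,
                    can_rebuild_low_memory: bool = False) -> str:
    """Map deployment constraints to the recommended MNN compression strategy,
    following the scenario ordering of the decision matrix above."""
    if gpu_target:
        return "FP16 storage + Precision_Low runtime"        # Scenario 5
    if accuracy_critical:
        return "auto_quant.py tuning"                        # Scenario 4
    if need_speed and has_calibration_data:
        return "Offline INT8 quantization (quantized.out)"   # Scenario 2
    if need_speed and can_rebuild_low_memory:
        return "Weight quantization + dynamic quantization"  # Scenario 3
    return "Weight quantization 8-bit (MNNConvert)"          # Scenario 1
```

For example, a CPU deployment with no calibration data and no latency budget falls through to Scenario 1 (weight quantization).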
Inputs
- Float MNN model -- The uncompressed model in MNN format (or a source format model to be converted).
- Accuracy requirements -- Maximum tolerable error rate for the target application.
- Target device constraints -- CPU vs. GPU, available memory, latency budget, whether an MNN_LOW_MEMORY rebuild is feasible.
Outputs
- Selected compression strategy -- One of: weight quantization, FP16 compression, offline INT8 quantization, or auto-quant tuning.
- Tool and flag selection -- The specific binary (MNNConvert or quantized.out) and the command-line flags to execute.