Implementation: Alibaba MNN Compression Decision Matrix
| Field | Value |
|---|---|
| Implementation Name | Compression_Decision_Matrix |
| Type | Pattern Doc |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Decision matrix for selecting the optimal MNN model compression strategy |
| Source File(s) | docs/tools/compress.md:L16-24 |
| Last Updated | 2026-02-10 14:00 GMT |
API Signature
N/A (decision step -- no API invocation)
Strategy Comparison Matrix
The following matrix summarizes the four post-training compression strategies available in MNN, derived from the official documentation at docs/tools/compress.md:
| Compression Type | Requires Data | Requires Training | Size Reduction | Inference Speedup | Complexity |
|---|---|---|---|---|---|
| Weight Quantization (2-8 bit) | No | No | 75%-87% | No (default); Yes (with dynamic quantization) | Low |
| FP16 Compression | No | No | 50% | No (storage only) | Low |
| Auto Quant Tuning (4-8 bit) | Yes (test dataset) | No | 75%-87% | No (default); Yes (with dynamic quantization) | Medium |
| Offline Quantization (8-bit INT8) | Yes (calibration images, 100-1000) | No | 75% | Yes | Medium |
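The size-reduction figures in the matrix follow directly from the bit-width ratio against FP32 weight storage. A quick sanity check of that arithmetic (illustrative only, not an MNN tool):

```python
def size_reduction(bits: int, baseline_bits: int = 32) -> float:
    """Fraction of weight storage saved when storing at `bits` per value
    instead of `baseline_bits` (FP32 by default)."""
    return 1 - bits / baseline_bits

print(f"INT8:  {size_reduction(8):.0%}")   # 75%   -- matches the 8-bit rows
print(f"4-bit: {size_reduction(4):.1%}")   # 87.5% -- upper end of the 75%-87% range
print(f"FP16:  {size_reduction(16):.0%}")  # 50%   -- matches FP16 compression
```

Note these ratios cover weight storage only; real on-disk savings vary slightly with model metadata and non-weight tensors.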
Decision Criteria
Scenario 1: Size-Only Reduction
Recommended: Weight Quantization 8-bit
```bash
./MNNConvert --modelFile float.mnn --MNNModel quant.mnn --weightQuantBits 8
```
- When to use: Model is too large for deployment, no calibration data available, no speed requirement.
- Trade-off: ~75% size reduction, minimal accuracy loss, no speed improvement.
Scenario 2: Size + Speed (with calibration data)
Recommended: Offline INT8 Quantization
```bash
./quantized.out float.mnn quant_int8.mnn config.json
```
- When to use: Need both smaller model and faster inference, have 100-1000 representative calibration images.
- Trade-off: ~75% size reduction, significant speed improvement, moderate accuracy impact mitigated by calibration.
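The config.json passed to quantized.out describes the calibration pipeline (image preprocessing and quantization methods). A representative configuration based on the fields documented for MNN's offline quantizer; all values here are illustrative and must be adapted to the target model's input spec:

```json
{
    "format": "RGB",
    "mean": [127.5, 127.5, 127.5],
    "normal": [0.00784314, 0.00784314, 0.00784314],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS"
}
```

The preprocessing fields (mean, normal, width, height) must reproduce exactly the normalization used at training time, or calibration statistics will be skewed.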
Scenario 3: Size + Speed (without calibration data)
Recommended: Weight Quantization + Dynamic Quantization
```bash
# Step 1: Weight-quantize the model
./MNNConvert --modelFile float.mnn --MNNModel quant.mnn --weightQuantBits 8
# Step 2: Build MNN with low-memory support
cmake .. -DMNN_LOW_MEMORY=ON && make -j8
# Step 3: Configure runtime for dynamic dequantization
# BackendConfig.memory = Memory_Low
```
- When to use: No calibration data available, but willing to rebuild MNN with MNN_LOW_MEMORY.
- Trade-off: ~75% size reduction, speed improvement via int8 GEMM kernels, no calibration data needed.
Scenario 4: Accuracy-Critical Deployment
Recommended: auto_quant.py
```bash
python auto_quant.py --model float.mnn --quant_model quant.mnn --test_dir mnntest --rate 0.05
```
- When to use: Quantization causes unacceptable accuracy degradation, need automated per-layer optimization.
- Trade-off: Longer compression time (iterative search), but guarantees accuracy within the specified error rate.
Scenario 5: GPU Deployment
Recommended: FP16 Storage + Precision_Low Runtime
```bash
./MNNConvert --modelFile float.mnn --MNNModel fp16.mnn --fp16
```
```cpp
BackendConfig backendConfig;
backendConfig.precision = BackendConfig::Precision_Low; // runtime FP16 acceleration
config.backendConfig = &backendConfig;
```
- When to use: Deploying on GPU hardware that supports half-precision arithmetic.
- Trade-off: 50% size reduction, potential speed improvement on GPU, near-zero accuracy loss.
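Taken together, the five scenarios reduce to a small decision procedure. The sketch below encodes the matrix as a hypothetical helper; the function name and flag names are illustrative and are not part of any MNN tooling:

```python
def select_strategy(need_speed: bool,
                    has_calibration_data: bool,
                    accuracy_critical: bool,
                    gpu_target: bool,
                    can_rebuild_low_memory: bool = False) -> str:
    """Map deployment constraints to the recommended MNN compression strategy,
    following the scenario ordering of the decision matrix above."""
    if gpu_target:
        return "FP16 storage + Precision_Low runtime"        # Scenario 5
    if accuracy_critical:
        return "auto_quant.py tuning"                        # Scenario 4
    if need_speed and has_calibration_data:
        return "Offline INT8 quantization (quantized.out)"   # Scenario 2
    if need_speed and can_rebuild_low_memory:
        return "Weight quantization + dynamic quantization"  # Scenario 3
    return "Weight quantization 8-bit (MNNConvert)"          # Scenario 1
```

For example, a CPU deployment with no calibration data and no latency budget falls through to Scenario 1 (weight quantization).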
Inputs
- Float MNN model -- The uncompressed model in MNN format (or a source format model to be converted).
- Accuracy requirements -- Maximum tolerable error rate for the target application.
- Target device constraints -- CPU vs. GPU, available memory, latency budget, whether an MNN_LOW_MEMORY rebuild is feasible.
Outputs
- Selected compression strategy -- One of: weight quantization, FP16 compression, offline INT8 quantization, or auto-quant tuning.
- Tool and flag selection -- The specific binary (MNNConvert or quantized.out) and the command-line flags to execute.