
Implementation:Alibaba MNN Compression Decision Matrix

From Leeroopedia


  • Implementation Name: Compression_Decision_Matrix
  • Type: Pattern Doc
  • Topic: Model_Compression
  • Workflow: Model_Compression
  • Description: Decision matrix for selecting the optimal MNN model compression strategy
  • Source File(s): docs/tools/compress.md:L16-24
  • Last Updated: 2026-02-10 14:00 GMT

API Signature

N/A (decision step -- no API invocation)

Strategy Comparison Matrix

The following matrix summarizes the four post-training compression strategies available in MNN, derived from the official documentation at docs/tools/compress.md:

Compression Type                  | Requires Data                      | Requires Training | Size Reduction | Inference Speedup                             | Complexity
Weight Quantization (2-8 bit)     | No                                 | No                | 75%-87%        | No (default); Yes (with dynamic quantization) | Low
FP16 Compression                  | No                                 | No                | 50%            | No (storage only)                             | Low
Auto Quant Tuning (4-8 bit)       | Yes (test dataset)                 | No                | 75%-87%        | No (default); Yes (with dynamic quantization) | Medium
Offline Quantization (8-bit INT8) | Yes (calibration images, 100-1000) | No                | 75%            | Yes                                           | Medium
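To make the size-reduction column concrete, here is a back-of-envelope calculation; the 100 MB FP32 starting size is purely illustrative:

```shell
# Back-of-envelope size estimates for a hypothetical 100 MB FP32 model,
# using the reduction figures from the matrix above.
fp32_mb=100
int8_mb=$((fp32_mb * 25 / 100))   # 8-bit weight quantization: ~75% smaller
fp16_mb=$((fp32_mb * 50 / 100))   # FP16 compression: 50% smaller
low_mb=$((fp32_mb * 13 / 100))    # 2-4 bit weight quantization: up to ~87% smaller
echo "int8: ${int8_mb} MB, fp16: ${fp16_mb} MB, low-bit: ${low_mb} MB"
```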

Decision Criteria

Scenario 1: Size-Only Reduction

Recommended: Weight Quantization 8-bit

./MNNConvert --modelFile float.mnn --MNNModel quant.mnn --weightQuantBits 8
  • When to use: Model is too large for deployment, no calibration data available, no speed requirement.
  • Trade-off: ~75% size reduction, minimal accuracy loss, no speed improvement.

Scenario 2: Size + Speed (with calibration data)

Recommended: Offline INT8 Quantization

./quantized.out float.mnn quant_int8.mnn config.json
  • When to use: Need both smaller model and faster inference, have 100-1000 representative calibration images.
  • Trade-off: ~75% size reduction, significant speed improvement, moderate accuracy impact mitigated by calibration.
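The config.json passed to quantized.out describes the calibration set and the quantization methods. A minimal sketch, assuming a local images/ directory of calibration pictures; the key names follow MNN's quantization tool documentation, but every value here is illustrative and must be adapted to the target model:

```shell
# Write an illustrative quantized.out config; paths and values are placeholders.
cat > config.json <<'EOF'
{
    "format": "RGB",
    "mean": [127.5, 127.5, 127.5],
    "normal": [0.00784314, 0.00784314, 0.00784314],
    "width": 224,
    "height": 224,
    "path": "images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX"
}
EOF
```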

Scenario 3: Size + Speed (without calibration data)

Recommended: Weight Quantization + Dynamic Quantization

# Step 1: Weight-quantize the model
./MNNConvert --modelFile float.mnn --MNNModel quant.mnn --weightQuantBits 8

# Step 2: Build MNN with low-memory support
cmake .. -DMNN_LOW_MEMORY=ON && make -j8

# Step 3: Configure runtime for dynamic dequantization
# BackendConfig.memory = Memory_Low
  • When to use: No calibration data, but willing to rebuild MNN with MNN_LOW_MEMORY.
  • Trade-off: ~75% size reduction, speed improvement via int8 GEMM kernels, no calibration data needed.

Scenario 4: Accuracy-Critical Deployment

Recommended: auto_quant.py

python auto_quant.py --model float.mnn --quant_model quant.mnn --test_dir mnntest --rate 0.05
  • When to use: Quantization causes unacceptable accuracy degradation, need automated per-layer optimization.
  • Trade-off: Longer compression time (iterative search), in exchange for keeping accuracy degradation within the specified error rate (--rate 0.05, i.e. 5%).
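The iterative search can be pictured as trying bit widths from high to low and keeping the lowest one whose measured error still fits the --rate budget. This is a conceptual sketch only, with a made-up error value standing in for a real test-set evaluation; it is not auto_quant.py's actual algorithm:

```shell
# Conceptual sketch: keep the lowest bit width whose error fits the budget.
rate_pct=5                      # --rate 0.05 expressed as a percentage
best=8
for bits in 8 7 6 5 4; do
  err=$((9 - bits))             # placeholder for a real accuracy evaluation
  if [ "$err" -le "$rate_pct" ]; then
    best=$bits                  # still within budget; try fewer bits
  else
    break                       # budget exceeded; stop searching
  fi
done
echo "selected bit width: $best"
```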

Scenario 5: GPU Deployment

Recommended: FP16 Storage + Precision_Low Runtime

./MNNConvert --modelFile float.mnn --MNNModel fp16.mnn --fp16
BackendConfig backendConfig;
backendConfig.precision = BackendConfig::Precision_Low;  // runtime FP16 acceleration
config.backendConfig = &backendConfig;
  • When to use: Deploying on GPU hardware that supports half-precision arithmetic.
  • Trade-off: 50% size reduction, potential speed improvement on GPU, near-zero accuracy loss.

Inputs

  • Float MNN model -- The uncompressed model in MNN format (or a source format model to be converted).
  • Accuracy requirements -- Maximum tolerable error rate for the target application.
  • Target device constraints -- CPU vs. GPU, available memory, latency budget, whether MNN_LOW_MEMORY rebuild is feasible.

Outputs

  • Selected compression strategy -- One of: weight quantization, FP16 compression, offline INT8 quantization, or auto-quant tuning.
  • Tool and flag selection -- The specific binary (MNNConvert or quantized.out) and command-line flags to execute.
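The five scenarios above can be collapsed into a small selection helper. A sketch under stated assumptions: the four inputs (target device, calibration-data availability, speed requirement, accuracy criticality) and the function name are hypothetical, and the strategy strings mirror the matrix rather than any MNN API:

```shell
# Hypothetical helper encoding the decision matrix; names are illustrative.
pick_strategy() {
  target=$1             # cpu | gpu
  have_calib=$2         # yes | no  (100-1000 calibration images available?)
  need_speed=$3         # yes | no
  accuracy_critical=$4  # yes | no
  if [ "$target" = "gpu" ]; then
    echo "FP16 storage + Precision_Low runtime"          # Scenario 5
  elif [ "$accuracy_critical" = "yes" ]; then
    echo "auto_quant.py tuning"                          # Scenario 4
  elif [ "$need_speed" = "no" ]; then
    echo "weight quantization (MNNConvert --weightQuantBits 8)"   # Scenario 1
  elif [ "$have_calib" = "yes" ]; then
    echo "offline INT8 quantization (quantized.out)"     # Scenario 2
  else
    echo "weight quantization + dynamic quantization (MNN_LOW_MEMORY build)"  # Scenario 3
  fi
}

pick_strategy cpu yes yes no   # prints the Scenario 2 strategy
pick_strategy cpu no yes no    # prints the Scenario 3 strategy
```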
