Workflow: Alibaba MNN Model Compression
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Optimization, Model_Deployment |
| Last Updated | 2026-02-10 08:00 GMT |
Overview
End-to-end process for compressing MNN models using post-training techniques including weight quantization (2-8 bit), FP16 storage, automatic quantization tuning, and offline INT8 quantization with calibration data.
Description
This workflow covers MNN's post-training model compression pipeline, which requires no retraining. It includes four main compression strategies: weight quantization (reducing float32 weights to 2-8 bit integers for up to 87% size reduction), FP16 compression (50% size reduction with minimal accuracy loss), automatic quantization tuning (per-operator bit selection using test data), and offline quantization (full INT8 graph inference with calibration data for both speed and size improvement). The workflow also covers enabling dynamic quantization at runtime for actual inference acceleration on supported hardware (ARM v8.2 with sdot/smmla instructions).
Key outputs:
- Compressed MNN model file with reduced size (50-87% reduction depending on method)
- Optionally: runtime-accelerated inference via dynamic quantization
- Optionally: full INT8 quantized model via offline quantization with calibration data
Usage
Execute this workflow when you have an MNN model (obtained via the Model Conversion Pipeline) that is too large for your target device's storage or memory constraints, or when you need faster inference and are willing to accept minimal accuracy trade-offs. Weight quantization is recommended as the first approach; offline quantization provides the best speed improvement but requires representative calibration data.
Execution Steps
Step 1: Install compression tools
Obtain the MNNConvert tool and the offline quantization tool (quantized.out) either via pip install MNN (which provides mnnconvert and mnnquant CLI wrappers) or by compiling from source with -DMNN_BUILD_CONVERTER=ON -DMNN_BUILD_QUANTOOLS=ON. The Python package is recommended for experimentation; the compiled binaries are preferred for production pipelines.
Key considerations:
- pip install MNN provides mnnconvert, mnnquant, and mnn CLI tools
- Source compilation: cmake .. -DMNN_BUILD_CONVERTER=ON -DMNN_BUILD_QUANTOOLS=ON && make -j8
- The mnn CLI serves as a unified entry point for all tools
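The two installation paths above can be sketched as follows (repository URL is the official Alibaba MNN repo; directory layout is illustrative):

```shell
# Option A: Python package -- quickest, provides the mnnconvert / mnnquant / mnn CLIs
pip install MNN

# Option B: build from source -- preferred for production pipelines
git clone https://github.com/alibaba/MNN.git
cd MNN && mkdir build && cd build
cmake .. -DMNN_BUILD_CONVERTER=ON -DMNN_BUILD_QUANTOOLS=ON
make -j8   # produces MNNConvert and quantized.out in the build directory
```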
Step 2: Choose compression strategy
Select the appropriate compression method based on requirements. Weight quantization (--weightQuantBits) is the simplest and most broadly applicable, requiring no calibration data. FP16 (--fp16) halves the model size with negligible accuracy loss, but compresses no further. Automatic quantization (auto_quant.py) optimizes per-operator bit selection using test data. Offline quantization (quantized.out / mnnquant) enables full INT8 inference but requires representative calibration images.
Key considerations:
- Weight quantization at 8 bits is nearly lossless with 4x size reduction
- Weight quantization at 4 bits provides 8x reduction with minor accuracy impact
- FP16 is safe for all models but only provides 2x reduction
- HQQ algorithm (--hqq) improves quantization quality at the cost of longer conversion time
- Block quantization (--weightQuantBlock 128 or 32) improves accuracy at slight size cost
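Of the strategies above, offline quantization is the only one driven by a calibration config file rather than converter flags. A minimal sketch of that path, assuming a directory of representative calibration images; the JSON keys follow MNN's quantization config format, but all values (image count, normalization, paths) are illustrative:

```shell
# preprocessConfig.json -- calibration settings for offline INT8 quantization
cat > preprocessConfig.json <<'EOF'
{
    "format": "RGB",
    "mean": [127.5, 127.5, 127.5],
    "normal": [0.00784314, 0.00784314, 0.00784314],
    "width": 224,
    "height": 224,
    "path": "calibration_images/",
    "used_image_num": 100,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS"
}
EOF

# Full INT8 offline quantization (mnnquant from pip, or quantized.out from source)
mnnquant model_fp32.mnn model_int8.mnn preprocessConfig.json
```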
Step 3: Apply weight quantization
Run MNNConvert with the --weightQuantBits flag to quantize conv/matmul/LSTM float32 weights to the specified bit width (2-8). Optionally apply block-wise quantization with --weightQuantBlock for improved precision, and the HQQ algorithm with --hqq for further accuracy improvement. Use --saveExternalData to separate weights into a .mnn.weight file for reduced peak memory during loading.
What happens:
- Float32 weights in convolution, matrix multiplication, and LSTM operators are linearly mapped to the target bit width
- Each quantization block produces a scale and bias for dequantization
- Smaller block sizes increase the number of scale/bias pairs, improving accuracy but slightly increasing model size
- The HQQ algorithm uses half-quadratic quantization for optimal weight mapping
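The flag combinations above can be sketched as the following MNNConvert invocations. The ONNX source format and the file names are assumptions for illustration; substitute your own framework and paths:

```shell
# 8-bit weight quantization: near-lossless, ~4x smaller
MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_w8.mnn \
    --bizCode mnn --weightQuantBits 8

# 4-bit with block-wise quantization and HQQ for better accuracy
MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_w4.mnn \
    --bizCode mnn --weightQuantBits 4 --weightQuantBlock 128 --hqq

# Separate weights into a .mnn.weight file to reduce peak memory while loading
MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_w4.mnn \
    --bizCode mnn --weightQuantBits 4 --saveExternalData
```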
Step 4: Enable dynamic quantization at runtime
For actual inference speedup (not just size reduction), compile MNN with -DMNN_LOW_MEMORY=ON and configure the runtime with Memory_Low mode. This enables dynamic quantization at inference time: the weight-quantized model computes directly in INT8 instead of dequantizing weights back to float32 first, yielding up to a 2x speedup on hardware with sdot/smmla support (ARM v8.2+).
Key considerations:
- Requires -DMNN_LOW_MEMORY=ON at compile time
- Runtime must set memory mode to "low" (Memory_Low in C++, memory=2 in Python)
- Only effective for 4-bit and 8-bit weight-quantized models
- Provides best acceleration on ARM v8.2+ devices with sdot/smmla instructions
- May introduce small accuracy differences compared to float32 inference
- Can be combined with FP16 precision for additional non-convolution acceleration
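A build sketch for the runtime side, assuming a from-source checkout; the MNN_ARM82 option (for sdot/smmla kernels) is an assumption about the target being ARM v8.2+, and the runtime settings are restated from this step as comments:

```shell
# Rebuild the MNN runtime with dynamic-quantization support
cd MNN/build
cmake .. -DMNN_LOW_MEMORY=ON -DMNN_ARM82=ON   # MNN_ARM82: enable ARM v8.2 kernels
make -j8

# At inference time, the application must also select the low-memory mode:
#   C++:    BackendConfig::Memory_Low
#   Python: memory=2 (per this workflow's runtime configuration notes)
```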
Step 5: Validate compressed model
Compare the inference results of the compressed model against the original float32 model to assess accuracy impact. Use the MNN test tools or custom validation scripts with representative test data. Check model size reduction, memory footprint during inference, and inference latency to verify compression goals are met.
Key considerations:
- Use the --info flag with MNNConvert to verify model metadata after compression
- Compare output tensors between original and compressed models on representative inputs
- Monitor peak memory usage to confirm reduced footprint
- Benchmark inference speed with and without dynamic quantization enabled
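The validation checks above can be sketched with the converter's --info flag and MNN's bundled test binaries; the tool names (testModel.out, timeProfile.out) come from MNN's tools build, but their exact argument conventions and the file names here are assumptions to adapt:

```shell
# Inspect metadata of the compressed model (ops, weight precision)
MNNConvert -f MNN --modelFile model_w8.mnn --info

# Compare on-disk sizes against the original float32 model
ls -lh model_fp32.mnn model_w8.mnn

# Run both models on the same recorded input/output pair and compare results
./testModel.out model_fp32.mnn input.txt output.txt
./testModel.out model_w8.mnn input.txt output.txt

# Benchmark per-operator latency of the compressed model
./timeProfile.out model_w8.mnn
```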