Workflow:Alibaba MNN Model Compression

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Optimization, Model_Deployment
Last Updated 2026-02-10 08:00 GMT

Overview

End-to-end process for compressing MNN models using post-training techniques including weight quantization (2-8 bit), FP16 storage, automatic quantization tuning, and offline INT8 quantization with calibration data.

Description

This workflow covers MNN's post-training model compression pipeline, which requires no retraining. It includes four main compression strategies: weight quantization (reducing float32 weights to 2-8 bit integers for up to 87% size reduction), FP16 compression (50% size reduction with minimal accuracy loss), automatic quantization tuning (per-operator bit selection using test data), and offline quantization (full INT8 graph inference with calibration data for both speed and size improvement). The workflow also covers enabling dynamic quantization at runtime for actual inference acceleration on supported hardware (ARM v8.2 with sdot/smmla instructions).

Key outputs:

  • Compressed MNN model file with reduced size (50-87% reduction depending on method)
  • Optionally: runtime-accelerated inference via dynamic quantization
  • Optionally: full INT8 quantized model via offline quantization with calibration data

Usage

Execute this workflow when you have an MNN model (obtained via the Model Conversion Pipeline) that is too large for your target device's storage or memory constraints, or when you need faster inference and are willing to accept minimal accuracy trade-offs. Weight quantization is recommended as the first approach; offline quantization provides the best speed improvement but requires representative calibration data.

Execution Steps

Step 1: Install compression tools

Obtain the MNNConvert tool and the offline quantization tool (quantized.out) either via pip install MNN (which provides mnnconvert and mnnquant CLI wrappers) or by compiling from source with -DMNN_BUILD_CONVERTER=ON -DMNN_BUILD_QUANTOOLS=ON. The Python package is recommended for experimentation; the compiled binaries are preferred for production pipelines.

Key considerations:

  • pip install MNN provides mnnconvert, mnnquant, and mnn CLI tools
  • Source compilation: cmake .. -DMNN_BUILD_CONVERTER=ON -DMNN_BUILD_QUANTOOLS=ON && make -j8
  • The mnn CLI serves as a unified entry point for all tools
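A quick way to confirm the installation succeeded is to check which of the pip-provided CLI wrappers are on the PATH. This is a minimal sketch, assuming the tool names listed above (mnnconvert, mnnquant, mnn) are the ones the MNN package installs:

```python
import shutil

def available_mnn_tools():
    """Return the MNN CLI tools found on PATH (pip install MNN
    is expected to provide mnnconvert, mnnquant, and mnn)."""
    candidates = ["mnnconvert", "mnnquant", "mnn"]
    return [t for t in candidates if shutil.which(t) is not None]

print(available_mnn_tools())
```

If the list is empty, fall back to the source build with the converter and quant-tools flags enabled.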

Step 2: Choose compression strategy

Select the appropriate compression method based on requirements. Weight quantization (--weightQuantBits) is the simplest and most broadly applicable, requiring no calibration data. FP16 (--fp16) provides near-lossless compression but only halves the size. Automatic quantization (auto_quant.py) optimizes per-operator bit selection using test data. Offline quantization (quantized.out / mnnquant) enables full INT8 inference but requires representative calibration images.

Key considerations:

  • Weight quantization at 8 bits is nearly lossless with 4x size reduction
  • Weight quantization at 4 bits provides 8x reduction with minor accuracy impact
  • FP16 is safe for all models but only provides 2x reduction
  • HQQ algorithm (--hqq) improves quantization quality at the cost of longer conversion time
  • Block quantization (--weightQuantBlock 128 or 32) improves accuracy at slight size cost
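The size trade-offs above can be estimated with simple arithmetic: each weight shrinks from 32 bits to the target bit width, and block quantization adds one scale and one bias per block. A rough estimator (the 4-byte scale/bias assumption is illustrative, not MNN's exact on-disk layout):

```python
def estimated_size_mb(num_params, bits, block=None, overhead_bytes=4):
    """Rough compressed size: num_params weights at `bits` bits each,
    plus one scale and one bias per quantization block when block
    quantization is used."""
    size = num_params * bits / 8
    if block:
        num_blocks = -(-num_params // block)      # ceiling division
        size += num_blocks * 2 * overhead_bytes   # scale + bias per block
    return size / 1e6

fp32 = estimated_size_mb(25_000_000, 32)            # ~100 MB baseline
int8 = estimated_size_mb(25_000_000, 8)             # ~25 MB, the 4x reduction
int4_blk = estimated_size_mb(25_000_000, 4, block=128)
print(fp32, int8, int4_blk)
```

This also shows why smaller blocks (32 vs. 128) cost a little extra size: more blocks means more scale/bias pairs.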

Step 3: Apply weight quantization

Run MNNConvert with the --weightQuantBits flag to quantize conv/matmul/LSTM float32 weights to the specified bit width (2-8). Optionally apply block-wise quantization with --weightQuantBlock for improved precision, and the HQQ algorithm with --hqq for further accuracy improvement. Use --saveExternalData to separate weights into a .mnn.weight file for reduced peak memory during loading.

What happens:

  • Float32 weights in convolution, matrix multiplication, and LSTM operators are linearly mapped to the target bit width
  • Each quantization block produces a scale and bias for dequantization
  • Smaller block sizes increase the number of scale/bias pairs, improving accuracy but slightly increasing model size
  • The HQQ algorithm uses half-quadratic quantization for optimal weight mapping
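The linear mapping with a per-block scale and bias can be sketched in a few lines. This is an illustrative asymmetric linear quantizer, not MNN's exact implementation; it shows why the reconstruction error stays within one quantization step per block:

```python
import numpy as np

def quantize_block(w, bits=4):
    """Asymmetric linear quantization of one weight block: map floats
    onto [0, 2**bits - 1] with a scale and a bias (the block minimum)."""
    qmax = 2 ** bits - 1
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.round((w - lo) / scale).astype(np.uint8)
    return q, scale, lo                 # `lo` acts as the bias

def dequantize_block(q, scale, bias):
    return q.astype(np.float32) * scale + bias

rng = np.random.default_rng(0)
w = rng.normal(size=128).astype(np.float32)   # one 128-weight block
q, scale, bias = quantize_block(w, bits=4)
w_hat = dequantize_block(q, scale, bias)
print("max reconstruction error:", np.abs(w - w_hat).max())
```

Halving the block size halves the value range each scale must cover, which is why block quantization improves accuracy at a small size cost.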

Step 4: Enable dynamic quantization at runtime

For actual inference speedup (not just size reduction), compile MNN with -DMNN_LOW_MEMORY=ON and configure the runtime with Memory_Low mode. This enables dynamic quantization at inference time, where the weight-quantized model performs computation in INT8 rather than dequantizing back to float32 first. This provides 1-2x speedup on hardware with sdot/smmla support (ARM v8.2+).

Key considerations:

  • Requires -DMNN_LOW_MEMORY=ON at compile time
  • Runtime must set memory mode to "low" (Memory_Low in C++, memory=2 in Python)
  • Only effective for 4-bit and 8-bit weight-quantized models
  • Provides best acceleration on ARM v8.2+ devices with sdot/smmla instructions
  • May introduce small accuracy differences compared to float32 inference
  • Can be combined with FP16 precision for additional non-convolution acceleration
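In Python, the low-memory mode is selected through the runtime configuration. A minimal sketch, assuming MNN was compiled with -DMNN_LOW_MEMORY=ON; the exact config keys may vary between MNN versions, so the actual MNN calls are shown only as comments:

```python
# memory=2 corresponds to the Memory_Low mode mentioned above,
# which makes weight-quantized models compute in INT8 at runtime.
config = {
    "backend": "CPU",
    "precision": "high",
    "memory": 2,          # low-memory mode: enables dynamic quantization
    "numThread": 4,
}

# Typical usage with the MNN package (illustrative, not executed here):
# import MNN.nn as nn
# rt = nn.create_runtime_manager((config,))
# net = nn.load_module_from_file("model_w4.mnn", [], [], runtime_manager=rt)
print(config)
```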

Step 5: Validate compressed model

Compare the inference results of the compressed model against the original float32 model to assess accuracy impact. Use the MNN test tools or custom validation scripts with representative test data. Check model size reduction, memory footprint during inference, and inference latency to verify compression goals are met.

Key considerations:

  • Use the --info flag with MNNConvert to verify model metadata after compression
  • Compare output tensors between original and compressed models on representative inputs
  • Monitor peak memory usage to confirm reduced footprint
  • Benchmark inference speed with and without dynamic quantization enabled
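A simple way to compare output tensors from the two models is to report the maximum absolute error and the cosine similarity. This is a hypothetical helper (compare_outputs is not an MNN API); feed it the output arrays from any two inference runs:

```python
import numpy as np

def compare_outputs(ref, test, name="output"):
    """Compare a compressed model's output against the float32
    reference: max absolute error and cosine similarity."""
    ref = np.asarray(ref, dtype=np.float64).ravel()
    test = np.asarray(test, dtype=np.float64).ravel()
    max_abs = np.abs(ref - test).max()
    cos = ref @ test / (np.linalg.norm(ref) * np.linalg.norm(test) + 1e-12)
    print(f"{name}: max_abs_err={max_abs:.6f} cosine={cos:.6f}")
    return max_abs, cos

# Synthetic tensors standing in for real model outputs:
ref = np.linspace(-1, 1, 1000)
test = ref + np.random.default_rng(1).normal(scale=1e-3, size=1000)
max_abs, cos = compare_outputs(ref, test)
```

A cosine similarity very close to 1.0 across representative inputs is a practical pass criterion for 8-bit weight quantization; larger drops suggest trying block quantization or HQQ.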

GitHub URL

Workflow Repository