Implementation: Alibaba MNN MNNConvert Weight Quant
| Field | Value |
|---|---|
| Implementation Name | MNNConvert_Weight_Quant |
| Type | API Doc |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | MNNConvert command-line interface for weight quantization with configurable bit-width, block size, and HQQ support |
| Source File(s) | tools/converter/source/common/cli.cpp:L190-506 |
| Last Updated | 2026-02-10 14:00 GMT |
API Signature
MNNConvert --modelFile <float.mnn> --MNNModel <quant.mnn> --weightQuantBits <2-8> [--hqq] [--weightQuantBlock <size>] [--weightQuantAsymmetric] [--fp16] [--compressionParamsFile <path>]
When converting from a source format (ONNX, TF, Caffe, TFLite), include the format flag:
MNNConvert -f ONNX --modelFile model.onnx --MNNModel quant.mnn --weightQuantBits 8
Source Definition
Flag definitions from tools/converter/source/common/cli.cpp (lines 209-236):
("weightQuantBits",
"save conv/matmul/LSTM float weights to int8 type, only optimize for model size, 2-8 bits, default: 0, which means no weight quant",
cxxopts::value<int>())
("weightQuantAsymmetric",
"the default weight-quant uses SYMMETRIC quant method, which is compatible with old MNN versions. "
"you can try set --weightQuantAsymmetric to use asymmetric quant method to improve accuracy of the weight-quant model in some cases, "
"but asymmetric quant model cannot run on old MNN versions.",
cxxopts::value<bool>())
("weightQuantBlock",
"using block-wise weight quant, set block size, defaut: -1, which means channel-wise weight quant",
cxxopts::value<int>())
("hqq",
"using hqq quant method to improve accuracy, default: false, if use hqq, weightQuantAsymmetric is set as true")
("compressionParamsFile",
"The path of the compression parameters that stores activation, "
"weight scales and zero points for quantization or information for sparsity.",
cxxopts::value<std::string>())
("fp16",
"save Conv's weight/bias in half_float data type")
Parsing logic from tools/converter/source/common/cli.cpp (lines 474-504):
if (result.count("fp16")) {
modelPath.saveHalfFloat = true;
}
if (result.count("weightQuantAsymmetric")) {
modelPath.weightQuantAsymmetric = result["weightQuantAsymmetric"].as<bool>();
}
if (result.count("hqq")) {
if(modelPath.weightQuantAsymmetric) {
modelPath.useHQQ = true;
} else {
std::cout << "Warning, MNN Convert only support Hqq with weight asymmetric quant! Disable Hqq currently" << std::endl;
}
}
if (result.count("weightQuantBits")) {
modelPath.weightQuantBits = result["weightQuantBits"].as<int>();
}
if (result.count("weightQuantBlock")) {
modelPath.weightQuantBlock = result["weightQuantBlock"].as<int>();
}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --modelFile | string | (required) | Path to the input model file (MNN, ONNX, TF, Caffe, or TFLite format). |
| --MNNModel | string | (required) | Path for the output compressed MNN model. |
| --weightQuantBits | int | 0 (disabled) | Target bit-width for weight quantization. Valid range: 2-8; 0 disables weight quantization. Applies to Conv, MatMul, and LSTM weight parameters. |
| --hqq | flag | false | Enables the Half-Quadratic Quantization (HQQ) method for improved accuracy. Requires asymmetric quantization: if --weightQuantAsymmetric is not set, HQQ is disabled and a warning is printed. |
| --weightQuantBlock | int | -1 (channel-wise) | Block size for block-wise weight quantization. A value of -1 uses channel-wise quantization (one scale per output channel). Recommended values: 32, 64, 128, 256. Smaller blocks improve accuracy at the cost of a slightly larger model (see the sketch after this table). |
| --weightQuantAsymmetric | bool | false | Enables asymmetric quantization (computes both a scale and a zero point). Improves accuracy for non-symmetric weight distributions. Note: asymmetric models require a newer MNN runtime and are not backward-compatible. |
| --fp16 | flag | false | Stores Conv weights and biases in half-precision (FP16) format. Roughly halves model size with near-zero accuracy loss. Can be used independently of or in combination with weight quantization. |
| --compressionParamsFile | string | (none) | Path to a JSON file containing per-layer quantization parameters (scales, zero points, bit-widths, block sizes). If the file does not exist, it is created from the chosen quantization options. Used by auto_quant.py for fine-grained per-layer control. |
| -f | string | (none) | Source model format: ONNX, TF, CAFFE, TFLITE, or MNN. Required when converting from a non-MNN format. |
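The bit-width, symmetry, and block-size options map onto standard affine weight quantization. The sketch below is illustrative only, not MNN source: the function and variable names are invented, and it shows how a per-group scale and zero point would be derived for an N-bit asymmetric scheme, where a group is one output channel in channel-wise mode (weightQuantBlock = -1) or weightQuantBlock consecutive weights in block-wise mode.

```cpp
// Illustrative sketch only -- not MNN source. Shows the affine quantization
// math implied by --weightQuantBits / --weightQuantAsymmetric / --weightQuantBlock.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

struct QuantGroup {
    float scale;             // real-valued step size for this group
    float zeroPoint;         // 0 for symmetric quant; nonzero for asymmetric
    std::vector<int8_t> q;   // quantized weights, stored at <= 8 bits each
};

// Quantize one group of float weights to `bits` (2-8) with an asymmetric scheme:
// q = round(w / scale) + zeroPoint, clamped to [qmin, qmax].
QuantGroup quantizeGroup(const std::vector<float>& w, int bits) {
    const float qmin = -(1 << (bits - 1));      // e.g. -128 for 8 bits
    const float qmax = (1 << (bits - 1)) - 1;   // e.g.  127 for 8 bits
    float wmin = *std::min_element(w.begin(), w.end());
    float wmax = *std::max_element(w.begin(), w.end());
    QuantGroup g;
    g.scale = (wmax - wmin) / (qmax - qmin);
    if (g.scale == 0.f) g.scale = 1.f;          // constant group: avoid divide-by-zero
    g.zeroPoint = std::round(qmin - wmin / g.scale);
    for (float v : w) {
        float q = std::round(v / g.scale) + g.zeroPoint;
        g.q.push_back(static_cast<int8_t>(std::max(qmin, std::min(qmax, q))));
    }
    return g;
}

int main() {
    // One output channel of 8 weights; with --weightQuantBlock 4 it is split
    // into two groups of 4, each with its own scale/zero point.
    std::vector<float> channel = {-0.31f, 0.12f, 0.05f, -0.07f, 0.44f, 0.02f, -0.15f, 0.29f};
    const int bits  = 4;   // --weightQuantBits 4
    const int block = 4;   // --weightQuantBlock 4 (use channel.size() for channel-wise)
    for (size_t start = 0; start < channel.size(); start += block) {
        std::vector<float> group(channel.begin() + start, channel.begin() + start + block);
        QuantGroup g = quantizeGroup(group, bits);
        std::cout << "group@" << start << " scale=" << g.scale
                  << " zeroPoint=" << g.zeroPoint << "\n";
    }
    return 0;
}
```

Smaller blocks mean more scale/zero-point pairs to store, which is why block-wise quantization trades a slightly larger file for better accuracy. For rough size intuition: one million float32 Conv weights occupy about 4 MB; at --weightQuantBits 8 the quantized weights take about 1 MB, and at 4 bits about 0.5 MB, plus the per-channel or per-block parameters.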
Inputs
- Float model -- An uncompressed model in MNN format or a supported source format (ONNX, TensorFlow, Caffe, TFLite).
- (Optional) Compression params JSON -- A JSON file specifying per-layer quantization configurations for fine-grained control.
Outputs
- Compressed MNN model -- The output model file with quantized weights. Weight tensors are stored at the specified bit-width; non-weight tensors remain unchanged.
- (Optional) Compression params JSON -- If --compressionParamsFile is specified and the file does not exist, it is generated with the quantization parameters used.
Usage Examples
Basic 8-bit Weight Quantization
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 8
HQQ with Block-Wise Quantization
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_hqq.mnn \
--weightQuantBits 8 --hqq --weightQuantAsymmetric 1 --weightQuantBlock 128
4-bit Quantization for LLMs
./MNNConvert --modelFile float.mnn --MNNModel quant_4bit.mnn \
--weightQuantBits 4 --hqq --weightQuantAsymmetric 1 --weightQuantBlock 64
FP16 Storage Compression
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_fp16.mnn --fp16
Re-Quantize with Custom Compression Params
./MNNConvert -f MNN --modelFile float.mnn --MNNModel auto_quant.mnn \
--compressionParamsFile auto_quant.mnn.json --hqq