
Implementation:Alibaba MNN MNNConvert Weight Quant

From Leeroopedia


Field Value
Implementation Name MNNConvert_Weight_Quant
Type API Doc
Topic Model_Compression
Workflow Model_Compression
Description MNNConvert command-line interface for weight quantization with configurable bit-width, block size, and HQQ support
Source File(s) tools/converter/source/common/cli.cpp:L190-506
Last Updated 2026-02-10 14:00 GMT

API Signature

MNNConvert --modelFile <float.mnn> --MNNModel <quant.mnn> --weightQuantBits <2-8> [--hqq] [--weightQuantBlock <size>] [--weightQuantAsymmetric] [--fp16] [--compressionParamsFile <path>]

When converting from a source format (ONNX, TF, Caffe, TFLite), include the format flag:

MNNConvert -f ONNX --modelFile model.onnx --MNNModel quant.mnn --weightQuantBits 8

Source Definition

Flag definitions from tools/converter/source/common/cli.cpp (lines 209-236):

("weightQuantBits",
 "save conv/matmul/LSTM float weights to int8 type, only optimize for model size, 2-8 bits, default: 0, which means no weight quant",
 cxxopts::value<int>())
("weightQuantAsymmetric",
 "the default weight-quant uses SYMMETRIC quant method, which is compatible with old MNN versions. "
 "you can try set --weightQuantAsymmetric to use asymmetric quant method to improve accuracy of the weight-quant model in some cases, "
 "but asymmetric quant model cannot run on old MNN versions.",
 cxxopts::value<bool>())
("weightQuantBlock",
 "using block-wise weight quant, set block size, defaut: -1, which means channel-wise weight quant",
 cxxopts::value<int>())
("hqq",
 "using hqq quant method to improve accuracy, default: false, if use hqq, weightQuantAsymmetric is set as true")
("compressionParamsFile",
 "The path of the compression parameters that stores activation, "
 "weight scales and zero points for quantization or information for sparsity.",
 cxxopts::value<std::string>())
("fp16",
 "save Conv's weight/bias in half_float data type")

Parsing logic from tools/converter/source/common/cli.cpp (lines 474-504):

if (result.count("fp16")) {
    modelPath.saveHalfFloat = true;
}
if (result.count("weightQuantAsymmetric")) {
    modelPath.weightQuantAsymmetric = result["weightQuantAsymmetric"].as<bool>();
}
if (result.count("hqq")) {
    if(modelPath.weightQuantAsymmetric) {
        modelPath.useHQQ = true;
    } else {
        std::cout << "Warning, MNN Convert only support Hqq with weight asymmetric quant! Disable Hqq currently" << std::endl;
    }
}
if (result.count("weightQuantBits")) {
    modelPath.weightQuantBits = result["weightQuantBits"].as<int>();
}
if (result.count("weightQuantBlock")) {
    modelPath.weightQuantBlock = result["weightQuantBlock"].as<int>();
}
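
The gating between --hqq and --weightQuantAsymmetric in the parsing logic above is easy to miss. The following is a minimal Python sketch (not part of MNN) that mirrors the same resolution order, so the interaction can be checked at a glance:

```python
def resolve_quant_flags(weight_quant_bits=0, weight_quant_asymmetric=False,
                        hqq=False, weight_quant_block=-1, fp16=False):
    """Mirror the option-resolution order of cli.cpp (lines 474-504).

    Key behavior: HQQ is honored only when asymmetric quantization
    was also requested; otherwise it is disabled with a warning.
    """
    opts = {
        "saveHalfFloat": fp16,
        "weightQuantAsymmetric": weight_quant_asymmetric,
        "useHQQ": False,
        "weightQuantBits": weight_quant_bits,
        "weightQuantBlock": weight_quant_block,
    }
    if hqq:
        if opts["weightQuantAsymmetric"]:
            opts["useHQQ"] = True
        else:
            # Matches the warning branch in cli.cpp
            print("Warning: HQQ requires asymmetric weight quant; "
                  "disabling HQQ")
    return opts
```

For example, `resolve_quant_flags(weight_quant_bits=8, hqq=True)` leaves `useHQQ` false, which is why the usage examples below always pair --hqq with --weightQuantAsymmetric.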

Parameters

Parameter Type Default Description
--modelFile string (required) Path to the input model file (MNN, ONNX, TF, Caffe, or TFLite format).
--MNNModel string (required) Path for the output compressed MNN model.
--weightQuantBits int 0 (disabled) Target bit-width for weight quantization. Valid range: 2-8. Setting to 0 disables weight quantization. Applies to Conv, MatMul, and LSTM weight parameters.
--hqq flag false Enables Half-Quadratic Quantization (HQQ) method for improved accuracy. Requires asymmetric quantization -- if --weightQuantAsymmetric is not set, HQQ is disabled and a warning is printed.
--weightQuantBlock int -1 (channel-wise) Block size for block-wise weight quantization. Value of -1 uses channel-wise quantization (one scale per output channel). Recommended values: 32, 64, 128, 256. Smaller blocks improve accuracy at the cost of slightly larger model size.
--weightQuantAsymmetric bool false Enables asymmetric quantization (computes both scale and zero point). Improves accuracy for non-symmetric weight distributions. Note: Asymmetric models require a newer MNN runtime and are not backward-compatible.
--fp16 flag false Stores Conv weight and bias in half-precision (FP16) format. Provides 50% size reduction with near-zero accuracy loss. Can be used independently of or in combination with weight quantization.
--compressionParamsFile string (none) Path to a JSON file containing per-layer quantization parameters (scales, zero points, bit-widths, block sizes). If the file does not exist, it is created based on the quantization options. Used by auto_quant.py for fine-grained per-layer control.
-f string (none) Source model format: ONNX, TF, CAFFE, TFLITE, or MNN. Required when converting from a non-MNN format.
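
To make the size trade-off between channel-wise and block-wise quantization concrete, here is a rough, illustrative size model in Python. It assumes packed weights at the chosen bit-width plus one fp32 scale (and, if asymmetric, one fp32 zero point) per quantization group; this is an estimate for intuition, not MNN's exact on-disk layout.

```python
def quantized_weight_bytes(out_channels, weights_per_channel, bits,
                           block_size=-1, asymmetric=False):
    """Estimate stored bytes for one quantized weight tensor.

    Group = one output channel when block_size == -1 (channel-wise),
    else each block of `block_size` weights within a channel.
    Each group stores an fp32 scale, plus an fp32 zero point if
    asymmetric. Illustrative only; MNN's actual layout may differ.
    """
    n_weights = out_channels * weights_per_channel
    packed = n_weights * bits / 8          # packed weight payload
    if block_size == -1:
        groups = out_channels              # one scale per channel
    else:
        # ceil-divide: blocks per channel, times channels
        groups = out_channels * -(-weights_per_channel // block_size)
    per_group = 4 + (4 if asymmetric else 0)
    return packed + groups * per_group
```

Under this model, smaller blocks add more per-group metadata, which is why the table above notes that smaller block sizes trade a slightly larger model for better accuracy.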

Inputs

  • Float model -- An uncompressed model in MNN format or a supported source format (ONNX, TensorFlow, Caffe, TFLite).
  • (Optional) Compression params JSON -- A JSON file specifying per-layer quantization configurations for fine-grained control.

Outputs

  • Compressed MNN model -- The output model file with quantized weights. Weight tensors are stored at the specified bit-width; non-weight tensors remain unchanged.
  • (Optional) Compression params JSON -- If --compressionParamsFile is specified and the file does not exist, it is generated with the quantization parameters used.

Usage Examples

Basic 8-bit Weight Quantization

./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 8

HQQ with Block-Wise Quantization

./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_hqq.mnn \
    --weightQuantBits 8 --hqq --weightQuantAsymmetric --weightQuantBlock 128

4-bit Quantization for LLMs

./MNNConvert --modelFile float.mnn --MNNModel quant_4bit.mnn \
    --weightQuantBits 4 --hqq --weightQuantAsymmetric --weightQuantBlock 64

FP16 Storage Compression

./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_fp16.mnn --fp16

Re-Quantize with Custom Compression Params

./MNNConvert -f MNN --modelFile float.mnn --MNNModel auto_quant.mnn \
    --compressionParamsFile auto_quant.mnn.json --hqq --weightQuantAsymmetric
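
When driving these conversions from a script, it can help to assemble the argv list programmatically. Below is a small Python helper (an illustration, not part of MNN tooling; it assumes an `MNNConvert` binary on PATH) that builds a command line from the flags documented above and applies the HQQ-implies-asymmetric convention from the usage examples:

```python
def build_mnnconvert_cmd(model_file, out_file, bits=8, fmt=None,
                         hqq=False, asymmetric=False, block=-1, fp16=False):
    """Build an MNNConvert argv list from the documented flags.

    `fmt` is the source format for -f (e.g. "ONNX"); omit for MNN input.
    Note: --hqq only takes effect alongside --weightQuantAsymmetric.
    """
    cmd = ["MNNConvert"]
    if fmt:
        cmd += ["-f", fmt]
    cmd += ["--modelFile", model_file, "--MNNModel", out_file]
    if bits:
        cmd += ["--weightQuantBits", str(bits)]
    if asymmetric:
        cmd.append("--weightQuantAsymmetric")
    if hqq:
        cmd.append("--hqq")
    if block != -1:
        cmd += ["--weightQuantBlock", str(block)]
    if fp16:
        cmd.append("--fp16")
    return cmd
```

For example, the 4-bit LLM recipe above corresponds to `build_mnnconvert_cmd("float.mnn", "quant_4bit.mnn", bits=4, hqq=True, asymmetric=True, block=64)`, and the resulting list can be passed to `subprocess.run`.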
