Implementation: Alibaba MNN MNNConvert Weight Quant
| Field | Value |
|---|---|
| Implementation Name | MNNConvert_Weight_Quant |
| Type | API Doc |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | MNNConvert command-line interface for weight quantization with configurable bit-width, block size, and HQQ support |
| Source File(s) | tools/converter/source/common/cli.cpp:L190-506 |
| Last Updated | 2026-02-10 14:00 GMT |
API Signature
MNNConvert --modelFile <float.mnn> --MNNModel <quant.mnn> --weightQuantBits <2-8> [--hqq] [--weightQuantBlock <size>] [--weightQuantAsymmetric] [--fp16] [--compressionParamsFile <path>]
When converting from a source format (ONNX, TF, Caffe, TFLite), include the format flag:
MNNConvert -f ONNX --modelFile model.onnx --MNNModel quant.mnn --weightQuantBits 8
Source Definition
Flag definitions from tools/converter/source/common/cli.cpp (lines 209-236):
("weightQuantBits",
"save conv/matmul/LSTM float weights to int8 type, only optimize for model size, 2-8 bits, default: 0, which means no weight quant",
cxxopts::value<int>())
("weightQuantAsymmetric",
"the default weight-quant uses SYMMETRIC quant method, which is compatible with old MNN versions. "
"you can try set --weightQuantAsymmetric to use asymmetric quant method to improve accuracy of the weight-quant model in some cases, "
"but asymmetric quant model cannot run on old MNN versions.",
cxxopts::value<bool>())
("weightQuantBlock",
"using block-wise weight quant, set block size, defaut: -1, which means channel-wise weight quant",
cxxopts::value<int>())
("hqq",
"using hqq quant method to improve accuracy, default: false, if use hqq, weightQuantAsymmetric is set as true")
("compressionParamsFile",
"The path of the compression parameters that stores activation, "
"weight scales and zero points for quantization or information for sparsity.",
cxxopts::value<std::string>())
("fp16",
"save Conv's weight/bias in half_float data type")
Parsing logic from tools/converter/source/common/cli.cpp (lines 474-504):
if (result.count("fp16")) {
modelPath.saveHalfFloat = true;
}
if (result.count("weightQuantAsymmetric")) {
modelPath.weightQuantAsymmetric = result["weightQuantAsymmetric"].as<bool>();
}
if (result.count("hqq")) {
if(modelPath.weightQuantAsymmetric) {
modelPath.useHQQ = true;
} else {
std::cout << "Warning, MNN Convert only support Hqq with weight asymmetric quant! Disable Hqq currently" << std::endl;
}
}
if (result.count("weightQuantBits")) {
modelPath.weightQuantBits = result["weightQuantBits"].as<int>();
}
if (result.count("weightQuantBlock")) {
modelPath.weightQuantBlock = result["weightQuantBlock"].as<int>();
}
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --modelFile | string | (required) | Path to the input model file (MNN, ONNX, TF, Caffe, or TFLite format). |
| --MNNModel | string | (required) | Path for the output compressed MNN model. |
| --weightQuantBits | int | 0 (disabled) | Target bit-width for weight quantization. Valid range: 2-8; 0 disables weight quantization. Applies to Conv, MatMul, and LSTM weight parameters. |
| --hqq | flag | false | Enables the Half-Quadratic Quantization (HQQ) method for improved accuracy. Requires asymmetric quantization: if --weightQuantAsymmetric is not set, HQQ is disabled and a warning is printed. |
| --weightQuantBlock | int | -1 (channel-wise) | Block size for block-wise weight quantization. A value of -1 uses channel-wise quantization (one scale per output channel). Recommended values: 32, 64, 128, 256. Smaller blocks improve accuracy at the cost of a slightly larger model (see the sketch after this table). |
| --weightQuantAsymmetric | bool | false | Enables asymmetric quantization (computes both a scale and a zero point). Improves accuracy for non-symmetric weight distributions. Note: asymmetric models require a newer MNN runtime and are not backward-compatible. |
| --fp16 | flag | false | Stores Conv weights and biases in half-precision (FP16) format. Roughly halves model size with near-zero accuracy loss. Can be used independently of or in combination with weight quantization. |
| --compressionParamsFile | string | (none) | Path to a JSON file containing per-layer quantization parameters (scales, zero points, bit-widths, block sizes). If the file does not exist, it is created from the chosen quantization options. Used by auto_quant.py for fine-grained per-layer control. |
| -f | string | (none) | Source model format: ONNX, TF, CAFFE, TFLITE, or MNN. Required when converting from a non-MNN format. |
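The bit-width, symmetry, and block-size options map onto standard affine weight quantization. The sketch below is illustrative only, not MNN source: the function and variable names are invented, and it shows how a per-group scale and zero point would be derived for an N-bit asymmetric scheme, where a group is one output channel in channel-wise mode (weightQuantBlock = -1) or weightQuantBlock consecutive weights in block-wise mode.

```cpp
// Illustrative sketch only -- not MNN source. Shows the affine quantization
// math implied by --weightQuantBits / --weightQuantAsymmetric / --weightQuantBlock.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iostream>
#include <vector>

struct QuantGroup {
    float scale;             // real-valued step size for this group
    float zeroPoint;         // 0 for symmetric quant; nonzero for asymmetric
    std::vector<int8_t> q;   // quantized weights, stored at <= 8 bits each
};

// Quantize one group of float weights to `bits` (2-8) with an asymmetric scheme:
// q = round(w / scale) + zeroPoint, clamped to [qmin, qmax].
QuantGroup quantizeGroup(const std::vector<float>& w, int bits) {
    const float qmin = -(1 << (bits - 1));      // e.g. -128 for 8 bits
    const float qmax = (1 << (bits - 1)) - 1;   // e.g.  127 for 8 bits
    float wmin = *std::min_element(w.begin(), w.end());
    float wmax = *std::max_element(w.begin(), w.end());
    QuantGroup g;
    g.scale = (wmax - wmin) / (qmax - qmin);
    if (g.scale == 0.f) g.scale = 1.f;          // constant group: avoid divide-by-zero
    g.zeroPoint = std::round(qmin - wmin / g.scale);
    for (float v : w) {
        float q = std::round(v / g.scale) + g.zeroPoint;
        g.q.push_back(static_cast<int8_t>(std::max(qmin, std::min(qmax, q))));
    }
    return g;
}

int main() {
    // One output channel of 8 weights; with --weightQuantBlock 4 it is split
    // into two groups of 4, each with its own scale/zero point.
    std::vector<float> channel = {-0.31f, 0.12f, 0.05f, -0.07f, 0.44f, 0.02f, -0.15f, 0.29f};
    const int bits  = 4;   // --weightQuantBits 4
    const int block = 4;   // --weightQuantBlock 4 (use channel.size() for channel-wise)
    for (size_t start = 0; start < channel.size(); start += block) {
        std::vector<float> group(channel.begin() + start, channel.begin() + start + block);
        QuantGroup g = quantizeGroup(group, bits);
        std::cout << "group@" << start << " scale=" << g.scale
                  << " zeroPoint=" << g.zeroPoint << "\n";
    }
    return 0;
}
```

Smaller blocks mean more scale/zero-point pairs to store, which is why block-wise quantization trades a slightly larger file for better accuracy. For rough size intuition: one million float32 Conv weights occupy about 4 MB; at --weightQuantBits 8 the quantized weights take about 1 MB, and at 4 bits about 0.5 MB, plus the per-channel or per-block parameters.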
Inputs
- Float model -- An uncompressed model in MNN format or a supported source format (ONNX, TensorFlow, Caffe, TFLite).
- (Optional) Compression params JSON -- A JSON file specifying per-layer quantization configurations for fine-grained control.
Outputs
- Compressed MNN model -- The output model file with quantized weights. Weight tensors are stored at the specified bit-width; non-weight tensors remain unchanged.
- (Optional) Compression params JSON -- If --compressionParamsFile is specified and the file does not exist, it is generated with the quantization parameters used.
Usage Examples
Basic 8-bit Weight Quantization
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_quant.mnn --weightQuantBits 8
HQQ with Block-Wise Quantization
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_hqq.mnn \
--weightQuantBits 8 --hqq --weightQuantAsymmetric 1 --weightQuantBlock 128
4-bit Quantization for LLMs
./MNNConvert --modelFile float.mnn --MNNModel quant_4bit.mnn \
--weightQuantBits 4 --hqq --weightQuantAsymmetric 1 --weightQuantBlock 64
FP16 Storage Compression
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel model_fp16.mnn --fp16
Re-Quantize with Custom Compression Params
./MNNConvert -f MNN --modelFile float.mnn --MNNModel auto_quant.mnn \
--compressionParamsFile auto_quant.mnn.json --hqq