Implementation:Alibaba MNN Auto Quant Validation
| Field | Value |
|---|---|
| Implementation Name | Auto_Quant_Validation |
| Type | API Doc |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Automated quantization parameter search and offline INT8 calibration tools for compression validation |
| Source File(s) | tools/converter/tools/auto_quant.py:L132-225, tools/quantization/calibration.cpp:L1661, tools/quantization/quantized.cpp:L14-16 |
| Last Updated | 2026-02-10 14:00 GMT |
API Signatures
Automated Weight Quantization Tuning
```bash
python auto_quant.py --model <float.mnn> --quant_model <quant.mnn> --test_dir <dir> \
    --rate 0.05 [--select_bits 1] [--select_block 1] [--hqq 1]
```
Offline INT8 Quantization
```bash
./quantized.out <float.mnn> <quant_int8.mnn> <config.json>
```
Source Definitions
auto_quant.py: mainFunction
From tools/converter/tools/auto_quant.py (lines 216-225):
```python
def mainFunction():
    parser = argparse.ArgumentParser(description='llm_exporter', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('--model', type=str, required=True, help='src float mnn model')
    parser.add_argument('--quant_model', type=str, required=True, help='dst quant mnn model')
    parser.add_argument('--test_dir', type=str, required=True, help='test dir')
    parser.add_argument('--rate', type=float, default=0.05, help='test rate')
    parser.add_argument('--select_bits', type=int, default=1, help='Try set layer as 4 bits')
    parser.add_argument('--select_block', type=int, default=1, help='Try select blocks')
    parser.add_argument('--hqq', type=int, default=1, help='Use HQQ method')
    args = parser.parse_args()
```
auto_quant.py: findBestBits
From tools/converter/tools/auto_quant.py (lines 132-154):
```python
def findBestBits(info, test, targetRate):
    # Baseline: all layers at 8 bits with block size 64
    info.setBlock(64)
    info.update()
    rate = test.test()
    if rate > targetRate:
        return rate
    length = info.mutableSize()
    tested = False
    for i in range(length):
        # Greedily try dropping each layer to 4 bits
        info.setBits(i, 4)
        info.update()
        rate = test.test()
        tested = True
        if rate > targetRate:
            # roll back to 8
            info.setBits(i, 8)
            tested = False
        else:
            print('Set %d layer to 4 bits' %i, ', rate=%f' %rate)
    if not tested:
        # Last attempt was rolled back; re-measure the final configuration
        info.update()
        rate = test.test()
    return rate
```
auto_quant.py: findBestBlock
From tools/converter/tools/auto_quant.py (lines 156-197):
```python
def findBestBlock(info, test, targetRate):
    validBlock = 0
    bestBlock = 256
    bestRate = 1.0
    rate = 1.0
    # Try progressively smaller block sizes until the error target is met
    for block in (256, 128, 64, 32):
        info.setBlock(block)
        info.update()
        rate = test.test()
        print('block=%d,' %block + ' rate=%f' %rate)
        if rate < bestRate:
            bestRate = rate
            bestBlock = block
        if rate < targetRate:
            validBlock = block
            break
    # ... binary search for mixed block configuration
```
quantized.out: quant_main
From tools/quantization/calibration.cpp (lines 1661-1670):
```cpp
int quant_main(int argc, const char* argv[]) {
    if (argc < 4) {
        DLOG(INFO) << "Usage: ./quantized.out src.mnn dst.mnn preTreatConfig.json\n";
        return 0;
    }
    const char* modelFile      = argv[1];
    const char* preTreatConfig = argv[3];
    const char* dstFile        = argv[2];
    // ... creates Calibration instance and runs quantization
```
quantized.cpp Entry Point
From tools/quantization/quantized.cpp (lines 14-16):
```cpp
int main(int argc, const char* argv[]) {
    return quant_main(argc, argv);
}
```
Parameters
auto_quant.py Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model` | string | (required) | Path to the source float MNN model. |
| `--quant_model` | string | (required) | Path for the output quantized MNN model. |
| `--test_dir` | string | (required) | Path to the test directory containing `input.json`, input data files, and expected output files for accuracy validation. |
| `--rate` | float | 0.05 | Maximum allowed relative error rate. The auto-quant search ensures the compressed model stays below this threshold; 0.05 means at most 5% relative error. |
| `--select_bits` | int | 1 | When set to 1, enables the bit-width selection phase that tries setting individual layers to 4-bit quantization. Set to 0 to skip. |
| `--select_block` | int | 1 | When set to 1, enables the block-size optimization phase that searches over block sizes (256, 128, 64, 32). Set to 0 to skip. |
| `--hqq` | int | 1 | When set to a value greater than 0, enables the HQQ (Half-Quadratic Quantization) method for improved accuracy during the search. Set to 0 to disable. |
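The quantity that `--rate` bounds is the relative error reported by the test harness (the `test.test()` calls in the search loops shown above). The exact formula lives inside auto_quant.py; the snippet below is only a minimal sketch of one plausible formulation, assuming flattened NumPy output tensors. The function name `relative_error` and the choice of normalization are illustrative assumptions, not taken from the source.

```python
import numpy as np

def relative_error(quant_out: np.ndarray, ref_out: np.ndarray) -> float:
    # Illustrative metric (assumption, not the source formula): total absolute
    # deviation of the quantized output, normalized by the reference magnitude.
    denom = np.sum(np.abs(ref_out)) + 1e-9  # guard against all-zero references
    return float(np.sum(np.abs(quant_out - ref_out)) / denom)

# Under this reading, --rate 0.05 requires relative_error(...) <= 0.05
# for the final quantized model.
```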
quantized.out / Calibration JSON Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `feature_quantize_method` | string | "KL" | Method for computing feature (activation) quantization scales. Options: KL (KL-divergence, needs 100-1000 images), ADMM (optimization-based, needs one batch), EMA (exponential moving average, supports asymmetric quantization). |
| `weight_quantize_method` | string | "MAX_ABS" | Method for weight quantization. Options: MAX_ABS (symmetric, using the maximum absolute value), ADMM (optimization-based). |
| `quant_bits` | int | 8 | Number of quantization bits for the INT8 model. |
| `path` | string | (required) | Directory containing calibration images. |
| `used_image_num` | int | all images in `path` | Number of calibration images to use. |
| `format` | string | (required) | Image format: "RGB", "BGR", "RGBA", or "GRAY". |
| `mean` | float[] | (required) | Per-channel mean values for normalization: dst = (src - mean) * normal. |
| `normal` | float[] | (required) | Per-channel scale values for normalization. |
| `width` | int | (required) | Model input width. |
| `height` | int | (required) | Model input height. |
| `feature_clamp_value` | int | 127 | Feature quantization range [-v, v]. Reduce to mitigate overflow errors. |
| `weight_clamp_value` | int | 127 | Weight quantization range. Adjusting `feature_clamp_value` is generally preferred. |
| `batch_size` | int | 32 | Batch size for the EMA calibration method. Should match the training batch size. |
| `skip_quant_op_names` | string[] | [] | Op names to exclude from quantization (e.g., the first convolution layer). |
| `debug` | bool | false | When true, outputs per-layer cosine distance and overflow rate between the float and quantized models. |
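As a worked example of the normalization formula, with mean 103.94 and normal 0.017 a raw pixel value of 255 maps to (255 - 103.94) * 0.017 ≈ 2.57. A config that switches calibration to the EMA method and enables per-layer debugging could look like the sketch below; the image directory and the skipped op name `conv1_first` are placeholders, not values from the source.

```json
{
    "format": "RGB",
    "mean": [103.94, 116.78, 123.68],
    "normal": [0.017, 0.017, 0.017],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 256,
    "feature_quantize_method": "EMA",
    "weight_quantize_method": "MAX_ABS",
    "batch_size": 32,
    "skip_quant_op_names": ["conv1_first"],
    "debug": true
}
```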
Inputs
For auto_quant.py
- Float MNN model -- The uncompressed source model.
- Test directory -- A directory containing the following files (an illustrative `input.json` sketch follows this list):
  - `input.json` -- Describes input/output tensor names and shapes.
  - `input0.txt`, `input1.txt`, ... -- Input tensor data files.
  - `output.txt` -- Expected output tensor data.
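The exact schema of `input.json` is defined by auto_quant.py's test harness; the sketch below only illustrates the kind of information it carries (tensor names, shapes, and which outputs to compare). The field names `inputs`, `shape`, and `outputs` are hypothetical, not confirmed by the source.

```json
{
    "inputs": [
        { "name": "input0", "shape": [1, 3, 224, 224] }
    ],
    "outputs": ["output"]
}
```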
For quantized.out
- Float MNN model -- The uncompressed source model.
- Calibration JSON -- A JSON configuration file specifying preprocessing parameters and calibration settings.
- Calibration images -- 100-1000 representative images in the directory specified by the JSON `path` field.
Outputs
From auto_quant.py
- Optimized quantized model (`quant_model`) -- The weight-quantized MNN model with per-layer optimized bit-widths and block sizes.
- Compression params JSON (`quant_model.json`) -- A JSON file recording the quantization configuration for each layer, including bit-width, block size, and asymmetric flag. This file can be manually edited and re-applied with `MNNConvert --compressionParamsFile`.
- Compression report -- Printed to stdout, showing the final error rate and size reduction (original MB to compressed MB).
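The layout of `quant_model.json` is whatever auto_quant.py emits; the snippet below is purely hypothetical and only illustrates the kind of per-layer record described above (layer name, bit-width, block size, asymmetric flag). None of these field names are confirmed by the source.

```json
{
    "layers": [
        { "name": "conv_0",   "bits": 8, "block": 128, "asymmetric": true },
        { "name": "linear_1", "bits": 4, "block": 64,  "asymmetric": true }
    ]
}
```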
From quantized.out
- INT8 quantized model -- A fully quantized MNN model where both weights and activations use INT8 arithmetic.
Usage Examples
Auto-Quant Workflow
```bash
# Step 1: Convert source model to float MNN
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel float.mnn

# Step 2: Prepare test directory
# mnntest/
#   input.json
#   input0.txt
#   output.txt

# Step 3: Run auto-quant search
python tools/converter/tools/auto_quant.py \
    --model float.mnn \
    --quant_model auto_quant.mnn \
    --test_dir mnntest \
    --rate 0.05 \
    --hqq 1

# Step 4 (optional): Re-apply with manual adjustments
./MNNConvert -f MNN --modelFile float.mnn --MNNModel final.mnn \
    --compressionParamsFile auto_quant.mnn.json --hqq
```
Offline INT8 Quantization Workflow
```bash
# Step 1: Convert source model to float MNN
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel float.mnn

# Step 2: Create calibration config
cat > quant_config.json << 'ENDJSON'
{
    "format": "RGB",
    "mean": [103.94, 116.78, 123.68],
    "normal": [0.017, 0.017, 0.017],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS"
}
ENDJSON

# Step 3: Run offline quantization
./quantized.out float.mnn quant_int8.mnn quant_config.json
```