Implementation:Alibaba MNN Auto Quant Validation
| Field | Value |
|---|---|
| Implementation Name | Auto_Quant_Validation |
| Type | API Doc |
| Topic | Model_Compression |
| Workflow | Model_Compression |
| Description | Automated quantization parameter search and offline INT8 calibration tools for compression validation |
| Source File(s) | tools/converter/tools/auto_quant.py:L132-225, tools/quantization/calibration.cpp:L1661, tools/quantization/quantized.cpp:L14-16 |
| Last Updated | 2026-02-10 14:00 GMT |
API Signatures
Automated Weight Quantization Tuning
```bash
python auto_quant.py --model <float.mnn> --quant_model <quant.mnn> --test_dir <dir> \
    --rate 0.05 [--select_bits 1] [--select_block 1] [--hqq 1]
```
Offline INT8 Quantization
```bash
./quantized.out <float.mnn> <quant_int8.mnn> <config.json>
```
Source Definitions
auto_quant.py: mainFunction
From tools/converter/tools/auto_quant.py (lines 216-225):
```python
def mainFunction():
    parser = argparse.ArgumentParser(description='llm_exporter', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('--model', type=str, required=True, help='src float mnn model')
    parser.add_argument('--quant_model', type=str, required=True, help='dst quant mnn model')
    parser.add_argument('--test_dir', type=str, required=True, help='test dir')
    parser.add_argument('--rate', type=float, default=0.05, help='test rate')
    parser.add_argument('--select_bits', type=int, default=1, help='Try set layer as 4 bits')
    parser.add_argument('--select_block', type=int, default=1, help='Try select blocks')
    parser.add_argument('--hqq', type=int, default=1, help='Use HQQ method')
    args = parser.parse_args()
```
auto_quant.py: findBestBits
From tools/converter/tools/auto_quant.py (lines 132-154):
```python
def findBestBits(info, test, targetRate):
    # Baseline: all layers at 8 bits with block size 64
    info.setBlock(64)
    info.update()
    rate = test.test()
    if rate > targetRate:
        return rate
    length = info.mutableSize()
    tested = False
    for i in range(length):
        # Greedily try dropping each layer to 4 bits
        info.setBits(i, 4)
        info.update()
        rate = test.test()
        tested = True
        if rate > targetRate:
            # roll back to 8
            info.setBits(i, 8)
            tested = False
        else:
            print('Set %d layer to 4 bits' %i, ', rate=%f' %rate)
    if not tested:
        # Last attempt was rolled back; re-measure the final configuration
        info.update()
        rate = test.test()
    return rate
```
auto_quant.py: findBestBlock
From tools/converter/tools/auto_quant.py (lines 156-197):
```python
def findBestBlock(info, test, targetRate):
    validBlock = 0
    bestBlock = 256
    bestRate = 1.0
    rate = 1.0
    # Try progressively smaller block sizes until the error target is met
    for block in (256, 128, 64, 32):
        info.setBlock(block)
        info.update()
        rate = test.test()
        print('block=%d,' %block + ' rate=%f' %rate)
        if rate < bestRate:
            bestRate = rate
            bestBlock = block
        if rate < targetRate:
            validBlock = block
            break
    # ... binary search for mixed block configuration
```
quantized.out: quant_main
From tools/quantization/calibration.cpp (lines 1661-1670):
```cpp
int quant_main(int argc, const char* argv[]) {
    if (argc < 4) {
        DLOG(INFO) << "Usage: ./quantized.out src.mnn dst.mnn preTreatConfig.json\n";
        return 0;
    }
    const char* modelFile      = argv[1];
    const char* preTreatConfig = argv[3];
    const char* dstFile        = argv[2];
    // ... creates Calibration instance and runs quantization
```
quantized.cpp Entry Point
From tools/quantization/quantized.cpp (lines 14-16):
```cpp
int main(int argc, const char* argv[]) {
    return quant_main(argc, argv);
}
```
Parameters
auto_quant.py Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model` | string | (required) | Path to the source float MNN model. |
| `--quant_model` | string | (required) | Path for the output quantized MNN model. |
| `--test_dir` | string | (required) | Path to the test directory containing `input.json`, input data files, and expected output files for accuracy validation. |
| `--rate` | float | 0.05 | Maximum allowed relative error rate. The auto-quant search ensures the compressed model stays below this threshold; 0.05 means at most 5% relative error. |
| `--select_bits` | int | 1 | When set to 1, enables the bit-width selection phase that tries setting individual layers to 4-bit quantization. Set to 0 to skip. |
| `--select_block` | int | 1 | When set to 1, enables the block-size optimization phase that searches over block sizes (256, 128, 64, 32). Set to 0 to skip. |
| `--hqq` | int | 1 | When set to a value greater than 0, enables the HQQ (Half-Quadratic Quantization) method for improved accuracy during the search. Set to 0 to disable. |
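The quantity that `--rate` bounds is the relative error reported by the test harness (the `test.test()` calls in the search loops shown above). The exact formula lives inside auto_quant.py; the snippet below is only a minimal sketch of one plausible formulation, assuming flattened NumPy output tensors. The function name `relative_error` and the choice of normalization are illustrative assumptions, not taken from the source.

```python
import numpy as np

def relative_error(quant_out: np.ndarray, ref_out: np.ndarray) -> float:
    # Illustrative metric (assumption, not the source formula): total absolute
    # deviation of the quantized output, normalized by the reference magnitude.
    denom = np.sum(np.abs(ref_out)) + 1e-9  # guard against all-zero references
    return float(np.sum(np.abs(quant_out - ref_out)) / denom)

# Under this reading, --rate 0.05 requires relative_error(...) <= 0.05
# for the final quantized model.
```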
quantized.out / Calibration JSON Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `feature_quantize_method` | string | "KL" | Method for computing feature (activation) quantization scales. Options: KL (KL-divergence, needs 100-1000 images), ADMM (optimization-based, needs one batch), EMA (exponential moving average, supports asymmetric quantization). |
| `weight_quantize_method` | string | "MAX_ABS" | Method for weight quantization. Options: MAX_ABS (symmetric, using the maximum absolute value), ADMM (optimization-based). |
| `quant_bits` | int | 8 | Number of quantization bits for the INT8 model. |
| `path` | string | (required) | Directory containing calibration images. |
| `used_image_num` | int | all images in `path` | Number of calibration images to use. |
| `format` | string | (required) | Image format: "RGB", "BGR", "RGBA", or "GRAY". |
| `mean` | float[] | (required) | Per-channel mean values for normalization: dst = (src - mean) * normal. |
| `normal` | float[] | (required) | Per-channel scale values for normalization. |
| `width` | int | (required) | Model input width. |
| `height` | int | (required) | Model input height. |
| `feature_clamp_value` | int | 127 | Feature quantization range [-v, v]. Reduce to mitigate overflow errors. |
| `weight_clamp_value` | int | 127 | Weight quantization range. Adjusting `feature_clamp_value` is generally preferred. |
| `batch_size` | int | 32 | Batch size for the EMA calibration method. Should match the training batch size. |
| `skip_quant_op_names` | string[] | [] | Op names to exclude from quantization (e.g., the first convolution layer). |
| `debug` | bool | false | When true, outputs per-layer cosine distance and overflow rate between the float and quantized models. |
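As a worked example of the normalization formula, with mean 103.94 and normal 0.017 a raw pixel value of 255 maps to (255 - 103.94) * 0.017 ≈ 2.57. A config that switches calibration to the EMA method and enables per-layer debugging could look like the sketch below; the image directory and the skipped op name `conv1_first` are placeholders, not values from the source.

```json
{
    "format": "RGB",
    "mean": [103.94, 116.78, 123.68],
    "normal": [0.017, 0.017, 0.017],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 256,
    "feature_quantize_method": "EMA",
    "weight_quantize_method": "MAX_ABS",
    "batch_size": 32,
    "skip_quant_op_names": ["conv1_first"],
    "debug": true
}
```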
Inputs
For auto_quant.py
- Float MNN model -- The uncompressed source model.
- Test directory -- A directory containing the following files (an illustrative `input.json` sketch follows this list):
  - `input.json` -- Describes input/output tensor names and shapes.
  - `input0.txt`, `input1.txt`, ... -- Input tensor data files.
  - `output.txt` -- Expected output tensor data.
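The exact schema of `input.json` is defined by auto_quant.py's test harness; the sketch below only illustrates the kind of information it carries (tensor names, shapes, and which outputs to compare). The field names `inputs`, `shape`, and `outputs` are hypothetical, not confirmed by the source.

```json
{
    "inputs": [
        { "name": "input0", "shape": [1, 3, 224, 224] }
    ],
    "outputs": ["output"]
}
```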
For quantized.out
- Float MNN model -- The uncompressed source model.
- Calibration JSON -- A JSON configuration file specifying preprocessing parameters and calibration settings.
- Calibration images -- 100-1000 representative images in the directory specified by the JSON `path` field.
Outputs
From auto_quant.py
- Optimized quantized model (`quant_model`) -- The weight-quantized MNN model with per-layer optimized bit-widths and block sizes.
- Compression params JSON (`quant_model.json`) -- A JSON file recording the quantization configuration for each layer, including bit-width, block size, and asymmetric flag. This file can be manually edited and re-applied with `MNNConvert --compressionParamsFile`.
- Compression report -- Printed to stdout, showing the final error rate and size reduction (original MB to compressed MB).
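The layout of `quant_model.json` is whatever auto_quant.py emits; the snippet below is purely hypothetical and only illustrates the kind of per-layer record described above (layer name, bit-width, block size, asymmetric flag). None of these field names are confirmed by the source.

```json
{
    "layers": [
        { "name": "conv_0",   "bits": 8, "block": 128, "asymmetric": true },
        { "name": "linear_1", "bits": 4, "block": 64,  "asymmetric": true }
    ]
}
```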
From quantized.out
- INT8 quantized model -- A fully quantized MNN model where both weights and activations use INT8 arithmetic.
Usage Examples
Auto-Quant Workflow
```bash
# Step 1: Convert source model to float MNN
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel float.mnn

# Step 2: Prepare test directory
# mnntest/
#   input.json
#   input0.txt
#   output.txt

# Step 3: Run auto-quant search
python tools/converter/tools/auto_quant.py \
    --model float.mnn \
    --quant_model auto_quant.mnn \
    --test_dir mnntest \
    --rate 0.05 \
    --hqq 1

# Step 4 (optional): Re-apply with manual adjustments
./MNNConvert -f MNN --modelFile float.mnn --MNNModel final.mnn \
    --compressionParamsFile auto_quant.mnn.json --hqq
```
Offline INT8 Quantization Workflow
```bash
# Step 1: Convert source model to float MNN
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel float.mnn

# Step 2: Create calibration config
cat > quant_config.json << 'ENDJSON'
{
    "format": "RGB",
    "mean": [103.94, 116.78, 123.68],
    "normal": [0.017, 0.017, 0.017],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS"
}
ENDJSON

# Step 3: Run offline quantization
./quantized.out float.mnn quant_int8.mnn quant_config.json
```