
Implementation:Alibaba MNN Auto Quant Validation

From Leeroopedia


Field Value
Implementation Name Auto_Quant_Validation
Type API Doc
Topic Model_Compression
Workflow Model_Compression
Description Automated quantization parameter search and offline INT8 calibration tools for compression validation
Source File(s) tools/converter/tools/auto_quant.py:L132-225, tools/quantization/calibration.cpp:L1661, tools/quantization/quantized.cpp:L14-16
Last Updated 2026-02-10 14:00 GMT

API Signatures

Automated Weight Quantization Tuning

python auto_quant.py --model <float.mnn> --quant_model <quant.mnn> --test_dir <dir> --rate 0.05 [--select_bits 1] [--select_block 1] [--hqq 1]

Offline INT8 Quantization

./quantized.out <float.mnn> <quant_int8.mnn> <config.json>

Source Definitions

auto_quant.py: mainFunction

From tools/converter/tools/auto_quant.py (lines 216-225):

def mainFunction():
    parser = argparse.ArgumentParser(description='llm_exporter', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('--model', type=str, required=True, help='src float mnn model')
    parser.add_argument('--quant_model', type=str, required=True, help='dst quant mnn model')
    parser.add_argument('--test_dir', type=str, required=True, help='test dir')
    parser.add_argument('--rate', type=float, default=0.05, help='test rate')
    parser.add_argument('--select_bits', type=int, default=1, help='Try set layer as 4 bits')
    parser.add_argument('--select_block', type=int, default=1, help='Try select blocks')
    parser.add_argument('--hqq', type=int, default=1, help='Use HQQ method')
    args = parser.parse_args()

auto_quant.py: findBestBits

From tools/converter/tools/auto_quant.py (lines 132-154):

def findBestBits(info, test, targetRate):
    info.setBlock(64)
    info.update()
    rate = test.test()
    if rate > targetRate:
        return rate
    length = info.mutableSize()
    tested = False
    for i in range(length):
        info.setBits(i, 4)
        info.update()
        rate = test.test()
        tested = True
        if rate > targetRate:
            # roll back to 8
            info.setBits(i, 8)
            tested = False
        else:
            print('Set %d layer to 4 bits' %i, ', rate=%f' %rate)
    if not tested:
        info.update()
        rate = test.test()
    return rate
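The loop above implements a greedy per-layer policy: tentatively set a layer to 4 bits, re-measure the error, and roll back to 8 bits if the error exceeds the target. The policy can be sketched in isolation as a toy re-implementation, where `layer_errors_4bit` stands in for re-running `test.test()` after each tentative change (an assumption for illustration; in the real tool the error depends on all accumulated choices):

```python
def find_best_bits(layer_errors_4bit, base_error, target_rate):
    """Greedy sketch of findBestBits: keep a layer at 4 bits only if the
    measured error stays within target_rate; otherwise roll back to 8 bits.
    layer_errors_4bit[i] models the error observed after layer i is
    tentatively switched to 4 bits."""
    bits = [8] * len(layer_errors_4bit)
    rate = base_error
    for i, err in enumerate(layer_errors_4bit):
        if err <= target_rate:
            bits[i] = 4        # accept 4 bits for this layer
            rate = err
        # else: roll back, bits[i] stays 8
    return bits, rate
```

With a 5% target, layers whose 4-bit error stays under 0.05 are kept at 4 bits and the rest remain at 8, mirroring the "roll back to 8" branch in the source.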

auto_quant.py: findBestBlock

From tools/converter/tools/auto_quant.py (lines 156-197):

def findBestBlock(info, test, targetRate):
    validBlock = 0
    bestBlock = 256
    bestRate = 1.0
    rate = 1.0
    for block in (256, 128, 64, 32):
        info.setBlock(block)
        info.update()
        rate = test.test()
        print('block=%d,' %block + ' rate=%f' %rate)
        if rate < bestRate:
            bestRate = rate
            bestBlock = block
        if rate < targetRate:
            validBlock = block
            break
    # ... binary search for mixed block configuration
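The first phase of `findBestBlock` scans block sizes from coarse to fine, tracking the lowest error seen and stopping early once a block size meets the target rate. That scan can be sketched as follows, with `rates_by_block` standing in for the measured `test.test()` results (an illustrative stand-in, not the tool's API):

```python
def find_best_block(rates_by_block, target_rate):
    """Sketch of the block-size scan in findBestBlock: try candidate
    block sizes from coarse (256) to fine (32), remember the best
    (lowest-error) block, and stop as soon as one meets the target."""
    best_block, best_rate = 256, 1.0
    valid_block = 0
    for block in (256, 128, 64, 32):
        rate = rates_by_block[block]
        if rate < best_rate:
            best_rate, best_block = rate, block
        if rate < target_rate:
            valid_block = block   # early exit: target met
            break
    return best_block, best_rate, valid_block
```

Smaller blocks generally lower the error at the cost of more scale parameters (larger model), which is why the scan starts coarse and stops at the first block size that satisfies the target.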

quantized.out: quant_main

From tools/quantization/calibration.cpp (lines 1661-1670):

int quant_main(int argc, const char* argv[]) {
    if (argc < 4) {
        DLOG(INFO) << "Usage: ./quantized.out src.mnn dst.mnn preTreatConfig.json\n";
        return 0;
    }
    const char* modelFile      = argv[1];
    const char* preTreatConfig = argv[3];
    const char* dstFile        = argv[2];
    // ... creates Calibration instance and runs quantization

quantized.cpp: main

From tools/quantization/quantized.cpp (lines 14-16):

int main(int argc, const char* argv[]) {
    return quant_main(argc, argv);
}

Parameters

auto_quant.py Parameters

Parameter Type Default Description
--model string (required) Path to the source float MNN model.
--quant_model string (required) Path for the output quantized MNN model.
--test_dir string (required) Path to the test directory containing input.json, input data files, and expected output files for accuracy validation.
--rate float 0.05 Maximum allowed relative error rate. The auto-quant search ensures the compressed model stays below this threshold. A value of 0.05 means 5% maximum relative error.
--select_bits int 1 When set to 1, enables the bit-width selection phase that tries setting layers to 4-bit quantization. Set to 0 to skip.
--select_block int 1 When set to 1, enables the block size optimization phase that searches over block sizes (256, 128, 64, 32). Set to 0 to skip.
--hqq int 1 When set to a nonzero value, enables the HQQ (Half-Quadratic Quantization) method for improved accuracy during the search. Set to 0 to disable.
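The --rate threshold compares the quantized model's outputs against the float model's. The exact metric is computed inside `test.test()`; a common definition, assumed here for illustration, is the L2 norm of the output difference divided by the L2 norm of the float reference:

```python
import math

def relative_error(float_out, quant_out):
    """One plausible relative-error metric for the --rate threshold:
    ||quant - float||_2 / ||float||_2. (Assumption: the exact formula
    used internally by auto_quant.py may differ.)"""
    diff = math.sqrt(sum((q - f) ** 2 for q, f in zip(quant_out, float_out)))
    ref = math.sqrt(sum(f ** 2 for f in float_out))
    return diff / ref
```

Under this definition, a quantized output within 5% relative error of the float output passes the default `--rate 0.05`.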

quantized.out / Calibration JSON Parameters

Parameter Type Default Description
feature_quantize_method string "KL" Method for computing feature (activation) quantization scales. Options: KL (KL-divergence, needs 100-1000 images), ADMM (optimization-based, needs one batch), EMA (exponential moving average, supports asymmetric quantization).
weight_quantize_method string "MAX_ABS" Method for weight quantization. Options: MAX_ABS (symmetric, using max absolute value), ADMM (optimization-based).
quant_bits int 8 Number of quantization bits for the INT8 model.
path string (required) Directory containing calibration images.
used_image_num int all images in path Number of calibration images to use.
format string (required) Image format: "RGB", "BGR", "RGBA", or "GRAY".
mean float[] (required) Per-channel mean values for normalization: dst = (src - mean) * normal.
normal float[] (required) Per-channel scale values for normalization.
width int (required) Model input width.
height int (required) Model input height.
feature_clamp_value int 127 Feature quantization range [-v, v]. Reduce to mitigate overflow errors.
weight_clamp_value int 127 Weight quantization range. Adjusting feature_clamp_value is generally preferred.
batch_size int 32 Batch size for EMA calibration method. Should match training batch size.
skip_quant_op_names string[] [] Op names to exclude from quantization (e.g., the first conv layer).
debug bool false When true, outputs per-layer cosine distance and overflow rate between float and quantized models.
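The table above can be turned into a small config builder: required fields are passed explicitly, the documented string defaults are filled in, and anything else (e.g. `used_image_num`) is supplied as an override, leaving the remaining defaults to the tool itself. The helper name is ours; only the JSON keys come from the table:

```python
import json

def make_calib_config(path, width, height, mean, normal, fmt="RGB", **overrides):
    """Assemble a quantized.out calibration config. Required keys are
    explicit arguments; feature/weight method defaults follow the table;
    unlisted keys fall back to the tool's own defaults."""
    cfg = {
        "path": path, "width": width, "height": height,
        "mean": mean, "normal": normal, "format": fmt,
        "feature_quantize_method": "KL",
        "weight_quantize_method": "MAX_ABS",
    }
    cfg.update(overrides)
    return cfg

cfg = make_calib_config("./calibration_images/", 224, 224,
                        [103.94, 116.78, 123.68], [0.017] * 3,
                        used_image_num=500)
print(json.dumps(cfg, indent=2))
```

Writing the result to a file produces a JSON equivalent to the hand-written config in the usage example below.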

Inputs

For auto_quant.py

  • Float MNN model -- The uncompressed source model.
  • Test directory -- A directory containing:
    • input.json -- Describes input/output tensor names and shapes.
    • input0.txt, input1.txt, ... -- Input tensor data files.
    • output.txt -- Expected output tensor data.
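The test directory layout above can be generated with a short helper. Note the `input.json` field names used here are an assumption for illustration, not the tool's documented schema; only the file layout (input.json, input<i>.txt per input tensor, output.txt) comes from the list above:

```python
import json
import os

def make_test_dir(root, input_names, output_names, inputs, expected_output):
    """Lay out an auto_quant.py test directory: input.json describing
    tensor names, input<i>.txt per input tensor, and output.txt with the
    expected float-model output."""
    os.makedirs(root, exist_ok=True)
    with open(os.path.join(root, "input.json"), "w") as f:
        json.dump({"inputs": input_names, "outputs": output_names}, f)
    for i, data in enumerate(inputs):
        with open(os.path.join(root, "input%d.txt" % i), "w") as f:
            f.write("\n".join(str(x) for x in data))
    with open(os.path.join(root, "output.txt"), "w") as f:
        f.write("\n".join(str(x) for x in expected_output))
```

A representative input/output pair captured from the float model is enough for the search loop to measure the relative error of each candidate quantization.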

For quantized.out

  • Float MNN model -- The uncompressed source model.
  • Calibration JSON -- A JSON configuration file specifying preprocessing parameters and calibration settings.
  • Calibration images -- 100-1000 representative images in the directory specified by the JSON path field.

Outputs

From auto_quant.py

  • Optimized quantized model (quant_model) -- The weight-quantized MNN model with per-layer optimized bit-widths and block sizes.
  • Compression params JSON (quant_model.json) -- A JSON file recording the quantization configuration for each layer, including bit-width, block size, and asymmetric flag. This file can be manually edited and re-applied with MNNConvert --compressionParamsFile.
  • Compression report -- Printed to stdout, showing the final error rate and size reduction (original MB to compressed MB).
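Since the compression params JSON can be hand-edited and re-applied with MNNConvert --compressionParamsFile, a typical tweak is forcing one layer's bit-width. The per-layer schema shown here ("layers" list with "name"/"bits" keys) is an assumption for illustration; inspect the generated quant_model.json for the real field names:

```python
import json

def set_layer_bits(params_path, layer_name, bits):
    """Sketch of hand-editing the compression params JSON before
    re-applying it with MNNConvert: find a layer by name and override
    its quantization bit-width in place."""
    with open(params_path) as f:
        params = json.load(f)
    for layer in params.get("layers", []):
        if layer.get("name") == layer_name:
            layer["bits"] = bits
    with open(params_path, "w") as f:
        json.dump(params, f, indent=2)
```

After editing, re-running MNNConvert with the modified file regenerates the quantized model under the adjusted configuration.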

From quantized.out

  • INT8 quantized model -- A fully quantized MNN model where both weights and activations use INT8 arithmetic.

Usage Examples

Auto-Quant Workflow

# Step 1: Convert source model to float MNN
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel float.mnn

# Step 2: Prepare test directory
# mnntest/
#   input.json
#   input0.txt
#   output.txt

# Step 3: Run auto-quant search
python tools/converter/tools/auto_quant.py \
    --model float.mnn \
    --quant_model auto_quant.mnn \
    --test_dir mnntest \
    --rate 0.05 \
    --hqq 1

# Step 4 (optional): Re-apply with manual adjustments
./MNNConvert -f MNN --modelFile float.mnn --MNNModel final.mnn \
    --compressionParamsFile auto_quant.mnn.json --hqq

Offline INT8 Quantization Workflow

# Step 1: Convert source model to float MNN
./MNNConvert -f ONNX --modelFile model.onnx --MNNModel float.mnn

# Step 2: Create calibration config
cat > quant_config.json << 'ENDJSON'
{
    "format": "RGB",
    "mean": [103.94, 116.78, 123.68],
    "normal": [0.017, 0.017, 0.017],
    "width": 224,
    "height": 224,
    "path": "./calibration_images/",
    "used_image_num": 500,
    "feature_quantize_method": "KL",
    "weight_quantize_method": "MAX_ABS"
}
ENDJSON

# Step 3: Run offline quantization
./quantized.out float.mnn quant_int8.mnn quant_config.json
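The debug flag in the calibration config reports per-layer cosine distance between the float and quantized models; the same check can be run by hand on captured output tensors. This is a plain re-implementation of the metric, not a call into the tool:

```python
import math

def cosine_distance(a, b):
    """Cosine distance between a float-model output and the quantized
    model's output: 1 - cos(a, b). Values near 0 mean the quantized
    model tracks the float model closely; values near 1 mean the
    outputs are uncorrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)
```

A quantized model whose outputs sit at a cosine distance of a few thousandths from the float model is usually considered a successful calibration; large distances on a specific layer suggest adding it to skip_quant_op_names.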
