Workflow:Tencent Ncnn Post Training Quantization

From Leeroopedia


Knowledge Sources
Domains: Quantization, Model_Optimization, Edge_Deployment
Last Updated: 2026-02-09 19:00 GMT

Overview

End-to-end process for converting a float32 ncnn model to an int8 quantized model using post-training quantization for efficient mobile deployment.

Description

This workflow applies post-training quantization (PTQ) to an existing ncnn float32 model to produce an int8 quantized model. The quantized model uses 8-bit integer arithmetic for compute-intensive layers, so the weights of those layers occupy roughly a quarter of their float32 size and inference is typically faster on mobile and embedded CPUs. The process has three stages: optimizing the model graph, running calibration on representative data to determine quantization scales, and applying those scales to produce the final int8 model files.

Key outcomes:

  • An int8-quantized ncnn model (.param + .bin) with reduced size and faster inference
  • A calibration table file recording per-layer quantization scales
  • Support for mixed-precision inference by selectively excluding layers from quantization

Usage

Execute this workflow when you have a float32 ncnn model that needs to be deployed on resource-constrained mobile or embedded devices where inference speed and model size are critical. You need a representative calibration dataset (ideally 5000+ images from the validation set) to generate accurate quantization scales.

Execution Steps

Step 1: Optimize the Float32 Model

Run ncnnoptimize on the original ncnn model to apply graph-level optimizations before quantization. This fuses operators (Convolution+BatchNorm, Convolution+ReLU, etc.), eliminates no-op layers, and produces a cleaner graph that quantizes more effectively.

Key considerations:

  • If the model was already converted via PNNX, this step can be skipped, as PNNX applies these optimizations
  • Pass 0 as the final flag argument to keep the weights in fp32 (do not convert to fp16 before quantization)
  • The optimized model is saved as a new .param/.bin pair
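
The optimization step can be sketched as a small Python helper that assembles the ncnnoptimize command line. The argument order (input .param, input .bin, output .param, output .bin, flag) follows ncnn's documented usage; the file names are hypothetical placeholders, and the helper names are introduced here for illustration only.

```python
import shutil
import subprocess


def build_ncnnoptimize_cmd(param_in, bin_in, param_out, bin_out, flag=0):
    """Return the argument list for an ncnnoptimize run.

    flag=0 keeps the weights in fp32, as this step recommends.
    """
    return ["ncnnoptimize", param_in, bin_in, param_out, bin_out, str(flag)]


def run_if_available(cmd):
    """Run cmd only if its executable is on PATH; return True if it ran."""
    if shutil.which(cmd[0]) is None:
        print("not found on PATH, would run:", " ".join(cmd))
        return False
    subprocess.run(cmd, check=True)
    return True


# Hypothetical file names, for illustration only.
optimize_cmd = build_ncnnoptimize_cmd(
    "model.param", "model.bin", "model-opt.param", "model-opt.bin"
)
print(" ".join(optimize_cmd))
```

Calling run_if_available(optimize_cmd) would execute the tool when it is installed; building the argument list separately keeps the command easy to log and review first.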

Step 2: Prepare Calibration Dataset

Assemble a representative dataset of input samples for calibration. Create a text file listing the paths to all calibration images (or .npy files for non-image inputs). The calibration data should reflect the distribution of real-world inputs the model will encounter.

Key considerations:

  • Use at least 5000 images from the validation dataset for best results
  • For image inputs, create an image list file using find images/ -type f > imagelist.txt
  • For non-image inputs, prepare .npy files with the same preprocessing as training
  • For models with multiple input nodes, provide one list file per input and pass them as a comma-separated list
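
Where the find command is not available, the image list can be produced with a short Python script. This is a minimal sketch: the set of extensions and the function name are assumptions made for illustration, not part of ncnn.

```python
import os
import tempfile

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}  # assumed set of extensions


def write_image_list(image_dir, list_path, exts=IMAGE_EXTS):
    """Walk image_dir and write one image path per line; return the count."""
    paths = []
    for root, _dirs, files in os.walk(image_dir):
        for name in files:
            if os.path.splitext(name)[1].lower() in exts:
                paths.append(os.path.join(root, name))
    paths.sort()  # deterministic order across runs
    with open(list_path, "w", encoding="utf-8") as f:
        for p in paths:
            f.write(p + "\n")
    return len(paths)


# Small self-check on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.jpg", "b.PNG", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    list_file = os.path.join(tmp, "imagelist.txt")
    count = write_image_list(tmp, list_file)
    print(count)  # 2: the .txt file is skipped
```

Unlike the plain find command, this version filters by extension, which avoids feeding stray non-image files to the calibration tool.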

Step 3: Generate Calibration Table

Run ncnn2table to compute per-layer quantization scale factors. The tool feeds calibration data through the float32 model, collects activation distributions per layer, and computes optimal scale factors using either KL-divergence or ACIQ algorithms.
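
To make the role of a scale factor concrete, the sketch below applies symmetric int8 quantization to a handful of values. It assumes the common convention scale = 127 / threshold with rounding and clamping to [-127, 127]; the threshold is chosen by a plain abs-max rule as a stand-in, not by the KL-divergence or ACIQ search that ncnn2table performs, and the function names are illustrative.

```python
def scale_from_threshold(threshold):
    """Map a clipping threshold T to an int8 scale of 127 / T."""
    return 127.0 / threshold


def quantize(values, scale):
    """Scale, round, and clamp floats to the symmetric range [-127, 127]."""
    return [max(-127, min(127, round(v * scale))) for v in values]


def dequantize(q_values, scale):
    """Map int8 values back to approximate floats."""
    return [q / scale for q in q_values]


activations = [-3.0, -0.5, 0.0, 0.25, 1.0, 4.0]

# Abs-max stand-in for the calibrated threshold: nothing is clipped.
threshold = max(abs(v) for v in activations)  # 4.0
scale = scale_from_threshold(threshold)       # 31.75
q = quantize(activations, scale)
print(q)  # [-95, -16, 0, 8, 32, 127]

# A smaller threshold clips outliers but gives small values finer resolution.
q_clipped = quantize(activations, scale_from_threshold(2.5))
print(q_clipped)  # [-127, -25, 0, 13, 51, 127]
```

The second call shows the trade-off a calibration algorithm has to balance: a tighter threshold saturates the largest values while spreading the remaining ones over more integer levels.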

Key considerations:

  • Specify mean, norm, shape, and pixel parameters matching the model's preprocessing
  • method=kl uses KL-divergence minimization (recommended for most models)
  • method=aciq uses Analytical Clipping for Integer Quantization
  • thread controls the number of CPU threads for parallel calibration
  • The output is a .table file containing per-layer weight and activation scales
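
The parameters above can be assembled into an ncnn2table invocation as follows. This is a sketch under stated assumptions: the positional order (param, bin, image list, output table) and the key=value syntax follow ncnn's documented example, while the mean, norm, and shape values and the file names are placeholders that must be replaced with the model's real preprocessing settings.

```python
def fmt_list(values):
    """Format [104, 117, 123] as '[104,117,123]' (no spaces)."""
    return "[" + ",".join(str(v) for v in values) + "]"


def build_ncnn2table_cmd(param, bin_path, image_list, table_out, *,
                         mean, norm, shape, pixel, thread=8, method="kl"):
    """Return the argument list for an ncnn2table calibration run."""
    if method not in ("kl", "aciq"):
        raise ValueError("method must be 'kl' or 'aciq'")
    return [
        "ncnn2table", param, bin_path, image_list, table_out,
        "mean=" + fmt_list(mean),
        "norm=" + fmt_list(norm),
        "shape=" + fmt_list(shape),
        "pixel=" + pixel,
        "thread=" + str(thread),
        "method=" + method,
    ]


# Placeholder preprocessing values and hypothetical file names.
table_cmd = build_ncnn2table_cmd(
    "model-opt.param", "model-opt.bin", "imagelist.txt", "model.table",
    mean=[104, 117, 123],
    norm=[0.017, 0.017, 0.017],
    shape=[224, 224, 3],
    pixel="BGR",
    thread=8,
    method="kl",
)
print(" ".join(table_cmd))
```

Keeping the preprocessing values in one call site makes it easier to check them against the values used at inference time, since a mismatch there would distort the collected activation statistics.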

Step 4: Apply Quantization to Model

Run ncnn2int8 with the optimized model and calibration table to produce the final int8 quantized model. The tool converts applicable layers from float32 to int8, embedding the quantization parameters into the model.

Key considerations:

  • The output is a new .param/.bin pair with int8 layers
  • Layers not in the calibration table remain in float32
  • For mixed precision, comment out specific layer lines in the .table file (prefix with #) to keep those layers in float32
  • RNN/LSTM/GRU layers support dynamic quantization, which needs no calibration table; in that case the table argument to ncnn2int8 can be omitted
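
The mixed-precision edit and the ncnn2int8 call can be sketched together. The helper below comments out every table line whose first token is either the layer name or the layer name followed by _param_0; that suffix is assumed from ncnn's documented table example, so check it against a real .table file before relying on it. The sample table contents are made up for illustration.

```python
def comment_out_layers(table_text, layer_names):
    """Prefix the table lines of the given layers with '#'.

    Matches the assumed activation key '<layer>' and the assumed
    weight key '<layer>_param_0'.
    """
    keys = set()
    for name in layer_names:
        keys.add(name)
        keys.add(name + "_param_0")
    out = []
    for line in table_text.splitlines():
        tokens = line.split()
        if tokens and tokens[0] in keys:
            line = "#" + line
        out.append(line)
    return "\n".join(out) + "\n"


def build_ncnn2int8_cmd(param_in, bin_in, param_out, bin_out, table=None):
    """Return the ncnn2int8 argument list; table is omitted if None."""
    cmd = ["ncnn2int8", param_in, bin_in, param_out, bin_out]
    if table is not None:
        cmd.append(table)
    return cmd


# Made-up table contents: weight scales first, then activation scales.
sample_table = (
    "conv1_param_0 156.639840536\n"
    "conv2_param_0 98.1\n"
    "conv1 12.5\n"
    "conv2 7.25\n"
)
mixed_table = comment_out_layers(sample_table, ["conv1"])
print(mixed_table)

# Hypothetical file names.
int8_cmd = build_ncnn2int8_cmd(
    "model-opt.param", "model-opt.bin",
    "model-int8.param", "model-int8.bin", "model.table",
)
print(" ".join(int8_cmd))
```

Writing the edited text to a new table file, rather than overwriting the original, keeps the full calibration result available when experimenting with different sets of excluded layers.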

Step 5: Validate Quantized Model

Load the int8 model using the standard ncnn inference API and verify that accuracy meets requirements by comparing its outputs against those of the float32 model on a validation set.

Key considerations:

  • No changes to inference code are needed; ncnn detects int8 layers automatically
  • Monitor accuracy degradation; if excessive, use mixed precision by reverting sensitive layers to float32
  • Benchmark inference speed improvement on the target device
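
The comparison against the float32 model can be summarized with a few simple metrics. The sketch below assumes the output vectors of both models have already been collected as lists of floats (how they are obtained from ncnn is not shown); the metric choice and the function names are illustrative, and the sample numbers are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=values.__getitem__)


def compare_outputs(fp32_outputs, int8_outputs):
    """Summarize agreement between paired float32 and int8 output vectors."""
    pairs = list(zip(fp32_outputs, int8_outputs))
    agree = sum(argmax(f) == argmax(q) for f, q in pairs)
    return {
        "top1_agreement": agree / len(pairs),
        "min_cosine": min(cosine_similarity(f, q) for f, q in pairs),
        "max_abs_diff": max(abs(x - y) for f, q in pairs for x, y in zip(f, q)),
    }


# Made-up outputs for two samples of a three-class model.
fp32 = [[0.10, 0.70, 0.20], [0.60, 0.30, 0.10]]
int8 = [[0.12, 0.68, 0.20], [0.35, 0.40, 0.25]]
report = compare_outputs(fp32, int8)
print(report)
```

In this made-up data the second sample changes its predicted class, so top-1 agreement is 0.5; on a real validation set, a low agreement or cosine value would be the signal to revert sensitive layers to float32.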
