Workflow:Tencent Ncnn Post Training Quantization

Knowledge Sources	ncnn Quantized Int8 Inference Guide, ncnnoptimize Guide
Domains	Quantization, Model_Optimization, Edge_Deployment
Last Updated	2026-02-09 19:00 GMT

Overview

End-to-end process for converting a float32 ncnn model to an int8 quantized model using post-training quantization for efficient mobile deployment.

Description

This workflow applies post-training quantization (PTQ) to an existing ncnn float32 model to produce an int8 quantized model. The quantized model uses 8-bit integer arithmetic for compute-intensive layers. Because each quantized weight takes 8 bits instead of 32, the weights of those layers shrink to roughly a quarter of their float32 size, and integer arithmetic typically speeds up inference on mobile and embedded CPUs. The process has three parts: optimizing the model graph, running calibration on representative data to determine quantization scales, and applying those scales to produce the final int8 model files.

Key outcomes:

An int8-quantized ncnn model (.param + .bin) with reduced size and faster inference
A calibration table file recording per-layer quantization scales
Support for mixed-precision inference by selectively excluding layers from quantization
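
To make the notion of a "quantization scale" concrete, here is a minimal sketch of symmetric int8 quantization in plain Python. It is an illustration under stated assumptions, not ncnn's implementation: the scale is taken from the absolute maximum of the values, whereas calibration methods such as KL-divergence usually pick a tighter clipping threshold for activations. The function names and the sample weights are hypothetical.

```python
def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` to 127."""
    peak = max(abs(v) for v in values)
    return 127.0 / peak if peak > 0 else 1.0


def quantize(values, scale):
    """Multiply by the scale, round, and clamp to the symmetric range [-127, 127]."""
    return [max(-127, min(127, round(v * scale))) for v in values]


def dequantize(q_values, scale):
    """Map int8 codes back to approximate float values."""
    return [q / scale for q in q_values]


# Hypothetical weights for one layer.
weights = [0.6, -1.0, 0.25, 0.03]
scale = absmax_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)

for w, code, r in zip(weights, q, restored):
    print(f"{w:+.4f} -> {code:+4d} -> {r:+.4f}")
```

Each restored value differs from the original by at most half a quantization step (0.5 / scale), which is the rounding error that calibration tries to keep small for the values that matter most.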

Usage

Execute this workflow when you have a float32 ncnn model that needs to be deployed on resource-constrained mobile or embedded devices where inference speed and model size are critical. You need a representative calibration dataset (ideally 5000+ images from the validation set) to generate accurate quantization scales.

Execution Steps

Step 1: Optimize the Float32 Model

Run ncnnoptimize on the original ncnn model to apply graph-level optimizations before quantization. This fuses operators (Convolution+BatchNorm, Convolution+ReLU, etc.), eliminates no-op layers, and produces a cleaner graph that quantizes more effectively.

Key considerations:

If the model was already converted via PNNX, this step can be skipped as PNNX applies these optimizations
Pass 0 as the trailing flag argument to keep fp32 weights (do not convert to fp16 before quantization)
The optimized model is saved as a new .param/.bin pair
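
The optimization step can be scripted as follows. This is a sketch that assumes the ncnnoptimize binary is on PATH and takes its arguments in the order input .param, input .bin, output .param, output .bin, flag; the file names and the helper functions are hypothetical.

```python
import shutil
import subprocess


def ncnnoptimize_cmd(in_param, in_bin, out_param, out_bin, flag=0):
    """Build the argument list for ncnnoptimize; flag=0 keeps fp32 weights."""
    return ["ncnnoptimize", in_param, in_bin, out_param, out_bin, str(flag)]


def run(cmd):
    """Run a tool command, failing early if the binary is not on PATH."""
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(cmd[0] + " not found on PATH")
    subprocess.run(cmd, check=True)


# Hypothetical file names; substitute your own model files.
cmd = ncnnoptimize_cmd("model.param", "model.bin",
                       "model-opt.param", "model-opt.bin")
print(" ".join(cmd))
# run(cmd)  # uncomment once the model files and the tool are in place
```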

Step 2: Prepare Calibration Dataset

Assemble a representative dataset of input samples for calibration. Create a text file listing the paths to all calibration images (or .npy files for non-image inputs). The calibration data should reflect the distribution of real-world inputs the model will encounter.

Key considerations:

Use at least 5000 images from the validation dataset for best results
For image inputs, create an image list file, for example with: find images/ -type f > imagelist.txt
For non-image inputs, prepare .npy files with the same preprocessing as training
Models with multiple input nodes need one list file per input, passed to ncnn2table as a comma-separated list
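
Where a plain find command is not convenient (for example when non-image files share the directory), the list file can be built with a short script. This is a sketch: the extension filter and the function name are assumptions, not part of the ncnn tooling.

```python
import os

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp"}


def write_image_list(image_dir, list_path, exts=IMAGE_EXTS):
    """Write one image path per line to list_path and return the count."""
    paths = []
    for root, _dirs, files in os.walk(image_dir):
        for name in files:
            if os.path.splitext(name)[1].lower() in exts:
                paths.append(os.path.join(root, name))
    paths.sort()  # deterministic order makes calibration runs reproducible
    with open(list_path, "w") as f:
        for p in paths:
            f.write(p + "\n")
    return len(paths)
```

Calling write_image_list("images", "imagelist.txt") returns the number of listed files, which is a quick way to check the calibration set against the recommended size.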

Step 3: Generate Calibration Table

Run ncnn2table to compute per-layer quantization scale factors. The tool feeds calibration data through the float32 model, collects activation distributions per layer, and computes optimal scale factors using either KL-divergence or ACIQ algorithms.

Key considerations:

Specify mean, norm, shape, and pixel parameters matching the model's preprocessing
method=kl uses KL-divergence minimization (recommended for most models)
method=aciq uses Analytical Clipping for Integer Quantization
thread controls the number of CPU threads for parallel calibration
The output is a .table file containing per-layer weight and activation scales
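
The calibration call can be assembled like this. The sketch assumes ncnn2table takes four positional paths (param, bin, image list, output table) followed by key=value options as named above; the preprocessing values shown are illustrative placeholders and must be replaced with the ones your model was trained with. The helper names are hypothetical.

```python
def bracket(values):
    """Format a sequence as the [a,b,c] form used in key=value options."""
    return "[" + ",".join(str(v) for v in values) + "]"


def ncnn2table_cmd(param, bin_path, image_list, table,
                   mean, norm, shape, pixel="BGR", thread=8, method="kl"):
    """Build the argument list for a single-input ncnn2table run."""
    return [
        "ncnn2table", param, bin_path, image_list, table,
        "mean=" + bracket(mean),
        "norm=" + bracket(norm),
        "shape=" + bracket(shape),
        "pixel=" + pixel,
        "thread=" + str(thread),
        "method=" + method,
    ]


# Placeholder preprocessing values and hypothetical file names.
table_cmd = ncnn2table_cmd(
    "model-opt.param", "model-opt.bin", "imagelist.txt", "model.table",
    mean=[104, 117, 123], norm=[0.017, 0.017, 0.017], shape=[224, 224, 3],
)
print(" ".join(table_cmd))
```

Passing the result to subprocess.run(table_cmd, check=True) executes the calibration once the tool and files are available.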

Step 4: Apply Quantization to Model

Run ncnn2int8 with the optimized model and calibration table to produce the final int8 quantized model. The tool converts applicable layers from float32 to int8, embedding the quantization parameters into the model.

Key considerations:

The output is a new .param/.bin pair with int8 layers
Layers not in the calibration table remain in float32
For mixed precision, comment out specific layer lines in the .table file (prefix with #) to keep those layers in float32
RNN/LSTM/GRU layers support dynamic quantization without a calibration table
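
The mixed-precision edit and the final conversion can be sketched as below. The table layout is an assumption: each line is taken to start with a key that is either a layer name (activation scale) or the layer name followed by _param_0 (weight scales). The sample table, layer names, and helper functions are hypothetical; check them against your own .table file.

```python
def exclude_layers(table_text, layer_names):
    """Prefix the table lines of the given layers with '#' to keep them in float32."""
    excluded = set(layer_names)
    suffix = "_param_0"
    out = []
    for line in table_text.splitlines():
        tokens = line.split()
        key = tokens[0] if tokens else ""
        base = key[:-len(suffix)] if key.endswith(suffix) else key
        if base in excluded:
            line = "#" + line
        out.append(line)
    return "\n".join(out) + "\n"


def ncnn2int8_cmd(in_param, in_bin, out_param, out_bin, table):
    """Build the argument list for ncnn2int8."""
    return ["ncnn2int8", in_param, in_bin, out_param, out_bin, table]


# Hypothetical table contents with made-up scale values.
sample_table = (
    "conv1_param_0 156.6 120.3\n"
    "conv10_param_0 98.1\n"
    "conv1 0.93\n"
    "conv10 1.27\n"
)
edited = exclude_layers(sample_table, ["conv1"])
print(edited, end="")

int8_cmd = ncnn2int8_cmd("model-opt.param", "model-opt.bin",
                         "model-int8.param", "model-int8.bin", "model.table")
print(" ".join(int8_cmd))
```

Matching on the exact key rather than on a prefix keeps conv10 quantized when only conv1 is excluded.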

Step 5: Validate Quantized Model

Load the int8 model using the standard ncnn inference API and verify that accuracy meets requirements by comparing its outputs against those of the float32 model on a validation set.

Key considerations:

No changes to inference code are needed; ncnn detects int8 layers automatically
Monitor accuracy degradation; if it is excessive, use mixed precision by reverting sensitive layers to float32 (comment out their lines in the .table file and rerun ncnn2int8)
Benchmark inference speed improvement on the target device
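
One way to quantify the comparison is to collect the output vectors of both models for the same inputs and compute agreement metrics. The sketch below assumes the outputs have already been extracted into plain Python lists; the metric choice, function names, and sample numbers are illustrative, not prescribed by ncnn.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two output vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


def top1_agreement(fp32_outputs, int8_outputs):
    """Fraction of samples where both models predict the same class."""
    same = sum(1 for f, q in zip(fp32_outputs, int8_outputs)
               if argmax(f) == argmax(q))
    return same / len(fp32_outputs)


# Hypothetical classifier outputs for two validation samples.
fp32_out = [[0.10, 0.70, 0.20], [0.60, 0.30, 0.10]]
int8_out = [[0.12, 0.68, 0.20], [0.40, 0.45, 0.15]]

print("top-1 agreement:", top1_agreement(fp32_out, int8_out))
for f, q in zip(fp32_out, int8_out):
    print("cosine similarity:", round(cosine_similarity(f, q), 4))
```

Samples with low similarity or a changed top-1 prediction are a reasonable starting point when deciding which layers to keep in float32.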
