Workflow:Tencent Ncnn Post Training Quantization
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Optimization, Edge_Deployment |
| Last Updated | 2026-02-09 19:00 GMT |
Overview
End-to-end process for converting a float32 ncnn model to an int8 quantized model using post-training quantization for efficient mobile deployment.
Description
This workflow applies post-training quantization (PTQ) to an existing ncnn float32 model to produce an int8 quantized model. The quantized model uses 8-bit integer arithmetic for compute-intensive layers. Because an int8 weight occupies a quarter of the space of a float32 weight, the quantized layers shrink roughly fourfold, and inference on mobile and embedded CPUs is typically faster. The process has three stages: optimizing the model graph, running calibration on representative data to determine quantization scales, and applying those scales to produce the final int8 model files.
Key outcomes:
- An int8-quantized ncnn model (.param + .bin) with reduced size and faster inference
- A calibration table file recording per-layer quantization scales
- Support for mixed-precision inference by selectively excluding layers from quantization
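To make the role of a quantization scale concrete, here is a minimal sketch of symmetric int8 quantization. It assumes the convention that a float value is multiplied by the scale, rounded, and clamped to [-127, 127], with the scale derived from a clipping threshold as 127 / threshold. The threshold value and the function names are hypothetical, and ncnn's own kernels may differ in detail.

```python
import math

def quantize_int8(x: float, scale: float) -> int:
    """Scale a float, round half away from zero, and clamp to [-127, 127]."""
    scaled = x * scale
    rounded = int(math.copysign(math.floor(abs(scaled) + 0.5), scaled))
    return max(-127, min(127, rounded))

def dequantize_int8(q: int, scale: float) -> float:
    """Recover an approximate float from its int8 representation."""
    return q / scale

# Hypothetical calibration result: activations are clipped at |x| = 2.0.
threshold = 2.0
scale = 127.0 / threshold  # 63.5

for x in (0.5, -1.0, 3.0):
    q = quantize_int8(x, scale)
    print(x, "->", q, "->", dequantize_int8(q, scale))
```

Values inside the threshold survive with a small rounding error, while values beyond it (3.0 here) saturate at 127. Calibration is about choosing the threshold that balances those two kinds of error.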
Usage
Execute this workflow when you have a float32 ncnn model that needs to be deployed on resource-constrained mobile or embedded devices where inference speed and model size are critical. You need a representative calibration dataset (ideally 5000+ images from the validation set) to generate accurate quantization scales.
Execution Steps
Step 1: Optimize the Float32 Model
Run ncnnoptimize on the original ncnn model to apply graph-level optimizations before quantization. This fuses operators (Convolution+BatchNorm, Convolution+ReLU, etc.), eliminates no-op layers, and produces a cleaner graph that quantizes more effectively.
Key considerations:
- If the model was already converted via PNNX, this step can be skipped, as PNNX applies these optimizations
- Pass 0 as the final argument to keep fp32 weight storage; a value of 1 would store weights as fp16, which should not be done before quantization
- The optimized model is saved as a new .param/.bin pair
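As a sketch of the invocation, the helper below assembles the ncnnoptimize argument list in the order input .param, input .bin, output .param, output .bin, storage flag. The file names are placeholders, the helper name is hypothetical, and the tool is assumed to be on PATH.

```python
def ncnnoptimize_cmd(in_param, in_bin, out_param, out_bin, fp16_storage=False):
    """Build the argument list for ncnnoptimize; the last argument is 0 for fp32."""
    flag = "1" if fp16_storage else "0"
    return ["ncnnoptimize", in_param, in_bin, out_param, out_bin, flag]

# Placeholder file names; substitute the real model files.
cmd = ncnnoptimize_cmd(
    "mobilenet.param", "mobilenet.bin",
    "mobilenet-opt.param", "mobilenet-opt.bin",
)
print(" ".join(cmd))

# On a machine where the tool is installed, the command could be run with:
#   import subprocess
#   subprocess.run(cmd, check=True)
```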
Step 2: Prepare Calibration Dataset
Assemble a representative dataset of input samples for calibration. Create a text file listing the paths to all calibration images (or .npy files for non-image inputs). The calibration data should reflect the distribution of real-world inputs the model will encounter.
Key considerations:
- Use at least 5000 images from the validation dataset for best results
- For image inputs, create an image list file, for example with `find images/ -type f > imagelist.txt`
- For non-image inputs, prepare .npy files with the same preprocessing as training
- For models with multiple input nodes, provide one list file per input node and pass them as a comma-separated list
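The list file can also be produced from a script. The sketch below walks a directory recursively and writes one path per line. Unlike the plain find command, it keeps only files with common image suffixes. The suffix set, the function name, and the optional limit parameter are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image suffixes; adjust to the formats in the dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}

def write_image_list(image_dir, list_path, limit=None):
    """Write the paths of all images under image_dir to list_path, one per line.

    Returns the number of paths written.
    """
    paths = sorted(
        p for p in Path(image_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if limit is not None:
        paths = paths[:limit]
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)

# Example call (paths are placeholders):
#   count = write_image_list("images", "imagelist.txt")
#   print(count, "calibration images listed")
```

Sorting the paths keeps the list reproducible between runs, which makes it easier to compare calibration tables generated at different times.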
Step 3: Generate Calibration Table
Run ncnn2table to compute per-layer quantization scale factors. The tool feeds calibration data through the float32 model, collects activation distributions per layer, and computes optimal scale factors using either KL-divergence or ACIQ algorithms.
Key considerations:
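- Specify mean, norm, shape, and pixel parameters matching the model's preprocessing: mean and norm are the values passed to Mat::substract_mean_normalize(), shape is the input blob shape, and pixel is the pixel format the model expects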
- method=kl uses KL-divergence minimization (recommended for most models)
- method=aciq uses Analytical Clipping for Integer Quantization
- thread controls the number of CPU threads for parallel calibration
- The output is a .table file containing per-layer weight and activation scales
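The parameters above are passed to ncnn2table as key=value arguments after the four positional file arguments. The sketch below assembles such a command line. The preprocessing values and file names are illustrative placeholders for a 224x224 BGR model, and the helper names are hypothetical.

```python
def fmt(values):
    """Format a sequence in bracketed form, e.g. [104,117,123]."""
    return "[" + ",".join(str(v) for v in values) + "]"

def ncnn2table_cmd(param, bin_path, image_list, table,
                   mean, norm, shape, pixel="BGR", thread=8, method="kl"):
    """Build the argument list for ncnn2table (single-input model)."""
    return [
        "ncnn2table", param, bin_path, image_list, table,
        f"mean={fmt(mean)}",
        f"norm={fmt(norm)}",
        f"shape={fmt(shape)}",
        f"pixel={pixel}",
        f"thread={thread}",
        f"method={method}",
    ]

# Illustrative values; they must match the model's real preprocessing.
table_cmd = ncnn2table_cmd(
    "mobilenet-opt.param", "mobilenet-opt.bin", "imagelist.txt", "mobilenet.table",
    mean=[104, 117, 123], norm=[0.017, 0.017, 0.017], shape=[224, 224, 3],
)
print(" ".join(table_cmd))
```

If the mean and norm values differ from those used at inference time, the collected activation ranges will not match real inputs, and the resulting scales will be off.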
Step 4: Apply Quantization to Model
Run ncnn2int8 with the optimized model and calibration table to produce the final int8 quantized model. The tool converts applicable layers from float32 to int8, embedding the quantization parameters into the model.
Key considerations:
- The output is a new .param/.bin pair with int8 layers
- Layers not in the calibration table remain in float32
- For mixed precision, comment out a layer's weight scale line in the .table file (prefix it with #) before running ncnn2int8; that layer then stays in float32
- RNN/LSTM/GRU layers support dynamic quantization without a calibration table
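Commenting out table lines by hand is error-prone for larger models, so the edit can be scripted. The sketch below assumes that each layer's weight scale line starts with the key `<layer name>_param_0`; check this against the generated table before relying on it. The layer names and scale values in the sample are made up.

```python
def exclude_layers(table_text, layer_names):
    """Prefix '#' to the weight scale lines of the given layers.

    Assumes weight scale lines are keyed as '<layer name>_param_0'.
    """
    keys = {f"{name}_param_0" for name in layer_names}
    out = []
    for line in table_text.splitlines():
        fields = line.split()
        if fields and fields[0] in keys:
            line = "#" + line
        out.append(line)
    return "\n".join(out) + "\n"

# Made-up table contents for illustration.
sample_table = (
    "conv1_param_0 156.6 120.1\n"
    "conv2_param_0 98.2 101.7\n"
    "conv1 45.3\n"
    "conv2 30.1\n"
)
print(exclude_layers(sample_table, ["conv2"]), end="")
```

The edited table is then passed to ncnn2int8 as its last argument, after the input .param/.bin pair and the output .param/.bin pair.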
Step 5: Validate Quantized Model
Load the int8 model with the standard ncnn inference API and compare its outputs against those of the float32 model on a validation set to verify that accuracy meets requirements.
Key considerations:
- No changes to inference code are needed; ncnn detects int8 layers automatically
- Monitor accuracy degradation; if excessive, use mixed precision by reverting sensitive layers to float32
- Benchmark inference speed improvement on the target device
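One way to quantify the degradation is to compare the two models' output vectors directly. The sketch below assumes the outputs have already been extracted as plain lists of floats, one list per validation sample. The metric choices and function names are illustrative, and task-level accuracy on labeled data remains the final check.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two output vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

def top1(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def top1_agreement(fp32_outputs, int8_outputs):
    """Fraction of samples where both models predict the same top-1 class."""
    matches = sum(top1(f) == top1(q) for f, q in zip(fp32_outputs, int8_outputs))
    return matches / len(fp32_outputs)

# Made-up outputs for two samples of a two-class model.
fp32_outputs = [[0.1, 0.9], [0.8, 0.2]]
int8_outputs = [[0.2, 0.8], [0.4, 0.6]]
print("top-1 agreement:", top1_agreement(fp32_outputs, int8_outputs))
```

A low cosine similarity or top-1 agreement points to quantization-sensitive layers, which are candidates for reverting to float32.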