# Heuristic: Tencent ncnn Optimize Before Quantize
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, Quantization |
| Last Updated | 2026-02-09 19:00 GMT |
## Overview

Sequencing rule: always run `ncnnoptimize` on the fp32 model before quantization, so that graph fusion simplifies the network and improves int8 calibration accuracy.
## Description
The ncnn quantization pipeline requires a specific ordering: first run `ncnnoptimize` to fuse operations (BatchNorm into Conv, Conv+ReLU fusion, etc.) and convert to fp16-storage format, then prepare calibration data, then run `ncnn2table` for calibration, and finally run `ncnn2int8` to produce the quantized model. Running quantization on an unoptimized model produces suboptimal results because unfused operations create unnecessary quantize/dequantize boundaries and the calibration statistics are less accurate on redundant operations.
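The BatchNorm-into-Conv fusion mentioned above can be illustrated with a toy per-channel affine model (a sketch with made-up parameter values, not ncnn's actual fusion code): the BN scale and shift fold algebraically into the conv weight and bias, so the fused graph computes identical outputs with one fewer operation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)  # toy activations for one channel

# Conv modeled as a per-channel affine: y = w*x + b (hypothetical values)
w, b = 0.8, 0.05
# BatchNorm parameters (hypothetical values)
gamma, beta, mean, var, eps = 1.5, -0.2, 0.1, 0.9, 1e-5

# Separate ops: conv, then batchnorm
y_sep = gamma * ((w * x + b) - mean) / np.sqrt(var + eps) + beta

# Fused: fold the BN scale/shift into the conv weight/bias
s = gamma / np.sqrt(var + eps)
w_fused = s * w
b_fused = s * (b - mean) + beta
y_fused = w_fused * x + b_fused

assert np.allclose(y_sep, y_fused)  # identical outputs, one op instead of two
```

Because the fused form is exactly equivalent in fp32, fusion costs nothing in accuracy while removing an intermediate tensor that would otherwise need its own calibration statistics.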
## Usage

Use this heuristic whenever performing post-training quantization on an ncnn model. The correct sequence is: (1) `ncnnoptimize`, (2) prepare calibration images, (3) `ncnn2table`, (4) `ncnn2int8`. Never skip the optimization step.
## The Insight (Rule of Thumb)
- Action: Always run `ncnnoptimize model.param model.bin model-opt.param model-opt.bin 65536` before quantization.
- Value: The `65536` flag enables fp16 weight storage in the optimized model, reducing model size by 50%.
- Trade-off: The optimization step adds build time but is essential for both model size reduction and quantization quality.
- Sequence: ncnnoptimize (fp16) -> ncnn2table (calibration) -> ncnn2int8 (quantize). Skipping step 1 degrades accuracy.
## Reasoning
Graph optimization fuses patterns like Conv+BatchNorm+ReLU into a single operation. Without fusion, the quantization calibration must account for intermediate precision boundaries at each unfused operation, leading to more quantization error accumulation. The fused graph has fewer operations and more predictable activation distributions, making the KL-divergence or ACIQ calibration algorithms more effective. The fp16 flag (`65536`) halves the model binary size, which is valuable for mobile deployment but does not affect inference accuracy (weights are converted back to fp32/int8 at load time).
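The error-accumulation argument can be sketched numerically (a toy scalar model with hypothetical affine parameters, not ncnn's quantizer): a symmetric int8 round-trip after each of two unfused affine ops introduces two rounding errors, while the fused op introduces only one.

```python
import numpy as np

def fake_quant(x, scale):
    """Symmetric int8 round-trip: quantize to [-127, 127], then dequantize."""
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)

w1, b1 = 0.5, 0.1    # conv-like affine (hypothetical values)
w2, b2 = 1.2, -0.3   # batchnorm-like affine (hypothetical values)

ref = w2 * (w1 * x + b1) + b2          # exact fp32 result

# Unfused: a quantize/dequantize boundary after each op
y1 = w1 * x + b1
s1 = np.abs(y1).max() / 127
y2 = w2 * fake_quant(y1, s1) + b2
s2 = np.abs(y2).max() / 127
unfused = fake_quant(y2, s2)

# Fused: one combined affine, one boundary
yf = (w1 * w2) * x + (w2 * b1 + b2)
sf = np.abs(yf).max() / 127
fused = fake_quant(yf, sf)

err_unfused = np.mean((unfused - ref) ** 2)
err_fused = np.mean((fused - ref) ** 2)
assert err_fused < err_unfused  # fewer boundaries, less accumulated rounding error
```

The first rounding error in the unfused path is additionally amplified by the second op's weight, which is why calibration on the fused graph tracks the true fp32 outputs more closely.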
Documented workflow from the ncnn quantization guide:

```shell
# Step 1: Optimize the fp32 model (65536 enables fp16 weight storage)
ncnnoptimize model.param model.bin model-opt.param model-opt.bin 65536

# Step 2: Generate calibration table
ncnn2table model-opt.param model-opt.bin imagelist.txt model.table \
    mean=[104,117,123] norm=[1,1,1] shape=[224,224,3] \
    pixel=BGR thread=8 method=kl

# Step 3: Apply quantization
ncnn2int8 model-opt.param model-opt.bin model-int8.param model-int8.bin model.table
```
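For intuition on what `method=kl` does, the TensorRT-style entropy calibration it follows can be sketched as: build a histogram of absolute activation values, and for each candidate clipping threshold compare the clipped reference distribution against a 128-level quantized version of it, keeping the threshold with minimum KL divergence. A simplified sketch (bin counts and helper names are illustrative, not ncnn2table's internals):

```python
import numpy as np

def kl_divergence(p, q):
    # KL(p || q) over histogram bins, skipping empty reference bins
    p = p / p.sum()
    q = q / q.sum()
    q = np.where(q > 0, q, 1e-12)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def find_kl_threshold(activations, num_bins=512, num_quant=128):
    """Pick a clipping threshold minimizing KL(reference || quantized) -- a sketch."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    best_t, best_kl = edges[-1], float("inf")
    for i in range(num_quant, num_bins + 1):
        ref = hist[:i].astype(np.float64)
        ref[-1] += hist[i:].sum()          # fold clipped outliers into the last bin
        # collapse the first i bins to num_quant levels, then expand back uniformly
        q = np.zeros(i)
        for chunk in np.array_split(np.arange(i), num_quant):
            counts = hist[chunk].astype(np.float64)
            nonzero = np.count_nonzero(counts)
            if nonzero:
                q[chunk] = np.where(counts > 0, counts.sum() / nonzero, 0.0)
        kl = kl_divergence(ref, q)
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, size=100_000)  # stand-in for one layer's activations
threshold = find_kl_threshold(acts)
scale = threshold / 127.0  # the per-tensor int8 scale the table would record
```

Clipping at a KL-optimal threshold rather than the absolute maximum sacrifices a few outliers to spend the 128 quantization levels where the activation mass actually is.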