Principle: Tencent ncnn Int8 Model Quantization
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Optimization, Model_Deployment |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Process of converting a neural network's float32 weights and activations to int8 representation using calibrated scale factors, reducing model size and enabling integer arithmetic acceleration.
Description
Post-training int8 quantization converts float32 model weights to int8 using scale factors computed during calibration. This reduces model size by approximately 4x and enables hardware-accelerated integer arithmetic (e.g., ARM dot product instructions, x86 VNNI), providing significant inference speedup on supported platforms.
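The "approximately 4x" figure follows directly from storage widths — float32 weights occupy 4 bytes each versus 1 byte for int8 (the per-layer scale factors add a small amount of overhead, which is why the reduction is approximate). A quick check using Python's stdlib:

```python
import struct

# float32 weights occupy 4 bytes each; int8 weights occupy 1 byte each.
n_weights = 1_000_000
fp32_bytes = n_weights * struct.calcsize("f")  # 4 bytes per float32
int8_bytes = n_weights * struct.calcsize("b")  # 1 byte per int8

print(fp32_bytes // int8_bytes)  # → 4
```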
The quantization process: (1) reads the calibration table with per-layer scale factors, (2) quantizes weights of supported layer types (Conv, DepthwiseConv, InnerProduct, RNN, LSTM, GRU, Embed, Gemm, MultiHeadAttention, SDPA) from float32 to int8, (3) leaves non-quantizable layers in their original precision, and (4) writes the quantized model.
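The four steps above can be sketched in miniature. This is a hypothetical illustration, not ncnn's implementation: the calibration-table format is simplified to a name-to-scale mapping, layers are plain dicts, and the real ncnn2int8 tool additionally handles per-channel weight scales and activation scales.

```python
# Hypothetical sketch of the quantization pass. Layer and table formats are
# simplified stand-ins for ncnn's actual param/bin and calibration-table files.

QUANTIZABLE = {"Conv", "DepthwiseConv", "InnerProduct", "RNN", "LSTM",
               "GRU", "Embed", "Gemm", "MultiHeadAttention", "SDPA"}

def quantize_weight(w, scale):
    """Symmetric quantization: int8 = clamp(round(fp32 * scale), -127, 127)."""
    return max(-127, min(127, round(w * scale)))

def quantize_model(layers, calibration_table):
    """layers: list of dicts with 'name', 'type', 'weights' (list of floats).
    calibration_table: dict mapping layer name -> weight scale factor.
    Quantizable layers get int8 weights; others keep their float weights."""
    out = []
    for layer in layers:
        if layer["type"] in QUANTIZABLE and layer["name"] in calibration_table:
            scale = calibration_table[layer["name"]]
            q = [quantize_weight(w, scale) for w in layer["weights"]]
            out.append({**layer, "weights": q, "int8": True, "scale": scale})
        else:
            out.append({**layer, "int8": False})  # left in original precision
    return out

layers = [
    {"name": "conv1", "type": "Conv",    "weights": [0.5, -0.25, 1.0]},
    {"name": "prob",  "type": "Softmax", "weights": []},
]
table = {"conv1": 127.0}  # scale = 127 / max_abs, here max_abs = 1.0

quantized = quantize_model(layers, table)
print(quantized[0]["weights"])  # → [64, -32, 127]
print(quantized[1]["int8"])     # → False
```

Note the asymmetry: quantization is a per-layer decision, so a model freely mixes int8 layers with float layers, matching step (3).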
The quantized model uses the same inference API as float32 — ncnn automatically detects quantized weights and uses the int8 execution path when opt.use_int8_inference is true (default).
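The automatic dispatch can be pictured as follows. This is an illustrative sketch, not ncnn's actual code path: only the option name use_int8_inference comes from the source; the class and function are hypothetical.

```python
class Options:
    """Hypothetical stand-in for ncnn's option struct."""
    def __init__(self, use_int8_inference=True):  # int8 path enabled by default
        self.use_int8_inference = use_int8_inference

def select_execution_path(layer_has_int8_weights, opt):
    """Pick the execution path as the description implies: int8 kernels run
    only when the loaded weights are quantized AND the option is enabled."""
    if layer_has_int8_weights and opt.use_int8_inference:
        return "int8"
    return "fp32"

print(select_execution_path(True, Options()))                          # → int8
print(select_execution_path(True, Options(use_int8_inference=False)))  # → fp32
print(select_execution_path(False, Options()))                         # → fp32
```

This is why no application-code change is needed after quantization: the detection happens at model-load time, driven by the weights themselves.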
Usage
Use as the final step in the quantization pipeline, after calibration table generation. Deploy the quantized model for inference on resource-constrained devices. Compare accuracy against the float32 baseline to ensure acceptable quality.
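The accuracy comparison can be as simple as measuring output divergence between the two models on a validation batch. A self-contained sketch, using the int8 round-trip error on a batch of values as a stand-in for running both models (the data and scale are illustrative):

```python
def quantize(x, scale):
    """Symmetric int8 quantization: multiply by scale, round, clamp to ±127."""
    return max(-127, min(127, round(x * scale)))

def dequantize(q, scale):
    return q / scale

# Stand-in for "run fp32 and int8 models, compare outputs": measure the
# error that the int8 round-trip introduces on a batch of activations.
baseline = [0.8, -0.3, 0.05, -0.99]
scale = 127.0  # from the calibration table (scale = 127 / max_abs here)

roundtrip = [dequantize(quantize(x, scale), scale) for x in baseline]
max_abs_err = max(abs(a - b) for a, b in zip(baseline, roundtrip))

# Each int8 step is 1/scale wide, so round-trip error is at most half a step.
assert max_abs_err <= 0.5 / scale
```

In practice the comparison is done on task metrics (top-1 accuracy, mAP, etc.) rather than raw tensor error, since small per-value errors often have negligible effect on the final prediction.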
Theoretical Basis
Symmetric quantization:
int8_value = clamp(round(float32_value × scale), -127, 127)
float32_value ≈ int8_value / scale
Where the scale factor (scale = 127 / threshold, for a per-layer threshold) is determined by the calibration process to minimize quantization error.
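As a numeric illustration of the symmetric scheme (threshold and inputs chosen arbitrarily):

```python
def quantize(x, scale):
    """Symmetric int8 quantization: multiply by scale, round, clamp to ±127."""
    return max(-127, min(127, round(x * scale)))

# With a calibration threshold T = 2.0, the scale is 127 / T = 63.5:
scale = 127 / 2.0

print(quantize(2.0, scale))    # value at the threshold maps to → 127
print(quantize(-2.5, scale))   # values beyond the threshold clamp → -127
print(quantize(1.0, scale))    # round(63.5) → 64
print(quantize(1.0, scale) / scale)  # dequantized back ≈ 1.0079
```

The clamp at ±127 is what makes the threshold choice a trade-off: a smaller threshold gives finer resolution for in-range values but saturates more outliers, which is exactly the error the calibration process balances.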
Supported layer quantization:
Quantizable layers: Conv, DepthwiseConv, InnerProduct, RNN, LSTM, GRU, Embed, Gemm, MultiHeadAttention, SDPA
Non-quantizable: Softmax, Sigmoid, pooling, etc. (remain fp16/fp32)