Principle: Tencent ncnn Int8 Model Quantization
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Model_Optimization, Model_Deployment |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Process of converting a neural network's float32 weights and activations to int8 representation using calibrated scale factors, reducing model size and enabling integer arithmetic acceleration.
Description
Post-training int8 quantization converts float32 model weights to int8 using scale factors computed during calibration. This reduces model size by approximately 4x and enables hardware-accelerated integer arithmetic (e.g., ARM dot product instructions, x86 VNNI), providing significant inference speedup on supported platforms.
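The "approximately 4x" figure follows directly from storage widths — float32 weights occupy 4 bytes each versus 1 byte for int8 (the per-layer scale factors add a small amount of overhead, which is why the reduction is approximate). A quick check using Python's stdlib:

```python
import struct

# float32 weights occupy 4 bytes each; int8 weights occupy 1 byte each.
n_weights = 1_000_000
fp32_bytes = n_weights * struct.calcsize("f")  # 4 bytes per float32
int8_bytes = n_weights * struct.calcsize("b")  # 1 byte per int8

print(fp32_bytes // int8_bytes)  # → 4
```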
The quantization process: (1) reads the calibration table with per-layer scale factors, (2) quantizes weights of supported layer types (Conv, DepthwiseConv, InnerProduct, RNN, LSTM, GRU, Embed, Gemm, MultiHeadAttention, SDPA) from float32 to int8, (3) leaves non-quantizable layers in their original precision, and (4) writes the quantized model.
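The four steps above can be sketched in miniature. This is a hypothetical illustration, not ncnn's implementation: the calibration-table format is simplified to a name-to-scale mapping, layers are plain dicts, and the real ncnn2int8 tool additionally handles per-channel weight scales and activation scales.

```python
# Hypothetical sketch of the quantization pass. Layer and table formats are
# simplified stand-ins for ncnn's actual param/bin and calibration-table files.

QUANTIZABLE = {"Conv", "DepthwiseConv", "InnerProduct", "RNN", "LSTM",
               "GRU", "Embed", "Gemm", "MultiHeadAttention", "SDPA"}

def quantize_weight(w, scale):
    """Symmetric quantization: int8 = clamp(round(fp32 * scale), -127, 127)."""
    return max(-127, min(127, round(w * scale)))

def quantize_model(layers, calibration_table):
    """layers: list of dicts with 'name', 'type', 'weights' (list of floats).
    calibration_table: dict mapping layer name -> weight scale factor.
    Quantizable layers get int8 weights; others keep their float weights."""
    out = []
    for layer in layers:
        if layer["type"] in QUANTIZABLE and layer["name"] in calibration_table:
            scale = calibration_table[layer["name"]]
            q = [quantize_weight(w, scale) for w in layer["weights"]]
            out.append({**layer, "weights": q, "int8": True, "scale": scale})
        else:
            out.append({**layer, "int8": False})  # left in original precision
    return out

layers = [
    {"name": "conv1", "type": "Conv",    "weights": [0.5, -0.25, 1.0]},
    {"name": "prob",  "type": "Softmax", "weights": []},
]
table = {"conv1": 127.0}  # scale = 127 / max_abs, here max_abs = 1.0

quantized = quantize_model(layers, table)
print(quantized[0]["weights"])  # → [64, -32, 127]
print(quantized[1]["int8"])     # → False
```

Note the asymmetry: quantization is a per-layer decision, so a model freely mixes int8 layers with float layers, matching step (3).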
The quantized model uses the same inference API as float32 — ncnn automatically detects quantized weights and uses the int8 execution path when opt.use_int8_inference is true (default).
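The automatic dispatch can be pictured as follows. This is an illustrative sketch, not ncnn's actual code path: only the option name use_int8_inference comes from the source; the class and function are hypothetical.

```python
class Options:
    """Hypothetical stand-in for ncnn's option struct."""
    def __init__(self, use_int8_inference=True):  # int8 path enabled by default
        self.use_int8_inference = use_int8_inference

def select_execution_path(layer_has_int8_weights, opt):
    """Pick the execution path as the description implies: int8 kernels run
    only when the loaded weights are quantized AND the option is enabled."""
    if layer_has_int8_weights and opt.use_int8_inference:
        return "int8"
    return "fp32"

print(select_execution_path(True, Options()))                          # → int8
print(select_execution_path(True, Options(use_int8_inference=False)))  # → fp32
print(select_execution_path(False, Options()))                         # → fp32
```

This is why no application-code change is needed after quantization: the detection happens at model-load time, driven by the weights themselves.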
Usage
Use as the final step in the quantization pipeline, after calibration table generation. Deploy the quantized model for inference on resource-constrained devices. Compare accuracy against the float32 baseline to ensure acceptable quality.
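The accuracy comparison can be as simple as measuring output divergence between the two models on a validation batch. A self-contained sketch, using the int8 round-trip error on a batch of values as a stand-in for running both models (the data and scale are illustrative):

```python
def quantize(x, scale):
    """Symmetric int8 quantization: multiply by scale, round, clamp to ±127."""
    return max(-127, min(127, round(x * scale)))

def dequantize(q, scale):
    return q / scale

# Stand-in for "run fp32 and int8 models, compare outputs": measure the
# error that the int8 round-trip introduces on a batch of activations.
baseline = [0.8, -0.3, 0.05, -0.99]
scale = 127.0  # from the calibration table (scale = 127 / max_abs here)

roundtrip = [dequantize(quantize(x, scale), scale) for x in baseline]
max_abs_err = max(abs(a - b) for a, b in zip(baseline, roundtrip))

# Each int8 step is 1/scale wide, so round-trip error is at most half a step.
assert max_abs_err <= 0.5 / scale
```

In practice the comparison is done on task metrics (top-1 accuracy, mAP, etc.) rather than raw tensor error, since small per-value errors often have negligible effect on the final prediction.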
Theoretical Basis
Symmetric quantization:
int8_value = clamp(round(float32_value × scale), -127, 127)
float32_value ≈ int8_value / scale
Where the scale factor (scale = 127 / threshold, for a per-layer threshold) is determined by the calibration process to minimize quantization error.
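As a numeric illustration of the symmetric scheme (threshold and inputs chosen arbitrarily):

```python
def quantize(x, scale):
    """Symmetric int8 quantization: multiply by scale, round, clamp to ±127."""
    return max(-127, min(127, round(x * scale)))

# With a calibration threshold T = 2.0, the scale is 127 / T = 63.5:
scale = 127 / 2.0

print(quantize(2.0, scale))    # value at the threshold maps to → 127
print(quantize(-2.5, scale))   # values beyond the threshold clamp → -127
print(quantize(1.0, scale))    # round(63.5) → 64
print(quantize(1.0, scale) / scale)  # dequantized back ≈ 1.0079
```

The clamp at ±127 is what makes the threshold choice a trade-off: a smaller threshold gives finer resolution for in-range values but saturates more outliers, which is exactly the error the calibration process balances.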
Supported layer quantization:
Quantizable layers: Conv, DepthwiseConv, InnerProduct, RNN, LSTM, GRU, Embed, Gemm, MultiHeadAttention, SDPA
Non-quantizable: Softmax, Sigmoid, pooling, etc. (remain fp16/fp32)