
Principle:Tencent Ncnn Int8 Model Quantization

From Leeroopedia


Knowledge Sources
Domains Quantization, Model_Optimization, Model_Deployment
Last Updated 2026-02-09 00:00 GMT

Overview

The process of converting a neural network's float32 weights and activations to an int8 representation using calibrated scale factors, reducing model size and enabling integer-arithmetic acceleration.

Description

Post-training int8 quantization converts float32 model weights to int8 using scale factors computed during calibration. This reduces model size by approximately 4x and enables hardware-accelerated integer arithmetic (e.g., ARM dot product instructions, x86 VNNI), providing significant inference speedup on supported platforms.

The quantization process: (1) reads the calibration table with per-layer scale factors, (2) quantizes weights of supported layer types (Conv, DepthwiseConv, InnerProduct, RNN, LSTM, GRU, Embed, Gemm, MultiHeadAttention, SDPA) from float32 to int8, (3) leaves non-quantizable layers in their original precision, and (4) writes the quantized model.
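Step (2) above can be sketched as a simple per-tensor conversion. This is an illustrative, self-contained sketch, not ncnn's actual implementation (ncnn also supports per-channel scales and packs weights into its own model format); it follows the symmetric convention used on this page, where int8 = round(float32 / scale), saturated to [-127, 127].

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of weight quantization: round to nearest and saturate to [-127, 127].
// A single per-layer scale is assumed here for simplicity.
std::vector<int8_t> quantize_weights(const std::vector<float>& w, float scale)
{
    std::vector<int8_t> q(w.size());
    for (size_t i = 0; i < w.size(); i++)
    {
        int v = (int)std::lround(w[i] / scale);
        if (v > 127) v = 127;   // saturate instead of wrapping
        if (v < -127) v = -127;
        q[i] = (int8_t)v;
    }
    return q;
}
```

Note that the symmetric range is [-127, 127] rather than [-128, 127], so that the representable range is symmetric around zero.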

The quantized model uses the same inference API as float32 — ncnn automatically detects quantized weights and uses the int8 execution path when opt.use_int8_inference is true (default).
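A minimal usage sketch of that API, assuming hypothetical file and blob names (model-int8.param, model-int8.bin, "data", "output"); the Net/Extractor calls and the opt.use_int8_inference flag are the standard ncnn C++ API.

```cpp
#include "net.h" // ncnn

// Loading a quantized model is identical to loading a float32 model; ncnn
// detects the int8 weights and takes the int8 execution path when
// opt.use_int8_inference is enabled (it is on by default).
int run_int8(const ncnn::Mat& in, ncnn::Mat& out)
{
    ncnn::Net net;
    net.opt.use_int8_inference = true; // default; shown for clarity

    if (net.load_param("model-int8.param")) // hypothetical file names
        return -1;
    if (net.load_model("model-int8.bin"))
        return -1;

    ncnn::Extractor ex = net.create_extractor();
    ex.input("data", in); // blob names depend on your model
    return ex.extract("output", out);
}
```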

Usage

Use as the final step in the quantization pipeline, after calibration table generation. Deploy the quantized model for inference on resource-constrained devices. Compare accuracy against the float32 baseline to ensure acceptable quality.
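One common quick check when comparing against the float32 baseline is the cosine similarity between the two models' output blobs on the same inputs. A self-contained sketch; the acceptance threshold (e.g. > 0.99) is a project-specific choice, not an ncnn requirement.

```cpp
#include <cmath>
#include <vector>

// Cosine similarity between the float32 baseline output and the int8 output.
// Values close to 1.0 usually indicate acceptable quantization quality.
float cosine_similarity(const std::vector<float>& a, const std::vector<float>& b)
{
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (size_t i = 0; i < a.size(); i++)
    {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-12f);
}
```

For classification models, also compare end-metric accuracy (e.g. top-1) rather than relying on output similarity alone.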

Theoretical Basis

Symmetric quantization:

    x_int8 = round(x_fp32 / scale)
    x_fp32 ≈ x_int8 × scale

Where the scale factor is determined by the calibration process to minimize quantization error, and quantized values are saturated to the symmetric int8 range [-127, 127].
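The symmetric scheme above can be worked through with the simplest possible calibrator, scale = max|x| / 127 (a max-abs calibrator; real calibration tools can use more sophisticated error-minimizing methods). Under this choice the quantize/dequantize round-trip error is bounded by scale / 2, the rounding bound. A self-contained sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Max-abs calibration: pick scale so that the largest magnitude maps to 127.
float max_abs_scale(const std::vector<float>& x)
{
    float m = 0.f;
    for (float v : x) m = std::max(m, std::fabs(v));
    return m / 127.f;
}

// Quantize, dequantize, and return |x_fp32 - x_int8 * scale|.
// With scale = max|x| / 127, |v / scale| <= 127 by construction.
float round_trip_error(float v, float scale)
{
    int8_t q = (int8_t)std::lround(v / scale);
    float back = q * scale; // dequantize
    return std::fabs(v - back);
}
```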

Supported layer quantization:

Quantizable layers: Conv, DepthwiseConv, InnerProduct, RNN, LSTM, GRU, Embed, Gemm, MultiHeadAttention, SDPA
Non-quantizable layers: Softmax, Sigmoid, pooling, etc. (remain fp16/fp32)
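The split above amounts to a type lookup. A minimal sketch that mirrors this page's list; the type strings here follow the page's shorthand, while ncnn's actual decision lives in its conversion tooling, uses ncnn's internal layer type names, and may vary by version.

```cpp
#include <set>
#include <string>

// Predicate mirroring the quantizable-layer list on this page.
// Names are the page's shorthand, not necessarily ncnn's param type strings.
bool is_quantizable(const std::string& layer_type)
{
    static const std::set<std::string> kQuantizable = {
        "Conv", "DepthwiseConv", "InnerProduct",
        "RNN", "LSTM", "GRU", "Embed", "Gemm",
        "MultiHeadAttention", "SDPA"
    };
    return kQuantizable.count(layer_type) != 0;
}
```

Layers outside the set are left in their original precision, so a quantized model is generally mixed-precision.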
