Principle: InternLM LMDeploy SmoothQuant Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model_Compression, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A weight-activation co-quantization algorithm that achieves 8-bit inference (W8A8) by mathematically smoothing activation outliers into the weight matrices before quantization.
Description
SmoothQuant enables simultaneous quantization of both weights and activations to 8-bit (INT8 or FP8), achieving significant speedup through hardware INT8/FP8 matrix multiplication. The challenge is that activations often contain large outliers that make direct quantization lossy.
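To make the outlier problem concrete, here is a minimal NumPy sketch (toy data, not from a real model): a handful of large activation values inflate the per-tensor scale, so the bulk of the distribution rounds coarsely, while the weight tensor with a uniform range quantizes cleanly.

```python
import numpy as np

def fake_quant_int8(x):
    """Symmetric per-tensor INT8 quantize-then-dequantize."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale

rng = np.random.default_rng(0)
act = rng.normal(0, 1, 1000)
act[::100] = 80.0                 # a few outlier values, typical of LLM activations
wgt = rng.normal(0, 1, 1000)      # weights: roughly uniform range, no outliers

act_err = np.abs(act - fake_quant_int8(act)).mean()
wgt_err = np.abs(wgt - fake_quant_int8(wgt)).mean()
print(act_err > 10 * wgt_err)     # outliers make activation error far larger
```

The outliers force the activation scale to `80 / 127`, an order of magnitude coarser than the weight scale, which is exactly the gap SmoothQuant closes.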
The key insight is to apply a per-channel smoothing transformation that migrates the quantization difficulty from activations (which have outliers) to weights (which are more uniform):
$$\hat{X} = X \operatorname{diag}(s)^{-1}, \qquad \hat{W} = \operatorname{diag}(s)\, W, \qquad Y = \hat{X}\hat{W} = XW$$

Where $s$ is a per-channel smoothing factor derived from calibration data. After smoothing, both $\hat{X}$ and $\hat{W}$ are easier to quantize.
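The transformation can be sketched in NumPy. This toy example uses the standard SmoothQuant per-channel factor $s_j = \max(|X_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}$ with $\alpha = 0.5$ (the data and shapes are illustrative, not from a real calibration run):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (4, 8))         # activations: tokens x input channels
X[:, 0] *= 50.0                      # one outlier activation channel
W = rng.normal(0, 1, (8, 3))         # weights: input channels x output channels

alpha = 0.5
# Per-input-channel smoothing factor: s_j = max|X_j|^a / max|W_j|^(1-a)
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_hat = X / s                        # X diag(s)^-1
W_hat = s[:, None] * W               # diag(s) W

# The matrix product is mathematically unchanged ...
print(np.allclose(X @ W, X_hat @ W_hat))
# ... but the activation outlier has been flattened into the weights:
print(np.abs(X_hat).max() < np.abs(X).max())
```

Because the scaling is folded into the previous layer's weights at export time, this smoothing adds no runtime cost.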
LMDeploy supports both INT8 and FP8 (float8_e4m3fn, float8_e5m2) quantization formats. SmoothQuant models require the PyTorch backend.
Usage
Use SmoothQuant when you need W8A8 inference speedup without the larger quality loss of W4A16 quantization. Best for latency-sensitive deployments where 4-bit quantization is too aggressive. Requires calibration data and uses the PyTorch backend.
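As a sketch of the workflow, quantization is typically driven through LMDeploy's `lmdeploy lite smooth_quant` CLI. The flag names below (`--work-dir`, `--quant-dtype`) and the model id are assumptions to be checked against `lmdeploy lite smooth_quant --help` for your installed version:

```python
import shlex

# Hypothetical invocation of the LMDeploy smooth_quant CLI; verify flag names
# against your installed LMDeploy version before running.
model = "internlm/internlm2_5-7b-chat"            # example model id (assumption)
cmd = [
    "lmdeploy", "lite", "smooth_quant", model,
    "--work-dir", "./internlm2_5-7b-chat-w8a8",   # where the W8A8 model is written
    "--quant-dtype", "int8",                      # or "fp8" on GPUs that support it
]
print(shlex.join(cmd))
```

The resulting directory is then served with the PyTorch backend, since SmoothQuant models are not supported by the TurboMind engine.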
Theoretical Basis
The smoothing factor balances quantization difficulty between activations and weights:
$$s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}$$

Where $\alpha$ (typically 0.5) controls the migration strength. Higher $\alpha$ pushes more quantization difficulty to weights.
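A small NumPy check of how the migration-strength hyperparameter shifts difficulty, using the same per-channel max statistics the factor is built from (toy data, not a real calibration set):

```python
import numpy as np

def smooth(X, W, alpha):
    """Apply the per-channel SmoothQuant scaling for a given alpha."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    return X / s, s[:, None] * W

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (16, 8))
X[:, 3] *= 100.0                     # outlier activation channel
W = rng.normal(0, 1, (8, 4))

# Sweeping alpha: larger values shrink the activation range further
# while growing the weight range correspondingly.
for alpha in (0.25, 0.5, 0.75):
    X_hat, W_hat = smooth(X, W, alpha)
    print(alpha, float(np.abs(X_hat).max()), float(np.abs(W_hat).max()))
```

In practice 0.5 splits the difficulty evenly and is a good default; a larger value helps only when activations remain the quantization bottleneck.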
After smoothing, standard per-tensor or per-channel symmetric quantization is applied:

$$Q(x) = \operatorname{round}\!\left(\frac{x}{\Delta}\right), \qquad \Delta = \frac{\max(|x|)}{2^{N-1}-1}$$

where $\Delta$ is the quantization step and $N = 8$ for INT8.
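This final step can be sketched as a per-tensor symmetric INT8 round trip, with the step size taken from the tensor's maximum absolute value (toy data; per-channel quantization would simply compute one step per channel):

```python
import numpy as np

def sym_quant_int8(x):
    """Per-tensor symmetric quantization to INT8 plus the step size used."""
    delta = np.abs(x).max() / 127.0              # step for N = 8: max|x| / (2^7 - 1)
    q = np.clip(np.round(x / delta), -127, 127).astype(np.int8)
    return q, delta

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 512).astype(np.float32)

q, delta = sym_quant_int8(x)
x_deq = q.astype(np.float32) * delta             # dequantize for comparison

# Round-trip error is bounded by half a quantization step:
print(np.abs(x - x_deq).max() <= delta / 2 + 1e-6)
```

After smoothing, no single value dominates $\max(|x|)$, so $\Delta$ (and hence this error bound) stays small for activations as well as weights.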