Principle: InternLM LMDeploy SmoothQuant Quantization
| Knowledge Sources | |
|---|---|
| Domains | Model_Compression, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A weight-activation co-quantization algorithm that achieves 8-bit inference (W8A8) by mathematically smoothing activation outliers into the weight matrices before quantization.
Description
SmoothQuant enables simultaneous quantization of both weights and activations to 8-bit (INT8 or FP8), achieving significant speedup through hardware INT8/FP8 matrix multiplication. The challenge is that activations often contain large outliers that make direct quantization lossy.
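To make the outlier problem concrete, here is a minimal NumPy sketch (toy data, not from a real model): a handful of large activation values inflate the per-tensor scale, so the bulk of the distribution rounds coarsely, while the weight tensor with a uniform range quantizes cleanly.

```python
import numpy as np

def fake_quant_int8(x):
    """Symmetric per-tensor INT8 quantize-then-dequantize."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale

rng = np.random.default_rng(0)
act = rng.normal(0, 1, 1000)
act[::100] = 80.0                 # a few outlier values, typical of LLM activations
wgt = rng.normal(0, 1, 1000)      # weights: roughly uniform range, no outliers

act_err = np.abs(act - fake_quant_int8(act)).mean()
wgt_err = np.abs(wgt - fake_quant_int8(wgt)).mean()
print(act_err > 10 * wgt_err)     # outliers make activation error far larger
```

The outliers force the activation scale to `80 / 127`, an order of magnitude coarser than the weight scale, which is exactly the gap SmoothQuant closes.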
The key insight is to apply a per-channel smoothing transformation that migrates the quantization difficulty from activations (which have outliers) to weights (which are more uniform):
$$\hat{X} = X \operatorname{diag}(s)^{-1}, \qquad \hat{W} = \operatorname{diag}(s)\, W, \qquad Y = \hat{X}\hat{W} = XW$$

Where $s$ is a per-channel smoothing factor derived from calibration data. After smoothing, both $\hat{X}$ and $\hat{W}$ are easier to quantize.
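The transformation can be sketched in NumPy. This toy example uses the standard SmoothQuant per-channel factor $s_j = \max(|X_j|)^{\alpha} / \max(|W_j|)^{1-\alpha}$ with $\alpha = 0.5$ (the data and shapes are illustrative, not from a real calibration run):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (4, 8))         # activations: tokens x input channels
X[:, 0] *= 50.0                      # one outlier activation channel
W = rng.normal(0, 1, (8, 3))         # weights: input channels x output channels

alpha = 0.5
# Per-input-channel smoothing factor: s_j = max|X_j|^a / max|W_j|^(1-a)
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_hat = X / s                        # X diag(s)^-1
W_hat = s[:, None] * W               # diag(s) W

# The matrix product is mathematically unchanged ...
print(np.allclose(X @ W, X_hat @ W_hat))
# ... but the activation outlier has been flattened into the weights:
print(np.abs(X_hat).max() < np.abs(X).max())
```

Because the scaling is folded into the previous layer's weights at export time, this smoothing adds no runtime cost.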
LMDeploy supports both INT8 and FP8 (float8_e4m3fn, float8_e5m2) quantization formats. SmoothQuant models require the PyTorch backend.
Usage
Use SmoothQuant when you need W8A8 inference speedup without the larger quality loss of W4A16 quantization. Best for latency-sensitive deployments where 4-bit quantization is too aggressive. Requires calibration data and uses the PyTorch backend.
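As a sketch of the workflow, quantization is typically driven through LMDeploy's `lmdeploy lite smooth_quant` CLI. The flag names below (`--work-dir`, `--quant-dtype`) and the model id are assumptions to be checked against `lmdeploy lite smooth_quant --help` for your installed version:

```python
import shlex

# Hypothetical invocation of the LMDeploy smooth_quant CLI; verify flag names
# against your installed LMDeploy version before running.
model = "internlm/internlm2_5-7b-chat"            # example model id (assumption)
cmd = [
    "lmdeploy", "lite", "smooth_quant", model,
    "--work-dir", "./internlm2_5-7b-chat-w8a8",   # where the W8A8 model is written
    "--quant-dtype", "int8",                      # or "fp8" on GPUs that support it
]
print(shlex.join(cmd))
```

The resulting directory is then served with the PyTorch backend, since SmoothQuant models are not supported by the TurboMind engine.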
Theoretical Basis
The smoothing factor balances quantization difficulty between activations and weights:
$$s_j = \frac{\max\left(|X_j|\right)^{\alpha}}{\max\left(|W_j|\right)^{1-\alpha}}$$

Where $\alpha$ (typically 0.5) controls the migration strength. Higher $\alpha$ pushes more quantization difficulty to weights.
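A small NumPy check of how the migration-strength hyperparameter shifts difficulty, using the same per-channel max statistics the factor is built from (toy data, not a real calibration set):

```python
import numpy as np

def smooth(X, W, alpha):
    """Apply the per-channel SmoothQuant scaling for a given alpha."""
    s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)
    return X / s, s[:, None] * W

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (16, 8))
X[:, 3] *= 100.0                     # outlier activation channel
W = rng.normal(0, 1, (8, 4))

# Sweeping alpha: larger values shrink the activation range further
# while growing the weight range correspondingly.
for alpha in (0.25, 0.5, 0.75):
    X_hat, W_hat = smooth(X, W, alpha)
    print(alpha, float(np.abs(X_hat).max()), float(np.abs(W_hat).max()))
```

In practice 0.5 splits the difficulty evenly and is a good default; a larger value helps only when activations remain the quantization bottleneck.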
After smoothing, standard per-tensor or per-channel symmetric quantization is applied:

$$Q(x) = \operatorname{round}\!\left(\frac{x}{\Delta}\right), \qquad \Delta = \frac{\max(|x|)}{2^{N-1}-1}$$

where $\Delta$ is the quantization step and $N = 8$ for INT8.
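This final step can be sketched as a per-tensor symmetric INT8 round trip, with the step size taken from the tensor's maximum absolute value (toy data; per-channel quantization would simply compute one step per channel):

```python
import numpy as np

def sym_quant_int8(x):
    """Per-tensor symmetric quantization to INT8 plus the step size used."""
    delta = np.abs(x).max() / 127.0              # step for N = 8: max|x| / (2^7 - 1)
    q = np.clip(np.round(x / delta), -127, 127).astype(np.int8)
    return q, delta

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 512).astype(np.float32)

q, delta = sym_quant_int8(x)
x_deq = q.astype(np.float32) * delta             # dequantize for comparison

# Round-trip error is bounded by half a quantization step:
print(np.abs(x - x_deq).max() <= delta / 2 + 1e-6)
```

After smoothing, no single value dominates $\max(|x|)$, so $\Delta$ (and hence this error bound) stays small for activations as well as weights.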