
Principle:InternLM Lmdeploy SmoothQuant Quantization

From Leeroopedia


Knowledge Sources
Domains: Model_Compression, Quantization
Last Updated: 2026-02-07 15:00 GMT

Overview

A weight-activation co-quantization algorithm that achieves 8-bit inference (W8A8) by migrating activation outliers into the weight matrices through a mathematically equivalent per-channel scaling applied before quantization.

Description

SmoothQuant quantizes both weights and activations to 8 bits (INT8 or FP8), so that matrix multiplications can run on hardware INT8/FP8 units instead of 16-bit ones. The challenge is that activations often contain large outliers in a few channels: these outliers stretch the quantization range, leaving too few levels for the remaining values and making direct quantization lossy.

The key insight is to apply a per-channel smoothing transformation that migrates the quantization difficulty from activations (which have outliers) to weights (which are more uniform):

$$Y = \left(X \,\mathrm{diag}(s)^{-1}\right)\left(\mathrm{diag}(s)\, W\right) = \hat{X}\hat{W}$$

Where $s$ is a smoothing factor derived from calibration data. After smoothing, both $\hat{X}$ and $\hat{W}$ are easier to quantize.
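The identity above can be checked numerically. The following is a minimal NumPy sketch, not LMDeploy code: the toy shapes, the injected outlier channel, and the simple square-root choice of `s` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations X (tokens x in_features) and weights W (in_features x out_features).
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                      # assumption: channel 3 carries large outliers
W = rng.normal(size=(8, 5))

# Hypothetical per-channel smoothing factor (one value per input channel).
s = np.sqrt(np.abs(X).max(axis=0))

X_hat = X / s                        # X diag(s)^{-1}: divides each column of X by s_j
W_hat = W * s[:, None]               # diag(s) W: multiplies each row of W by s_j

Y = X @ W
Y_hat = X_hat @ W_hat

print("outputs match:", np.allclose(Y, Y_hat))
print("max |X|:", np.abs(X).max(), "max |X_hat|:", np.abs(X_hat).max())
```

The product is unchanged because the scaling and its inverse cancel inside the matrix multiplication, while the largest activation magnitude shrinks.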

LMDeploy supports both INT8 and FP8 (float8_e4m3fn, float8_e5m2) quantization formats. SmoothQuant models require the PyTorch backend.

Usage

Use SmoothQuant when you need W8A8 inference speedup without the larger quality loss of W4A16 quantization. Best for latency-sensitive deployments where 4-bit quantization is too aggressive. Requires calibration data and uses the PyTorch backend.

Theoretical Basis

The smoothing factor balances quantization difficulty between activations and weights:

$$s_j = \frac{\max(|X_j|)^{\alpha}}{\max(|W_j|)^{1-\alpha}}$$

Where α (typically 0.5) controls the migration strength. Higher α pushes more quantization difficulty to weights.
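As an illustration of the smoothing factor, here is a small NumPy sketch. The function name `smoothing_factor`, the `eps` guard against division by zero, and the toy data are assumptions for this example and are not taken from LMDeploy.

```python
import numpy as np

def smoothing_factor(X, W, alpha=0.5, eps=1e-8):
    """Per-channel s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    X: calibration activations, shape (tokens, in_features)
    W: weights, shape (in_features, out_features)
    """
    act_max = np.maximum(np.abs(X).max(axis=0), eps)   # max over tokens, per channel
    w_max = np.maximum(np.abs(W).max(axis=1), eps)     # max over outputs, per channel
    return act_max ** alpha / w_max ** (1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 2] *= 40.0                      # assumption: one outlier channel
W = rng.normal(size=(8, 6))

s_half = smoothing_factor(X, W, alpha=0.5)
s_one = smoothing_factor(X, W, alpha=1.0)

# Per-channel ranges after smoothing with alpha = 0.5 and alpha = 1.0.
x_range_half = np.abs(X / s_half).max(axis=0)
w_range_half = np.abs(W * s_half[:, None]).max(axis=1)
x_range_one = np.abs(X / s_one).max(axis=0)
```

With α = 0.5 the per-channel ranges of the smoothed activations and weights coincide, which is the balance described above. With α = 1.0 every activation channel is normalized to a maximum of 1, so the whole difficulty moves to the weights.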

After smoothing, standard per-tensor or per-channel symmetric quantization is applied:

$$Q(x) = \mathrm{round}\!\left(\frac{x}{\mathrm{scale}}\right), \qquad \mathrm{scale} = \frac{\max(|x|)}{2^{b-1} - 1}$$
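The quantization step can be sketched in NumPy as follows. This is a simplified illustration rather than the LMDeploy kernel: the helper name `quantize_symmetric`, the `axis` argument used to switch between per-tensor and per-channel scales, and the `int8` storage type (valid only for b ≤ 8) are assumptions.

```python
import numpy as np

def quantize_symmetric(x, bits=8, axis=None):
    """Symmetric quantization: q = round(x / scale), scale = max|x| / (2^(b-1) - 1).

    axis=None gives one scale for the whole tensor (per-tensor);
    an integer axis gives one scale per slice along the other axis (per-channel).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max(axis=axis, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / qmax          # guard against all-zero input
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))

q_tensor, scale_tensor = quantize_symmetric(x)            # per-tensor scale
q_channel, scale_channel = quantize_symmetric(x, axis=0)  # one scale per column

# Dequantize and measure the round-trip error.
err_tensor = np.abs(q_tensor * scale_tensor - x).max()
err_channel = np.abs(q_channel * scale_channel - x).max()
```

Because rounding moves each value by at most half a quantization step, the round-trip error is bounded by `scale / 2`. A large outlier increases `scale` and therefore the error for every other value, which is why smoothing the activations first helps.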

Related Pages

Implemented By
