Principle: InternLM LMDeploy W8A8 Quantized Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
An inference pattern for running SmoothQuant (W8A8) quantized models through the PyTorch backend with INT8 or FP8 computation.
Description
W8A8 Quantized Inference deploys SmoothQuant-quantized models using the PyTorch backend. Unlike AWQ/GPTQ models that use TurboMind, SmoothQuant models require the PyTorch backend because:
- INT8/FP8 matrix multiplication is handled by PyTorch's native or custom kernels
- The SmoothQuant weight format is not supported by the TurboMind C++ engine
- The PyTorch backend provides broader device support for quantized inference
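Before deployment, a full-precision checkpoint must first be converted to the SmoothQuant W8A8 format. A minimal sketch using LMDeploy's lite tooling; the model name and work directory are illustrative:

```shell
# Convert an FP16 checkpoint to SmoothQuant W8A8 (paths are examples).
# Requires lmdeploy to be installed: pip install lmdeploy
lmdeploy lite smooth_quant internlm/internlm2-chat-7b \
    --work-dir ./internlm2-chat-7b-w8a8
```

The resulting directory contains INT8 weights plus a quantization_config that the PyTorch backend uses to detect the format at load time.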
The inference API is identical to that for full-precision models; only the backend configuration changes.
Usage
Use PytorchEngineConfig (not TurbomindEngineConfig) when deploying SmoothQuant models. The model format is auto-detected from the model's quantization_config.
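A minimal sketch of the configuration described above, assuming a local W8A8 checkpoint directory (the path and session_len value are illustrative):

```python
# Sketch: serving a SmoothQuant (W8A8) model via the PyTorch backend.
# Only backend_config differs from a full-precision deployment; the
# W8A8 format is auto-detected from the checkpoint's quantization_config.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    "./internlm2-chat-7b-w8a8",               # illustrative local path
    backend_config=PytorchEngineConfig(session_len=4096),
)
responses = pipe(["Explain SmoothQuant in one sentence."])
print(responses[0].text)
```

Passing TurbomindEngineConfig here would fail, since the TurboMind C++ engine does not recognize the SmoothQuant weight format.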
Theoretical Basis
W8A8 inference computes, for activations X quantized as X_q = round(X / s_x) and weights W quantized as W_q = round(W / s_w):

Y ≈ (s_x · s_w) · (X_q · W_q^T)

using hardware INT8 or FP8 matrix-multiply units for the inner product, then dequantizing to FP16 for accumulation and non-linear operations.
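The computation above can be sketched in NumPy. This is a toy illustration of symmetric per-tensor INT8 quantization, not LMDeploy's actual kernels (which use per-channel scales and fused GPU ops):

```python
import numpy as np

def quantize_per_tensor(x, n_bits=8):
    """Symmetric per-tensor quantization: returns (int8 tensor, FP scale)."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16)).astype(np.float32)   # activations
W = rng.standard_normal((8, 16)).astype(np.float32)   # weight rows

Xq, sx = quantize_per_tensor(X)
Wq, sw = quantize_per_tensor(W)

# INT8 inner products accumulate in INT32; a single multiply by
# (sx * sw) dequantizes the result back to floating point.
Y_int32 = Xq.astype(np.int32) @ Wq.astype(np.int32).T
Y = (sx * sw) * Y_int32

Y_ref = X @ W.T
rel_err = np.abs(Y - Y_ref).max() / np.abs(Y_ref).max()
print(Y.shape, rel_err)
```

The relative error stays small because the quantization grid covers the tensor's dynamic range; SmoothQuant's offline smoothing exists precisely to keep activation outliers from inflating s_x.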