Principle: InternLM LMDeploy W8A8 Quantized Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
An inference pattern for running SmoothQuant (W8A8) quantized models through the PyTorch backend with INT8 or FP8 computation.
Description
W8A8 Quantized Inference deploys SmoothQuant-quantized models using the PyTorch backend. Unlike AWQ/GPTQ models that use TurboMind, SmoothQuant models require the PyTorch backend because:
- INT8/FP8 matrix multiplication is handled by PyTorch's native or custom kernels
- The SmoothQuant weight format is not supported by the TurboMind C++ engine
- The PyTorch backend provides broader device support for quantized inference
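Before deployment, a full-precision checkpoint must first be converted to the SmoothQuant W8A8 format. A minimal sketch using LMDeploy's lite tooling; the model name and work directory are illustrative:

```shell
# Convert an FP16 checkpoint to SmoothQuant W8A8 (paths are examples).
# Requires lmdeploy to be installed: pip install lmdeploy
lmdeploy lite smooth_quant internlm/internlm2-chat-7b \
    --work-dir ./internlm2-chat-7b-w8a8
```

The resulting directory contains INT8 weights plus a quantization_config that the PyTorch backend uses to detect the format at load time.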
The inference API is identical to that for full-precision models; only the backend configuration changes.
Usage
Use PytorchEngineConfig (not TurbomindEngineConfig) when deploying SmoothQuant models. The model format is auto-detected from the model's quantization_config.
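A minimal sketch of the configuration described above, assuming a local W8A8 checkpoint directory (the path and session_len value are illustrative):

```python
# Sketch: serving a SmoothQuant (W8A8) model via the PyTorch backend.
# Only backend_config differs from a full-precision deployment; the
# W8A8 format is auto-detected from the checkpoint's quantization_config.
from lmdeploy import pipeline, PytorchEngineConfig

pipe = pipeline(
    "./internlm2-chat-7b-w8a8",               # illustrative local path
    backend_config=PytorchEngineConfig(session_len=4096),
)
responses = pipe(["Explain SmoothQuant in one sentence."])
print(responses[0].text)
```

Passing TurbomindEngineConfig here would fail, since the TurboMind C++ engine does not recognize the SmoothQuant weight format.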
Theoretical Basis
W8A8 inference computes, for activations X quantized as X_q = round(X / s_x) and weights W quantized as W_q = round(W / s_w):

Y ≈ (s_x · s_w) · (X_q · W_q^T)

using hardware INT8 or FP8 matrix-multiply units for the inner product, then dequantizing to FP16 for accumulation and non-linear operations.
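The computation above can be sketched in NumPy. This is a toy illustration of symmetric per-tensor INT8 quantization, not LMDeploy's actual kernels (which use per-channel scales and fused GPU ops):

```python
import numpy as np

def quantize_per_tensor(x, n_bits=8):
    """Symmetric per-tensor quantization: returns (int8 tensor, FP scale)."""
    qmax = 2 ** (n_bits - 1) - 1                      # 127 for INT8
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 16)).astype(np.float32)   # activations
W = rng.standard_normal((8, 16)).astype(np.float32)   # weight rows

Xq, sx = quantize_per_tensor(X)
Wq, sw = quantize_per_tensor(W)

# INT8 inner products accumulate in INT32; a single multiply by
# (sx * sw) dequantizes the result back to floating point.
Y_int32 = Xq.astype(np.int32) @ Wq.astype(np.int32).T
Y = (sx * sw) * Y_int32

Y_ref = X @ W.T
rel_err = np.abs(Y - Y_ref).max() / np.abs(Y_ref).max()
print(Y.shape, rel_err)
```

The relative error stays small because the quantization grid covers the tensor's dynamic range; SmoothQuant's offline smoothing exists precisely to keep activation outliers from inflating s_x.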