
Principle:InternLM Lmdeploy W8A8 Quantized Inference

From Leeroopedia


Knowledge Sources
Domains LLM_Inference, Quantization
Last Updated 2026-02-07 15:00 GMT

Overview

An inference pattern for running SmoothQuant (W8A8) quantized models through the PyTorch backend with INT8 or FP8 computation.

Description

W8A8 Quantized Inference deploys SmoothQuant-quantized models using the PyTorch backend. Unlike AWQ/GPTQ models that use TurboMind, SmoothQuant models require the PyTorch backend because:

  • INT8/FP8 matrix multiplication is handled by PyTorch's native or custom kernels
  • The SmoothQuant weight format is not supported by the TurboMind C++ engine
  • The PyTorch backend provides broader device support for quantized inference

The inference API is identical to that for full-precision models; only the backend configuration changes.

Usage

Use PytorchEngineConfig (not TurbomindEngineConfig) when deploying SmoothQuant models. The model format is auto-detected from the model's quantization_config.
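A minimal configuration sketch of this pattern, assuming a locally available SmoothQuant checkpoint (the model path below is hypothetical; `pipeline` and `PytorchEngineConfig` are LMDeploy's public API):

```python
from lmdeploy import pipeline, PytorchEngineConfig

# No quantization flag is needed: the W8A8 format is auto-detected
# from the checkpoint's quantization_config.
backend_config = PytorchEngineConfig()

# Hypothetical path to a SmoothQuant-quantized model.
pipe = pipeline("./internlm2-chat-7b-w8a8", backend_config=backend_config)
print(pipe(["What is SmoothQuant?"]))
```

Passing `TurbomindEngineConfig` here would fail, since TurboMind cannot load the SmoothQuant weight format.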

Theoretical Basis

W8A8 inference computes: Y = dequant(quant(X) × quant(W))

Hardware INT8 or FP8 matrix-multiply units perform the inner products, accumulating in higher precision (INT32 for INT8 inputs); the accumulator is then dequantized to FP16 for subsequent non-linear operations.
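The compute pattern above can be sketched in plain Python with symmetric per-tensor scales (illustrative only; real kernels use per-channel or per-token scales and hardware INT8 matrix-multiply units):

```python
def quantize(values, scale):
    """Symmetric per-tensor quantization to INT8 with clamping."""
    return [max(-128, min(127, round(v / scale))) for v in values]

# A row of activations X and a column of weights W (toy 1x2 @ 2x1 matmul).
X = [0.5, -1.0]
W = [2.0, 0.25]

# Scales map the largest magnitude onto the INT8 range [-127, 127].
sx = max(abs(v) for v in X) / 127
sw = max(abs(v) for v in W) / 127

qx = quantize(X, sx)
qw = quantize(W, sw)

# Integer inner product (accumulated in INT32 on real hardware).
acc = sum(a * b for a, b in zip(qx, qw))

# Dequantize the accumulator back to floating point: Y = acc * sx * sw.
y = acc * sx * sw
print(y)  # close to the exact FP result 0.5*2.0 + (-1.0)*0.25 = 0.75
```

The quantization error here is the gap between `y` and the exact 0.75; SmoothQuant's contribution is migrating activation outliers into the weights so that both tensors quantize well under such coarse scales.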
