Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:InternLM Lmdeploy W8A8 SmoothQuant Quantization

From Leeroopedia


Knowledge Sources
Domains LLM_Ops, Quantization, Inference
Last Updated 2026-02-07 15:00 GMT

Overview

End-to-end process for applying 8-bit weight-activation quantization (W8A8) using SmoothQuant to compress LLMs for efficient inference with LMDeploy's PyTorch engine.

Description

This workflow covers the complete pipeline from selecting a pre-trained model to quantizing both weights and activations to 8-bit precision using the SmoothQuant algorithm, then deploying the quantized model. SmoothQuant migrates quantization difficulty from activations to weights by applying mathematically equivalent per-channel scaling transformations, enabling effective INT8 or FP8 quantization of both components. The quantized model runs on the PyTorch backend, which replaces standard RMSNorm and Linear layers with quantization-aware variants (QRMSNorm, QLinear). LMDeploy supports both INT8 (Volta through Hopper GPUs) and FP8 (Ada Lovelace and Hopper GPUs).

Usage

Execute this workflow when you need INT8 or FP8 quantization that quantizes both weights and activations for maximum compute efficiency. W8A8 is particularly beneficial on modern GPUs with INT8/FP8 tensor cores (A100, H100, RTX 40-series) where the reduced precision directly translates to faster matrix multiplications. Choose INT8 for broad GPU compatibility or FP8 for maximum performance on Ada Lovelace and Hopper architectures.

Execution Steps

Step 1: Environment Setup

Install LMDeploy with all optional dependencies using pip install lmdeploy[all]. The SmoothQuant quantization module requires access to the full model weights and sufficient memory for the calibration and transformation process. Verify GPU compute capability matches the target quantization type (INT8: sm70+, FP8: sm89+).

Key considerations:

  • INT8 quantization is supported on Volta (V100) through Hopper (H100) GPUs
  • FP8 quantization requires Ada Lovelace (RTX 40-series) or Hopper (H100) GPUs
  • Use lmdeploy[all] to ensure all quantization dependencies are installed

Step 2: Weight Smoothing and Quantization

Run the SmoothQuant quantization using the lmdeploy lite smooth_quant CLI command with the source model path, output directory, and quantization dtype (int8 or fp8). The process performs three internal steps: (1) smooth the activation outliers by migrating quantization difficulty to weights via per-channel scaling, (2) replace model modules (RMSNorm, Linear) with their quantized counterparts (QRMSNorm, QLinear), and (3) save the transformed model.

What happens:

  • Model weights are loaded in full precision
  • Activation statistics are collected using calibration data
  • Per-channel smoothing factors are computed to balance weight and activation ranges
  • Smoothing transforms are applied: weights are scaled up, activations are scaled down
  • Standard modules are replaced with quantization-aware Q-variants
  • Quantized model is saved to the work directory with config metadata

Step 3: Quantization Validation

Verify the quantized model by running inference with a few test prompts using the PyTorch backend pipeline. Compare output quality with the original FP16 model to ensure acceptable accuracy preservation. For rigorous evaluation, use OpenCompass benchmarks.

Key considerations:

  • W8A8 quantized models run exclusively on the PyTorch backend
  • Quick manual testing catches severe quantization degradation
  • FP8 generally preserves more accuracy than INT8 due to better dynamic range

Step 4: Quantized Model Inference

Deploy the quantized model using the LMDeploy pipeline API with PytorchEngineConfig. The PyTorch engine automatically detects the quantized format from the saved model configuration and activates the INT8/FP8 compute kernels. The quantized model benefits from reduced memory bandwidth requirements and accelerated tensor core operations.

Key considerations:

  • Use PytorchEngineConfig (not TurbomindEngineConfig) for W8A8 models
  • Tensor parallelism is supported for multi-GPU deployment
  • Memory savings come from both weight compression and activation quantization

Step 5: Quantized Model Serving

Optionally deploy the quantized model as an OpenAI-compatible API server using lmdeploy serve api_server with --backend pytorch. This combines W8A8 efficiency with the standard serving infrastructure for production deployments.

Key considerations:

  • Pass --backend pytorch to the serve command for W8A8 models
  • All standard API server features work with W8A8 quantized models
  • FP8 serving on H100 GPUs provides the best throughput for W8A8 deployment

Execution Diagram

GitHub URL

Workflow Repository