Implementation: InternLM LMDeploy Pipeline Factory (PyTorch)
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Quantization |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A concrete use of LMDeploy's pipeline() factory for creating inference pipelines for SmoothQuant (W8A8) quantized models with the PyTorch backend.
Description
This is the pipeline() factory function used specifically for SmoothQuant W8A8 model inference. SmoothQuant models must use PytorchEngineConfig because the TurboMind backend does not support the SmoothQuant weight format. The model's quantization_config is auto-detected.
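As an illustration of what "auto-detected" means here, the sketch below shows how a quantization method could be read from a model directory's config.json. This is not LMDeploy's internal code; the `quantization_config` key layout and the `smooth_quant` method name are assumptions for this example.

```python
# Illustrative sketch (not LMDeploy internals): detect the quantization
# method recorded in a model directory's config.json.
import json
import os
import tempfile


def detect_quant_method(model_dir: str):
    """Return the quant method from config.json, or None if absent."""
    with open(os.path.join(model_dir, "config.json")) as f:
        cfg = json.load(f)
    return cfg.get("quantization_config", {}).get("quant_method")


# Demo with a temporary stand-in model directory.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "config.json"), "w") as f:
        json.dump({"quantization_config": {"quant_method": "smooth_quant"}}, f)
    print(detect_quant_method(d))  # smooth_quant
```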
Usage
Use this after quantizing a model with smooth_quant. Pass PytorchEngineConfig (not TurbomindEngineConfig) to ensure the correct backend is selected.
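The backend constraint above can be expressed as a simple guard. The sketch below is not part of LMDeploy; the two config classes are minimal stand-in dataclasses (mirroring the names of LMDeploy's PytorchEngineConfig and TurbomindEngineConfig) and the `check_backend_config` helper is hypothetical, shown only to make the rule concrete.

```python
# Illustrative guard (hypothetical helper, stand-in config classes):
# reject a TurboMind config when the model is SmoothQuant (W8A8) quantized.
from dataclasses import dataclass


@dataclass
class PytorchEngineConfig:   # stand-in for lmdeploy.PytorchEngineConfig
    tp: int = 1


@dataclass
class TurbomindEngineConfig:  # stand-in for lmdeploy.TurbomindEngineConfig
    tp: int = 1


def check_backend_config(backend_config, quant_method):
    """Raise if a SmoothQuant model is paired with the TurboMind backend."""
    if quant_method == "smooth_quant" and isinstance(
        backend_config, TurbomindEngineConfig
    ):
        raise ValueError(
            "SmoothQuant (W8A8) weights are not supported by TurboMind; "
            "pass PytorchEngineConfig instead."
        )
    return backend_config


cfg = check_backend_config(PytorchEngineConfig(tp=2), "smooth_quant")
```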
Code Reference
Source Location
- Repository: lmdeploy
- File: lmdeploy/api.py L15-74, lmdeploy/messages.py L297-442
Signature
# Same pipeline() factory with PyTorch backend for W8A8
pipe = pipeline(
    model_path,
    backend_config=PytorchEngineConfig(tp=N)
)
Import
from lmdeploy import pipeline, PytorchEngineConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Path to SmoothQuant-quantized model |
| backend_config | PytorchEngineConfig | Yes | PyTorch backend config (required for SmoothQuant) |
Outputs
| Name | Type | Description |
|---|---|---|
| Pipeline | Pipeline | Inference pipeline with W8A8 kernels active |
Usage Examples
from lmdeploy import pipeline, PytorchEngineConfig

# Load SmoothQuant model with PyTorch backend
backend_config = PytorchEngineConfig(
    tp=2,
    session_len=4096,
    cache_max_entry_count=0.8
)
pipe = pipeline('./internlm2_5-7b-w8a8', backend_config=backend_config)
response = pipe('What is machine learning?')
print(response.text)
pipe.close()