Implementation: InternLM LMDeploy PytorchEngineConfig
| Knowledge Sources | Details |
|---|---|
| Domains | LLM_Inference, Configuration |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A concrete configuration class for the PyTorch inference backend provided by the LMDeploy library.
Description
PytorchEngineConfig is a Python dataclass that parameterizes the PyTorch-based inference engine. It supports data parallelism, expert parallelism, LoRA adapter serving, and deployment on platforms beyond NVIDIA CUDA.
Usage
Use this class when deploying models on non-NVIDIA hardware, running SmoothQuant (W8A8) quantized models, serving LoRA adapters, or loading a model architecture that TurboMind does not support.
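For example, the PyTorch engine can also be selected from the command line when launching an OpenAI-compatible server; this is a sketch (the model name is illustrative), with CLI flags mirroring PytorchEngineConfig fields such as tp and session_len:

```shell
# --backend pytorch selects the PyTorch engine instead of TurboMind.
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --backend pytorch \
    --tp 1 \
    --session-len 4096
```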
Code Reference
Source Location
- Repository: lmdeploy
- File: lmdeploy/messages.py
- Lines: L297-442
Signature
@dataclass
class PytorchEngineConfig:
dtype: str = 'auto' # 'auto', 'float16', 'bfloat16'
tp: int = 1 # Tensor parallelism
dp: int = 1 # Data parallelism
dp_rank: int = 0 # DP rank
ep: int = 1 # Expert parallelism (MoE)
session_len: int = None # Max sequence length
max_batch_size: int = None # Max concurrent requests
cache_max_entry_count: float = 0.8 # KV cache memory fraction
prefill_interval: int = 16 # Prefill scheduling interval
block_size: int = 64 # Paging cache block size
num_cpu_blocks: int = 0 # CPU offload blocks
num_gpu_blocks: int = 0 # GPU cache blocks (0=auto)
adapters: Dict[str, str] = None # LoRA adapter paths
max_prefill_token_num: int = 4096 # Tokens per prefill iteration
thread_safe: bool = False # Thread-safe mode
enable_prefix_caching: bool = False # Prefix caching
device_type: str = 'cuda' # 'cuda', 'ascend', 'maca', 'camb'
eager_mode: bool = False # Eager execution mode
download_dir: str = None # Model download directory
revision: str = None # Model version
quant_policy: Literal[0, 4, 8] = 0 # KV cache quantization
distributed_executor_backend: str = None # 'uni', 'mp', 'ray'
empty_init: bool = False # Skip weight loading
model_format: str = None # 'fp8' for FP8 models
enable_metrics: bool = True # Metrics collection
Import
from lmdeploy import PytorchEngineConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dtype | str | No | Weight precision: 'auto', 'float16', 'bfloat16' |
| tp | int | No | Tensor parallelism GPUs (default: 1) |
| dp | int | No | Data parallelism groups (default: 1) |
| ep | int | No | Expert parallelism for MoE (default: 1) |
| session_len | int | No | Max sequence length |
| cache_max_entry_count | float | No | KV cache GPU memory fraction (default: 0.8) |
| adapters | Dict[str, str] | No | LoRA adapter name-to-path mapping |
| device_type | str | No | Target device: 'cuda', 'ascend', 'maca', 'camb' |
Outputs
| Name | Type | Description |
|---|---|---|
| PytorchEngineConfig | dataclass | Validated configuration for PyTorch backend |
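Since the config is an ordinary dataclass, unset fields keep their defaults and an instance can be constructed and inspected without a GPU. A minimal sketch (the parallelism values are illustrative, not a recommendation):

```python
from dataclasses import asdict
from lmdeploy import PytorchEngineConfig

# Example: a MoE model sharded across 2 GPUs with expert parallelism.
cfg = PytorchEngineConfig(tp=2, ep=2, cache_max_entry_count=0.6)

# Fields not passed explicitly keep their dataclass defaults.
print(cfg.dtype)            # 'auto'
print(asdict(cfg)['ep'])    # 2
```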
Usage Examples
SmoothQuant Model
from lmdeploy import pipeline, PytorchEngineConfig
# SmoothQuant W8A8 requires PyTorch backend
backend_config = PytorchEngineConfig(
tp=2,
session_len=4096,
cache_max_entry_count=0.8
)
pipe = pipeline('./smoothquant_model', backend_config=backend_config)
LoRA Adapter Serving
from lmdeploy import pipeline, PytorchEngineConfig
backend_config = PytorchEngineConfig(
tp=1,
adapters={
"sql_expert": "/path/to/sql_lora",
"code_expert": "/path/to/code_lora"
}
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
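Non-CUDA Deployment
The same config targets other accelerators via device_type; this is a sketch for Huawei Ascend NPUs (the model name is illustrative, and eager_mode is enabled here as a conservative assumption for platforms where graph capture may be unavailable):

```python
from lmdeploy import pipeline, PytorchEngineConfig

# device_type selects the accelerator backend; 'ascend' targets Huawei NPUs.
backend_config = PytorchEngineConfig(
    device_type='ascend',
    eager_mode=True,  # fall back to eager execution
    tp=1
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
```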
Related Pages
Implements Principle
Requires Environment
Uses Heuristic