Implementation: InternLM LMDeploy PytorchEngineConfig
| Knowledge Sources | Details |
|---|---|
| Domains | LLM_Inference, Configuration |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A concrete configuration class for the PyTorch inference backend provided by the LMDeploy library.
Description
PytorchEngineConfig is a Python dataclass that parameterizes the PyTorch-based inference engine. It supports data parallelism, expert parallelism, LoRA adapter serving, and deployment on platforms beyond NVIDIA CUDA.
Usage
Use this class when deploying models on non-NVIDIA hardware, running SmoothQuant (W8A8) quantized models, serving LoRA adapters, or loading a model architecture that TurboMind does not support.
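For example, the PyTorch engine can also be selected from the command line when launching an OpenAI-compatible server; this is a sketch (the model name is illustrative), with CLI flags mirroring PytorchEngineConfig fields such as tp and session_len:

```shell
# --backend pytorch selects the PyTorch engine instead of TurboMind.
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --backend pytorch \
    --tp 1 \
    --session-len 4096
```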
Code Reference
Source Location
- Repository: lmdeploy
- File: lmdeploy/messages.py
- Lines: L297-442
Signature
@dataclass
class PytorchEngineConfig:
dtype: str = 'auto' # 'auto', 'float16', 'bfloat16'
tp: int = 1 # Tensor parallelism
dp: int = 1 # Data parallelism
dp_rank: int = 0 # DP rank
ep: int = 1 # Expert parallelism (MoE)
session_len: int = None # Max sequence length
max_batch_size: int = None # Max concurrent requests
cache_max_entry_count: float = 0.8 # KV cache memory fraction
prefill_interval: int = 16 # Prefill scheduling interval
block_size: int = 64 # Paging cache block size
num_cpu_blocks: int = 0 # CPU offload blocks
num_gpu_blocks: int = 0 # GPU cache blocks (0=auto)
adapters: Dict[str, str] = None # LoRA adapter paths
max_prefill_token_num: int = 4096 # Tokens per prefill iteration
thread_safe: bool = False # Thread-safe mode
enable_prefix_caching: bool = False # Prefix caching
device_type: str = 'cuda' # 'cuda', 'ascend', 'maca', 'camb'
eager_mode: bool = False # Eager execution mode
download_dir: str = None # Model download directory
revision: str = None # Model version
quant_policy: Literal[0, 4, 8] = 0 # KV cache quantization
distributed_executor_backend: str = None # 'uni', 'mp', 'ray'
empty_init: bool = False # Skip weight loading
model_format: str = None # 'fp8' for FP8 models
enable_metrics: bool = True # Metrics collection
Import
from lmdeploy import PytorchEngineConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dtype | str | No | Weight precision: 'auto', 'float16', 'bfloat16' |
| tp | int | No | Tensor parallelism GPUs (default: 1) |
| dp | int | No | Data parallelism groups (default: 1) |
| ep | int | No | Expert parallelism for MoE (default: 1) |
| session_len | int | No | Max sequence length |
| cache_max_entry_count | float | No | KV cache GPU memory fraction (default: 0.8) |
| adapters | Dict[str, str] | No | LoRA adapter name-to-path mapping |
| device_type | str | No | Target device: 'cuda', 'ascend', 'maca', 'camb' |
Outputs
| Name | Type | Description |
|---|---|---|
| PytorchEngineConfig | dataclass | Validated configuration for PyTorch backend |
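Since the config is an ordinary dataclass, unset fields keep their defaults and an instance can be constructed and inspected without a GPU. A minimal sketch (the parallelism values are illustrative, not a recommendation):

```python
from dataclasses import asdict
from lmdeploy import PytorchEngineConfig

# Example: a MoE model sharded across 2 GPUs with expert parallelism.
cfg = PytorchEngineConfig(tp=2, ep=2, cache_max_entry_count=0.6)

# Fields not passed explicitly keep their dataclass defaults.
print(cfg.dtype)            # 'auto'
print(asdict(cfg)['ep'])    # 2
```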
Usage Examples
SmoothQuant Model
from lmdeploy import pipeline, PytorchEngineConfig
# SmoothQuant W8A8 requires PyTorch backend
backend_config = PytorchEngineConfig(
tp=2,
session_len=4096,
cache_max_entry_count=0.8
)
pipe = pipeline('./smoothquant_model', backend_config=backend_config)
LoRA Adapter Serving
from lmdeploy import pipeline, PytorchEngineConfig
backend_config = PytorchEngineConfig(
tp=1,
adapters={
"sql_expert": "/path/to/sql_lora",
"code_expert": "/path/to/code_lora"
}
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
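Non-CUDA Deployment
The same config targets other accelerators via device_type; this is a sketch for Huawei Ascend NPUs (the model name is illustrative, and eager_mode is enabled here as a conservative assumption for platforms where graph capture may be unavailable):

```python
from lmdeploy import pipeline, PytorchEngineConfig

# device_type selects the accelerator backend; 'ascend' targets Huawei NPUs.
backend_config = PytorchEngineConfig(
    device_type='ascend',
    eager_mode=True,  # fall back to eager execution
    tp=1
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
```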
Related Pages
Implements Principle
Requires Environment
Uses Heuristic