Implementation: InternLM LMDeploy TurbomindEngineConfig
| Metadata | Value |
|---|---|
| Domains | LLM_Inference, Configuration |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Concrete tool for configuring the TurboMind C++/CUDA inference backend provided by the LMDeploy library.
Description
The TurbomindEngineConfig is a Python dataclass that parameterizes the TurboMind inference engine. It controls GPU memory allocation, tensor parallelism, model format detection, and quantization settings for high-performance LLM inference on NVIDIA GPUs.
Usage
Import this class when you need to configure TurboMind backend settings such as tensor parallelism, KV cache allocation, model format (hf/awq/gptq), or session length before creating an inference pipeline.
Code Reference
Source Location
- Repository: lmdeploy
- File: lmdeploy/messages.py
- Lines: L183-295
Signature
@dataclass
class TurbomindEngineConfig:
model_format: Optional[str] = None # 'hf', 'awq', 'gptq', None
tp: int = 1 # Tensor parallelism GPU count
session_len: Optional[int] = None # Max sequence length
max_batch_size: Optional[int] = None # Max concurrent requests
cache_max_entry_count: float = 0.8 # KV cache GPU memory fraction
cache_block_seq_len: int = 64 # Paging cache block size
enable_prefix_caching: bool = False # Token-level prefix caching
quant_policy: int = 0 # KV cache quant: 0, 4, or 8
rope_scaling_factor: float = 0.0 # RoPE scaling for long context
use_logn_attn: bool = False # LogN attention scaling
download_dir: Optional[str] = None # Model download directory
revision: Optional[str] = None # Model version/branch
max_prefill_token_num: int = 8192 # Max tokens per prefill iteration
num_tokens_per_iter: int = 0 # Tokens per decode iteration
max_prefill_iters: int = 1 # Max prefill iterations
dtype: str = 'auto' # 'auto', 'float16', 'bfloat16'
Import
from lmdeploy import TurbomindEngineConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_format | Optional[str] | No | Weight format: 'hf', 'awq', 'gptq', or None for auto-detect |
| tp | int | No | Number of GPUs for tensor parallelism (default: 1) |
| session_len | Optional[int] | No | Maximum sequence length (default: from model config) |
| max_batch_size | Optional[int] | No | Maximum concurrent requests (default determined at runtime) |
| cache_max_entry_count | float | No | Fraction of free GPU memory for KV cache (default: 0.8) |
| dtype | str | No | Weight precision: 'auto', 'float16', 'bfloat16' |
| quant_policy | int | No | KV cache quantization: 0 (none), 4 (INT4), 8 (INT8) |
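To build intuition for what `cache_max_entry_count` buys you, the following is a back-of-the-envelope estimate of how many paged KV cache blocks fit in the reserved memory fraction. This is illustrative arithmetic only, not TurboMind's exact allocator; the layer, head, and dimension values are hypothetical numbers for a 7B-class model.

```python
def estimate_kv_cache_blocks(free_gpu_bytes: int,
                             cache_max_entry_count: float = 0.8,
                             cache_block_seq_len: int = 64,
                             num_layers: int = 32,
                             num_kv_heads: int = 8,
                             head_dim: int = 128,
                             bytes_per_elem: int = 2) -> int:
    """Rough estimate of how many KV cache blocks fit in the budget.

    Each block stores K and V for `cache_block_seq_len` tokens across
    all layers. Illustrative arithmetic, not TurboMind's real accounting.
    """
    budget = int(free_gpu_bytes * cache_max_entry_count)
    # Factor of 2 covers the separate K and V tensors per layer.
    bytes_per_block = (2 * num_layers * num_kv_heads * head_dim
                       * cache_block_seq_len * bytes_per_elem)
    return budget // bytes_per_block

# Example: 40 GiB of free GPU memory with the default 0.8 fraction
blocks = estimate_kv_cache_blocks(40 * 1024**3)
total_cacheable_tokens = blocks * 64  # tokens across all sessions
```

Raising `quant_policy` to 8 (INT8 KV cache) would roughly halve `bytes_per_elem`, doubling the token capacity under the same budget.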
Outputs
| Name | Type | Description |
|---|---|---|
| TurbomindEngineConfig | dataclass | Validated configuration instance passed to pipeline() |
Usage Examples
Basic Configuration
from lmdeploy import pipeline, TurbomindEngineConfig
# Configure for 2-GPU tensor parallelism with 80% KV cache
backend_config = TurbomindEngineConfig(
tp=2,
cache_max_entry_count=0.8,
session_len=4096
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
AWQ Quantized Model
from lmdeploy import pipeline, TurbomindEngineConfig
# Load AWQ-quantized model
backend_config = TurbomindEngineConfig(
model_format='awq',
tp=1,
cache_max_entry_count=0.9 # More cache since model is smaller
)
pipe = pipeline('./quantized_model', backend_config=backend_config)
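KV Cache Quantization for Long Context
A sketch combining INT8 KV cache quantization (`quant_policy=8`) with an extended session length. The model name is reused from the basic example; the specific `session_len` and cache fraction are illustrative values, and actual memory savings depend on the model.

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# INT8 KV cache roughly halves KV memory per token, leaving room
# for a longer context window at the same cache budget.
backend_config = TurbomindEngineConfig(
    quant_policy=8,            # 0 = none, 4 = INT4, 8 = INT8 KV cache
    session_len=32768,         # extended context (illustrative value)
    cache_max_entry_count=0.7  # leave headroom for activations
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
```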