
Implementation:InternLM Lmdeploy TurbomindEngineConfig

From Leeroopedia


Knowledge Sources
Domains LLM_Inference, Configuration
Last Updated 2026-02-07 15:00 GMT

Overview

A concrete tool for configuring the TurboMind C++/CUDA inference backend provided by the LMDeploy library.

Description

The TurbomindEngineConfig is a Python dataclass that parameterizes the TurboMind inference engine. It controls GPU memory allocation, tensor parallelism, model format detection, and quantization settings for high-performance LLM inference on NVIDIA GPUs.

Usage

Import this class when you need to configure TurboMind backend settings such as tensor parallelism, KV cache allocation, model format (hf/awq/gptq), or session length before creating an inference pipeline.

Code Reference

Source Location

  • Repository: lmdeploy
  • File: lmdeploy/messages.py
  • Lines: L183-295

Signature

@dataclass
class TurbomindEngineConfig:
    model_format: Optional[str] = None    # 'hf', 'awq', 'gptq', None
    tp: int = 1                           # Tensor parallelism GPU count
    session_len: Optional[int] = None     # Max sequence length
    max_batch_size: Optional[int] = None  # Max concurrent requests
    cache_max_entry_count: float = 0.8    # KV cache GPU memory fraction
    cache_block_seq_len: int = 64         # Paging cache block size
    enable_prefix_caching: bool = False   # Token-level prefix caching
    quant_policy: int = 0                 # KV cache quant: 0, 4, or 8
    rope_scaling_factor: float = 0.0      # RoPE scaling for long context
    use_logn_attn: bool = False           # LogN attention scaling
    download_dir: Optional[str] = None    # Model download directory
    revision: Optional[str] = None        # Model version/branch
    max_prefill_token_num: int = 8192     # Max tokens per prefill iteration
    num_tokens_per_iter: int = 0          # Tokens per decode iteration
    max_prefill_iters: int = 1            # Max prefill iterations
    dtype: str = 'auto'                   # 'auto', 'float16', 'bfloat16'
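To make the role of cache_max_entry_count concrete: per the field comment above, it is a fraction of free GPU memory reserved for the KV cache. The arithmetic can be sketched in plain Python; the free-memory figure below is an assumption for illustration, not a real device query:

```python
def kv_cache_budget_bytes(free_gpu_mem_bytes: int, cache_max_entry_count: float) -> int:
    """Approximate KV-cache budget: a fraction of currently free GPU memory."""
    if not 0.0 < cache_max_entry_count <= 1.0:
        raise ValueError("cache_max_entry_count must be in (0, 1]")
    return int(free_gpu_mem_bytes * cache_max_entry_count)

# Hypothetical 24 GiB card with 20 GiB free after weights are loaded
free = 20 * 1024**3
budget = kv_cache_budget_bytes(free, 0.8)  # default fraction of 0.8
print(budget)
```

Raising the fraction leaves less headroom for activations and CUDA graphs, which is why the AWQ example below bumps it to 0.9 only because the quantized weights themselves occupy less memory.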

Import

from lmdeploy import TurbomindEngineConfig

I/O Contract

Inputs

Name | Type | Required | Description
model_format | Optional[str] | No | Weight format: 'hf', 'awq', 'gptq', or None for auto-detect
tp | int | No | Number of GPUs for tensor parallelism (default: 1)
session_len | Optional[int] | No | Maximum sequence length (default: read from the model config)
max_batch_size | Optional[int] | No | Maximum number of concurrent requests
cache_max_entry_count | float | No | Fraction of free GPU memory for the KV cache (default: 0.8)
dtype | str | No | Weight precision: 'auto', 'float16', or 'bfloat16'
quant_policy | int | No | KV cache quantization: 0 (none), 4 (INT4), or 8 (INT8)

Outputs

Name | Type | Description
TurbomindEngineConfig | dataclass | Validated configuration instance passed to pipeline()
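The quant_policy values trade KV-cache precision for capacity: relative to an FP16 baseline, INT8 (policy 8) halves and INT4 (policy 4) quarters the per-token cache footprint. A back-of-the-envelope sketch, where the model dimensions are illustrative assumptions rather than values for any specific checkpoint:

```python
BYTES_PER_ELEMENT = {0: 2.0, 8: 1.0, 4: 0.5}  # fp16 baseline, INT8, INT4

def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, quant_policy: int) -> float:
    """K and V cache bytes per token, summed across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * BYTES_PER_ELEMENT[quant_policy]

# Hypothetical 7B-class model: 32 layers, 8 KV heads, head dim 128
fp16 = kv_bytes_per_token(32, 8, 128, 0)
int8 = kv_bytes_per_token(32, 8, 128, 8)
print(fp16 / int8)  # → 2.0: INT8 fits twice as many tokens in the same budget
```

In other words, for a fixed cache_max_entry_count, switching quant_policy from 0 to 8 roughly doubles the number of cached tokens, at some cost in generation quality.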

Usage Examples

Basic Configuration

from lmdeploy import pipeline, TurbomindEngineConfig

# Configure for 2-GPU tensor parallelism with 80% KV cache
backend_config = TurbomindEngineConfig(
    tp=2,
    cache_max_entry_count=0.8,
    session_len=4096
)

pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)

AWQ Quantized Model

from lmdeploy import pipeline, TurbomindEngineConfig

# Load AWQ-quantized model
backend_config = TurbomindEngineConfig(
    model_format='awq',
    tp=1,
    cache_max_entry_count=0.9  # More cache since model is smaller
)

pipe = pipeline('./quantized_model', backend_config=backend_config)
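Long-Context Configuration

For long-context inference, session_len and rope_scaling_factor are typically raised together. The values below are an illustrative sketch, not tuned recommendations; check the model card for the scaling factors a given checkpoint supports:

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Extend usable context via RoPE scaling; values are illustrative
backend_config = TurbomindEngineConfig(
    session_len=32768,          # raise the maximum sequence length
    rope_scaling_factor=2.0,    # RoPE scaling for beyond-training-length context
    cache_max_entry_count=0.6,  # smaller cache fraction leaves headroom for long sequences
)

pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=backend_config)
```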

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
