Principle: InternLM LMDeploy Backend Auto Selection
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Architecture_Detection |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
An automatic detection mechanism that selects the optimal inference backend (TurboMind or PyTorch) based on model architecture, quantization format, and hardware constraints.
Description
Backend Auto Selection solves the problem of routing models to the correct inference engine without requiring users to know the internal capabilities of each backend. The decision logic considers:
- Model architecture: TurboMind supports a curated list of architectures (LLaMA, InternLM, Qwen, Mistral, etc.); unsupported models fall back to PyTorch
- Quantization format: AWQ/GPTQ models use TurboMind; SmoothQuant models require PyTorch
- Hardware platform: Non-CUDA platforms (Ascend, Cambricon) must use PyTorch
- Vision-language models: VLMs are detected via architecture class names and use VLAsyncEngine
The system reads the model's HuggingFace config to extract the architecture class, then looks up a mapping to determine backend support.
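A minimal sketch of that config-reading step, assuming a HuggingFace-style `config.json` on disk (the helper name is hypothetical, not LMDeploy's actual internals):

```python
import json
import os

def read_architecture(model_dir: str) -> str:
    """Return the first architecture class name (e.g. 'LlamaForCausalLM')
    from a HuggingFace-style config.json.

    Illustrative helper only; LMDeploy's real implementation differs.
    """
    with open(os.path.join(model_dir, 'config.json')) as f:
        cfg = json.load(f)
    # 'architectures' is a list; the first entry names the model class
    return cfg.get('architectures', ['unknown'])[0]
```

The returned class name (for example `InternLM2ForCausalLM`) is then matched against the table of TurboMind-supported architectures.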
Usage
Backend selection runs automatically during pipeline initialization. To override it, pass a backend_config of the desired type (TurbomindEngineConfig or PytorchEngineConfig) explicitly.
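For example, forcing the PyTorch engine looks roughly like this (the model name is illustrative; requires lmdeploy to be installed):

```python
from lmdeploy import pipeline, PytorchEngineConfig

# Passing an explicit engine config bypasses auto-selection entirely;
# here the PyTorch engine is used even if TurboMind supports the model.
pipe = pipeline('internlm/internlm2-chat-7b',
                backend_config=PytorchEngineConfig())
```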
Theoretical Basis
Backend selection uses a Strategy Pattern with architecture-based dispatch:
```python
# Abstract selection algorithm (simplified)
def select_backend(model_config, user_config):
    arch = model_config.architectures[0]
    # An explicit engine config always overrides auto-detection
    if isinstance(user_config, TurbomindEngineConfig):
        return 'turbomind'
    if isinstance(user_config, PytorchEngineConfig):
        return 'pytorch'
    if arch in TURBOMIND_SUPPORTED:
        return 'turbomind'
    return 'pytorch'  # fallback for unsupported architectures
```
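The abstract dispatch above omits the quantization and hardware rules listed in the Description. A fuller sketch that folds them in (all names are stand-ins for illustration, not LMDeploy's actual internals):

```python
# Illustrative supported-architecture set; the real list is longer.
TURBOMIND_SUPPORTED = {'LlamaForCausalLM', 'InternLM2ForCausalLM',
                       'Qwen2ForCausalLM', 'MistralForCausalLM'}

def select_backend_full(arch, quant_method=None, device='cuda'):
    """Hypothetical sketch combining the architecture, quantization,
    and hardware rules from the Description."""
    if device != 'cuda':
        # Non-CUDA platforms (Ascend, Cambricon, ...) must use PyTorch
        return 'pytorch'
    if quant_method == 'smooth_quant':
        # SmoothQuant models require the PyTorch engine
        return 'pytorch'
    # AWQ/GPTQ and unquantized models use TurboMind when the
    # architecture is supported; otherwise fall back to PyTorch
    return 'turbomind' if arch in TURBOMIND_SUPPORTED else 'pytorch'
```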