Heuristic:InternLM Lmdeploy Backend Selection Strategy

From Leeroopedia

Knowledge Sources
Domains: Architecture, Inference
Last Updated: 2026-02-07 15:00 GMT

Overview

Decision framework for choosing between TurboMind (C++ CUDA) and PyTorch inference backends based on model support, hardware, and feature requirements.

Description

LMDeploy provides two inference backends: TurboMind (optimized C++ with custom CUDA kernels) and PyTorch (flexible Python-based with Triton kernels). The `autoget_backend()` function automatically selects the best backend based on model architecture support and TurboMind installation status. Understanding when each backend is preferred helps users make informed deployment decisions.

Usage

Use this heuristic when:

  • You see the warning `Fallback to pytorch engine` and want to understand why.
  • You need to choose a backend explicitly for a deployment.
  • You are deciding which models to deploy on a given infrastructure.
  • You are experiencing performance differences between backends.
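
To choose a backend explicitly rather than rely on automatic selection, pass a backend-specific engine config to `pipeline`. A sketch of that deployment configuration (the model name is a placeholder; adjust paths and options to your setup):

```python
from lmdeploy import pipeline, TurbomindEngineConfig, PytorchEngineConfig

# Force TurboMind, e.g. for an AWQ-quantized model on CUDA:
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=TurbomindEngineConfig(model_format='awq'))

# Force PyTorch, e.g. for non-CUDA hardware, W8A8, or LoRA adapters:
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=PytorchEngineConfig())
```

Supplying either engine config bypasses `autoget_backend()` entirely, so no fallback warning is emitted.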

The Insight (Rule of Thumb)

  • Prefer TurboMind when:
    • The model architecture is supported (check `lmdeploy.turbomind.supported_models`).
    • Maximum inference speed is the priority (custom CUDA kernels are faster).
    • Using AWQ quantized models (`model_format='awq'`).
    • Deploying on CUDA-only infrastructure.
  • Prefer PyTorch when:
    • The model is not supported by TurboMind (automatic fallback occurs).
    • Running on non-CUDA hardware (Ascend, MACA, Cambricon, ROCm).
    • Using W8A8 SmoothQuant quantization (PyTorch only).
    • Using LoRA adapters at serving time.
    • Working with very new or custom model architectures.
  • Fallback behavior: If TurboMind is not installed or does not support the model, LMDeploy automatically falls back to PyTorch with a warning.
  • Trade-off: TurboMind has higher performance but narrower model support. PyTorch has broader support but slightly lower throughput.

Reasoning

TurboMind uses hand-tuned CUDA kernels compiled at build time, which outperform general-purpose PyTorch/Triton kernels for supported architectures. However, adding new model support to TurboMind requires C++ development, so new models are typically supported in the PyTorch backend first.

The fallback mechanism ensures users never encounter a hard failure. Two fallback scenarios exist: TurboMind is not installed correctly, or TurboMind is installed but does not support the model architecture.

Code evidence from `lmdeploy/archs.py:13-55` (lightly annotated; `Literal` and `logger` are defined at module level in the original file):

def autoget_backend(model_path: str) -> Literal['turbomind', 'pytorch']:
    turbomind_has = False
    is_turbomind_installed = True
    try:
        from lmdeploy.turbomind.supported_models import is_supported
        turbomind_has = is_supported(model_path)
    except ImportError:
        # Scenario 1: TurboMind is not installed (or failed to import).
        is_turbomind_installed = False

    if is_turbomind_installed:
        if not turbomind_has:
            # Scenario 2: TurboMind is installed but lacks this architecture.
            logger.warning(
                'Fallback to pytorch engine because '
                f'`{model_path}` not supported by turbomind engine.')
    else:
        logger.warning(
            'Fallback to pytorch engine because turbomind engine '
            'is not installed correctly.')

    backend = 'turbomind' if turbomind_has else 'pytorch'
    return backend
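
The two branches can be exercised with a stand-alone simulation that mirrors the logic above, stubbing out the `is_supported` import with explicit parameters (this is a sketch, not the real LMDeploy module):

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger('lmdeploy-sketch')

def autoget_backend_sketch(model_path: str,
                           turbomind_installed: bool,
                           supported_models: set[str]) -> str:
    """Mirror of autoget_backend(), with the import and support
    check replaced by explicit parameters for demonstration."""
    turbomind_has = False
    if turbomind_installed:
        # Stands in for lmdeploy.turbomind.supported_models.is_supported().
        turbomind_has = model_path in supported_models
        if not turbomind_has:
            logger.warning(f'Fallback to pytorch engine because `{model_path}` '
                           'not supported by turbomind engine.')
    else:
        logger.warning('Fallback to pytorch engine because turbomind engine '
                       'is not installed correctly.')
    return 'turbomind' if turbomind_has else 'pytorch'

supported = {'internlm2'}
print(autoget_backend_sketch('internlm2', True, supported))   # turbomind
print(autoget_backend_sketch('new-arch', True, supported))    # pytorch (scenario 2)
print(autoget_backend_sketch('internlm2', False, supported))  # pytorch (scenario 1)
```

Note the asymmetry in the real function: when TurboMind is missing entirely, the fallback happens regardless of whether the model would have been supported.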
