
Environment: InternLM LMDeploy CUDA GPU Runtime

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, GPU_Acceleration
Last Updated: 2026-02-07 15:00 GMT

Overview

Linux environment with NVIDIA CUDA GPU (compute capability >= 7.0), CUDA Toolkit 11+, and PyTorch 2.0+ for running LMDeploy inference engines.

Description

This environment provides the core GPU-accelerated runtime required by both TurboMind (C++) and PyTorch inference backends in LMDeploy. It requires an NVIDIA GPU with CUDA support. The minimum supported compute capability is SM 7.0 (Volta/V100), with newer architectures (Ampere, Hopper, Blackwell) unlocking additional features like BFloat16, FP8 quantization, and FlashAttention-3. The TurboMind backend compiles CUDA kernels targeting specific architectures at build time, while the PyTorch backend leverages Triton JIT compilation.
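The capability gates described above (SM 7.0 minimum, BFloat16 at SM 80, FP8 and FlashAttention-3 at SM 90) can be sketched as a small lookup. This helper is illustrative, not part of LMDeploy; the feature names are our labels for the thresholds stated in this section:

```python
# Hypothetical helper (not an LMDeploy API): map a CUDA compute
# capability to the features this page associates with it.
def supported_features(major: int, minor: int) -> set:
    """Return the feature set unlocked at a given SM version."""
    sm = major * 10 + minor
    features = set()
    if sm >= 70:                      # Volta/V100: minimum supported
        features.add("fp16")
    if sm >= 80:                      # Ampere: BF16 (also needs CUDA >= 11)
        features.add("bf16")
    if sm >= 90:                      # Hopper: FP8, FlashAttention-3
        features.update({"fp8", "flash_attn_3"})
    return features
```

For example, an A100 (SM 8.0) passes the BF16 gate but not the FP8 gate, which matches the `props.major < 9` check quoted under Code Evidence below.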

Usage

Use this environment for any LMDeploy inference workflow including offline batch inference, API server deployment, AWQ/SmoothQuant quantization, and VLM pipelines. All five documented workflows require CUDA GPU access. The PyTorch backend also supports alternative devices (Ascend, MACA, Cambricon), but the TurboMind backend is CUDA-only.
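As a sketch of the workflows above via the `lmdeploy` CLI (the model name is illustrative; any HuggingFace-hosted chat model works):

```shell
# Offline interactive chat (TurboMind backend by default on CUDA)
lmdeploy chat internlm/internlm2_5-7b-chat

# OpenAI-compatible API server
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

# AWQ 4-bit quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-4bit
```

All three commands assume a CUDA GPU is visible to the process (check with `nvidia-smi`).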

System Requirements

  • OS: Linux (Ubuntu 20.04+ recommended). Windows is supported for the PyTorch backend only (TurboMind builds disabled).
  • Hardware: NVIDIA GPU with compute capability >= 7.0. Examples: V100 (SM70), 2080 (SM75), A100 (SM80), 3090 (SM86), 4090 (SM89), H100 (SM90), 5090 (SM120).
  • VRAM: minimum 8GB. Model-dependent; 7B models need ~14GB in FP16, quantized models need less.
  • CUDA Toolkit: >= 11.0. BFloat16 requires CUDA >= 11 and SM >= 80; FP8 requires SM >= 90.
  • Disk: 20GB+ SSD, for model weights and the CUDA toolkit.
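The ~14GB FP16 figure for a 7B model follows from bytes-per-parameter arithmetic. A back-of-envelope sketch (our helper, not an LMDeploy API; KV cache and activations come on top of this):

```python
# Rough VRAM estimate for model weights alone. Illustrative only:
# the KV cache, activations, and CUDA context add further overhead.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_vram_gb(n_params_billion: float, dtype: str = "fp16") -> float:
    """Gigabytes needed just to hold the weights."""
    return n_params_billion * BYTES_PER_PARAM[dtype]
```

For example, `weight_vram_gb(7)` gives 14.0 GB, matching the table, while an AWQ 4-bit variant of the same model needs only 3.5 GB for weights.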

Dependencies

System Packages

  • NVIDIA GPU Driver (compatible with CUDA toolkit version)
  • CUDA Toolkit >= 11.0 (for TurboMind build)
  • `nvidia-nccl` (multi-GPU tensor parallelism)
  • `nvidia-cuda-runtime` (CUDA runtime libraries)
  • `nvidia-cublas` (matrix operations)
  • `nvidia-curand` (random number generation)

Python Packages

  • `torch` >= 2.0.0, <= 2.8.0
  • `torchvision` >= 0.15.0, <= 0.23.0
  • `triton` >= 3.0.0, <= 3.4.0 (Linux x86_64 only, required for PyTorch backend kernels)
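A pre-flight check for these pinned ranges can be sketched as an inclusive version comparison. This helper is ours, not shipped with LMDeploy, and assumes plain dotted numeric versions (strip local suffixes like `+cu121` before comparing):

```python
# Illustrative version-range check for the pins above.
# Assumes a plain "X.Y.Z" string; local tags (e.g. "+cu121") must be
# stripped by the caller first.
def in_range(version: str, low: str, high: str) -> bool:
    """Inclusive comparison on dotted numeric version strings."""
    def parse(v: str) -> tuple:
        return tuple(int(p) for p in v.split(".")[:3])
    return parse(low) <= parse(version) <= parse(high)
```

For example, `in_range("2.4.1", "2.0.0", "2.8.0")` is true, while torch 2.9.0 would fall outside the supported window.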

Credentials

No credentials required for the core CUDA runtime. Model downloading may require:

  • `HF_TOKEN`: HuggingFace API token for gated models (set via `huggingface-cli login`).
  • `LMDEPLOY_USE_MODELSCOPE`: Set to `'True'` to download models from ModelScope instead of HuggingFace.
  • `LMDEPLOY_USE_OPENMIND_HUB`: Set to `'True'` to download from OpenMind Hub.
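A minimal shell setup for the credentials and hub switches above (`<your-token>` is a placeholder):

```shell
# Gated HuggingFace models: authenticate once, or export a token
huggingface-cli login            # stores the token locally
export HF_TOKEN=<your-token>     # alternative: environment variable

# Route model downloads through ModelScope instead of HuggingFace
export LMDEPLOY_USE_MODELSCOPE=True

# Or through OpenMind Hub
export LMDEPLOY_USE_OPENMIND_HUB=True
```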

Quick Install

# Install LMDeploy with CUDA support (pre-built wheel)
pip install lmdeploy

# Or install with all optional dependencies
pip install 'lmdeploy[all]'  # quoted so shells like zsh do not glob the brackets

# For building from source with TurboMind
pip install -r requirements/build.txt
pip install -r requirements/runtime_cuda.txt

Code Evidence

CUDA availability check from `lmdeploy/pytorch/check_env/cuda.py:12-22`:

def check(self):
    """check."""
    import torch

    if not torch.cuda.is_available():
        self.log_and_exit(mod_name='CUDA', message='cuda is not available.')

    if self.model_format == 'fp8':
        props = torch.cuda.get_device_properties(0)
        if props.major < 9:
            self.log_and_exit(mod_name='CUDA', message='model_format=fp8 requires sm>=9.0.')

BFloat16 support detection from `lmdeploy/utils.py:389-405`:

def is_bf16_supported(device_type: str = 'cuda'):
    if device_type == 'cuda':
        import torch
        device = torch.cuda.current_device()
        cuda_version = torch.version.cuda
        if (cuda_version is not None and int(cuda_version.split('.')[0]) >= 11
                and torch.cuda.get_device_properties(device).major >= 8):
            return True
        else:
            return False

CUDA architecture targets from `CMakeLists.txt:226-253`:

if (NOT CMAKE_CUDA_ARCHITECTURES)
  set(CMAKE_CUDA_ARCHITECTURES "")
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_LESS "13.0")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 70-real 75-real)  # V100, 2080
  endif()
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "11")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 80-real) # A100
  endif ()
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "12.0")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 90a-real) # H100
  endif ()
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "12.8")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 120a-real) # 5090
  endif ()
endif ()

Common Errors

  • `cuda is not available.` — No NVIDIA GPU detected, or CUDA drivers not installed. Install NVIDIA drivers and the CUDA toolkit; verify with `nvidia-smi`.
  • `model_format=fp8 requires sm>=9.0.` — FP8 quantization attempted on a pre-Hopper GPU. Use an H100/H200 (SM90+) or switch to INT4/INT8 quantization.
  • `RuntimeError: CUDA out of memory` — Insufficient GPU VRAM for the model weights plus KV cache. Reduce `cache_max_entry_count` (e.g., to 0.2), use quantization, or increase tensor parallelism (`tp`).
  • `Fallback to pytorch engine because turbomind engine is not installed correctly` — The TurboMind C++ extension was not built or installed. Reinstall lmdeploy from PyPI or build from source with CUDA.
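The out-of-memory mitigations can be expressed as an engine configuration. A sketch of this config fragment, assuming the documented `TurbomindEngineConfig` fields (it needs a CUDA machine with lmdeploy installed to actually run):

```python
# Config-fragment sketch of the OOM mitigations above; requires
# lmdeploy and a CUDA GPU at runtime.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    cache_max_entry_count=0.2,  # shrink the KV cache's share of free VRAM (default 0.8)
    tp=2,                       # shard weights across 2 GPUs
)
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
```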

Compatibility Notes

  • Windows: TurboMind backend is disabled on Windows (`DISABLE_TURBOMIND`). Only PyTorch backend available. Multi-GPU also disabled on Windows.
  • aarch64 (ARM/Jetson): Supported with SM72 and SM87 architectures. Triton is not available on ARM.
  • MSVC (Windows build): SM80 and SM90a architectures are excluded from MSVC builds.
  • FP8 Models: Require Hopper or newer GPUs (SM >= 9.0) and CUDA >= 12.0.
  • FlashAttention-3: Requires SM 9.0 and CUDA >= 12.3 (`flash_attn_interface` package).
  • BFloat16: Requires CUDA >= 11 and compute capability >= 8.0 (Ampere+).
  • Alternative Devices: PyTorch backend supports `ascend` (Huawei NPU), `maca` (MetaX), `camb` (Cambricon MLU) via the `dlinfer` framework.
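The backend/device matrix implied by the notes above (TurboMind is CUDA-only; the alternative devices go through the PyTorch backend) can be summarized in a small table. The mapping below is our illustration, not an LMDeploy data structure:

```python
# Illustrative device -> backend support table, per the notes above.
DEVICE_BACKENDS = {
    "cuda": {"turbomind", "pytorch"},
    "ascend": {"pytorch"},   # Huawei NPU via dlinfer
    "maca": {"pytorch"},     # MetaX via dlinfer
    "camb": {"pytorch"},     # Cambricon MLU via dlinfer
}

def backends_for(device_type: str) -> set:
    """Backends available for a device type; empty set if unsupported."""
    return DEVICE_BACKENDS.get(device_type, set())
```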
