
Environment: InternLM LMDeploy CUDA GPU Runtime

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, GPU_Acceleration
Last Updated: 2026-02-07 15:00 GMT

Overview

Linux environment with NVIDIA CUDA GPU (compute capability >= 7.0), CUDA Toolkit 11+, and PyTorch 2.0+ for running LMDeploy inference engines.

Description

This environment provides the core GPU-accelerated runtime required by both TurboMind (C++) and PyTorch inference backends in LMDeploy. It requires an NVIDIA GPU with CUDA support. The minimum supported compute capability is SM 7.0 (Volta/V100), with newer architectures (Ampere, Hopper, Blackwell) unlocking additional features like BFloat16, FP8 quantization, and FlashAttention-3. The TurboMind backend compiles CUDA kernels targeting specific architectures at build time, while the PyTorch backend leverages Triton JIT compilation.
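The capability gates described above (SM 7.0 minimum, BFloat16 at SM 80, FP8 and FlashAttention-3 at SM 90) can be sketched as a small lookup. This helper is illustrative, not part of LMDeploy; the feature names are our labels for the thresholds stated in this section:

```python
# Hypothetical helper (not an LMDeploy API): map a CUDA compute
# capability to the features this page associates with it.
def supported_features(major: int, minor: int) -> set:
    """Return the feature set unlocked at a given SM version."""
    sm = major * 10 + minor
    features = set()
    if sm >= 70:                      # Volta/V100: minimum supported
        features.add("fp16")
    if sm >= 80:                      # Ampere: BF16 (also needs CUDA >= 11)
        features.add("bf16")
    if sm >= 90:                      # Hopper: FP8, FlashAttention-3
        features.update({"fp8", "flash_attn_3"})
    return features
```

For example, an A100 (SM 8.0) passes the BF16 gate but not the FP8 gate, which matches the `props.major < 9` check quoted under Code Evidence below.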

Usage

Use this environment for any LMDeploy inference workflow including offline batch inference, API server deployment, AWQ/SmoothQuant quantization, and VLM pipelines. All five documented workflows require CUDA GPU access. The PyTorch backend also supports alternative devices (Ascend, MACA, Cambricon), but the TurboMind backend is CUDA-only.
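As a sketch of the workflows above via the `lmdeploy` CLI (the model name is illustrative; any HuggingFace-hosted chat model works):

```shell
# Offline interactive chat (TurboMind backend by default on CUDA)
lmdeploy chat internlm/internlm2_5-7b-chat

# OpenAI-compatible API server
lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333

# AWQ 4-bit quantization
lmdeploy lite auto_awq internlm/internlm2_5-7b-chat --work-dir ./internlm2_5-7b-chat-4bit
```

All three commands assume a CUDA GPU is visible to the process (check with `nvidia-smi`).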

System Requirements

  • OS: Linux (Ubuntu 20.04+ recommended). Windows is supported for the PyTorch backend only (TurboMind builds disabled).
  • Hardware: NVIDIA GPU with compute capability >= 7.0. Examples: V100 (SM70), 2080 (SM75), A100 (SM80), 3090 (SM86), 4090 (SM89), H100 (SM90), 5090 (SM120).
  • VRAM: minimum 8GB. Model-dependent; 7B models need ~14GB in FP16, quantized models need less.
  • CUDA Toolkit: >= 11.0. BFloat16 requires CUDA >= 11 and SM >= 80; FP8 requires SM >= 90.
  • Disk: 20GB+ SSD, for model weights and the CUDA toolkit.
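The ~14GB FP16 figure for a 7B model follows from bytes-per-parameter arithmetic. A back-of-envelope sketch (our helper, not an LMDeploy API; KV cache and activations come on top of this):

```python
# Rough VRAM estimate for model weights alone. Illustrative only:
# the KV cache, activations, and CUDA context add further overhead.
BYTES_PER_PARAM = {"fp16": 2, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_vram_gb(n_params_billion: float, dtype: str = "fp16") -> float:
    """Gigabytes needed just to hold the weights."""
    return n_params_billion * BYTES_PER_PARAM[dtype]
```

For example, `weight_vram_gb(7)` gives 14.0 GB, matching the table, while an AWQ 4-bit variant of the same model needs only 3.5 GB for weights.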

Dependencies

System Packages

  • NVIDIA GPU Driver (compatible with CUDA toolkit version)
  • CUDA Toolkit >= 11.0 (for TurboMind build)
  • `nvidia-nccl` (multi-GPU tensor parallelism)
  • `nvidia-cuda-runtime` (CUDA runtime libraries)
  • `nvidia-cublas` (matrix operations)
  • `nvidia-curand` (random number generation)

Python Packages

  • `torch` >= 2.0.0, <= 2.8.0
  • `torchvision` >= 0.15.0, <= 0.23.0
  • `triton` >= 3.0.0, <= 3.4.0 (Linux x86_64 only, required for PyTorch backend kernels)
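A pre-flight check for these pinned ranges can be sketched as an inclusive version comparison. This helper is ours, not shipped with LMDeploy, and assumes plain dotted numeric versions (strip local suffixes like `+cu121` before comparing):

```python
# Illustrative version-range check for the pins above.
# Assumes a plain "X.Y.Z" string; local tags (e.g. "+cu121") must be
# stripped by the caller first.
def in_range(version: str, low: str, high: str) -> bool:
    """Inclusive comparison on dotted numeric version strings."""
    def parse(v: str) -> tuple:
        return tuple(int(p) for p in v.split(".")[:3])
    return parse(low) <= parse(version) <= parse(high)
```

For example, `in_range("2.4.1", "2.0.0", "2.8.0")` is true, while torch 2.9.0 would fall outside the supported window.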

Credentials

No credentials required for the core CUDA runtime. Model downloading may require:

  • `HF_TOKEN`: HuggingFace API token for gated models (set via `huggingface-cli login`).
  • `LMDEPLOY_USE_MODELSCOPE`: Set to `'True'` to download models from ModelScope instead of HuggingFace.
  • `LMDEPLOY_USE_OPENMIND_HUB`: Set to `'True'` to download from OpenMind Hub.
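A minimal shell setup for the credentials and hub switches above (`<your-token>` is a placeholder):

```shell
# Gated HuggingFace models: authenticate once, or export a token
huggingface-cli login            # stores the token locally
export HF_TOKEN=<your-token>     # alternative: environment variable

# Route model downloads through ModelScope instead of HuggingFace
export LMDEPLOY_USE_MODELSCOPE=True

# Or through OpenMind Hub
export LMDEPLOY_USE_OPENMIND_HUB=True
```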

Quick Install

# Install LMDeploy with CUDA support (pre-built wheel)
pip install lmdeploy

# Or install with all optional dependencies
pip install 'lmdeploy[all]'  # quoted so shells like zsh do not glob the brackets

# For building from source with TurboMind
pip install -r requirements/build.txt
pip install -r requirements/runtime_cuda.txt

Code Evidence

CUDA availability check from `lmdeploy/pytorch/check_env/cuda.py:12-22`:

def check(self):
    """check."""
    import torch

    if not torch.cuda.is_available():
        self.log_and_exit(mod_name='CUDA', message='cuda is not available.')

    if self.model_format == 'fp8':
        props = torch.cuda.get_device_properties(0)
        if props.major < 9:
            self.log_and_exit(mod_name='CUDA', message='model_format=fp8 requires sm>=9.0.')

BFloat16 support detection from `lmdeploy/utils.py:389-405`:

def is_bf16_supported(device_type: str = 'cuda'):
    if device_type == 'cuda':
        import torch
        device = torch.cuda.current_device()
        cuda_version = torch.version.cuda
        if (cuda_version is not None and int(cuda_version.split('.')[0]) >= 11
                and torch.cuda.get_device_properties(device).major >= 8):
            return True
        else:
            return False

CUDA architecture targets from `CMakeLists.txt:226-253`:

if (NOT CMAKE_CUDA_ARCHITECTURES)
  set(CMAKE_CUDA_ARCHITECTURES "")
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_LESS "13.0")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 70-real 75-real)  # V100, 2080
  endif()
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "11")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 80-real) # A100
  endif ()
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "12.0")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 90a-real) # H100
  endif ()
  if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "12.8")
    list(APPEND CMAKE_CUDA_ARCHITECTURES 120a-real) # 5090
  endif ()
endif ()

Common Errors

  • `cuda is not available.` — No NVIDIA GPU detected, or CUDA drivers not installed. Install NVIDIA drivers and the CUDA toolkit; verify with `nvidia-smi`.
  • `model_format=fp8 requires sm>=9.0.` — FP8 quantization attempted on a pre-Hopper GPU. Use an H100/H200 (SM90+) or switch to INT4/INT8 quantization.
  • `RuntimeError: CUDA out of memory` — Insufficient GPU VRAM for the model weights plus KV cache. Reduce `cache_max_entry_count` (e.g., to 0.2), use quantization, or increase tensor parallelism (`tp`).
  • `Fallback to pytorch engine because turbomind engine is not installed correctly` — The TurboMind C++ extension was not built or installed. Reinstall lmdeploy from PyPI or build from source with CUDA.
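The out-of-memory mitigations can be expressed as an engine configuration. A sketch of this config fragment, assuming the documented `TurbomindEngineConfig` fields (it needs a CUDA machine with lmdeploy installed to actually run):

```python
# Config-fragment sketch of the OOM mitigations above; requires
# lmdeploy and a CUDA GPU at runtime.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    cache_max_entry_count=0.2,  # shrink the KV cache's share of free VRAM (default 0.8)
    tp=2,                       # shard weights across 2 GPUs
)
pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
```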

Compatibility Notes

  • Windows: TurboMind backend is disabled on Windows (`DISABLE_TURBOMIND`). Only PyTorch backend available. Multi-GPU also disabled on Windows.
  • aarch64 (ARM/Jetson): Supported with SM72 and SM87 architectures. Triton is not available on ARM.
  • MSVC (Windows build): SM80 and SM90a architectures are excluded from MSVC builds.
  • FP8 Models: Require Hopper or newer GPUs (SM >= 9.0) and CUDA >= 12.0.
  • FlashAttention-3: Requires SM 9.0 and CUDA >= 12.3 (`flash_attn_interface` package).
  • BFloat16: Requires CUDA >= 11 and compute capability >= 8.0 (Ampere+).
  • Alternative Devices: PyTorch backend supports `ascend` (Huawei NPU), `maca` (MetaX), `camb` (Cambricon MLU) via the `dlinfer` framework.
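The backend/device matrix implied by the notes above (TurboMind is CUDA-only; the alternative devices go through the PyTorch backend) can be summarized in a small table. The mapping below is our illustration, not an LMDeploy data structure:

```python
# Illustrative device -> backend support table, per the notes above.
DEVICE_BACKENDS = {
    "cuda": {"turbomind", "pytorch"},
    "ascend": {"pytorch"},   # Huawei NPU via dlinfer
    "maca": {"pytorch"},     # MetaX via dlinfer
    "camb": {"pytorch"},     # Cambricon MLU via dlinfer
}

def backends_for(device_type: str) -> set:
    """Backends available for a device type; empty set if unsupported."""
    return DEVICE_BACKENDS.get(device_type, set())
```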
