Environment:InternLM Lmdeploy CUDA GPU Runtime
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, GPU_Acceleration |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
Linux environment with NVIDIA CUDA GPU (compute capability >= 7.0), CUDA Toolkit 11+, and PyTorch 2.0+ for running LMDeploy inference engines.
Description
This environment provides the core GPU-accelerated runtime required by both TurboMind (C++) and PyTorch inference backends in LMDeploy. It requires an NVIDIA GPU with CUDA support. The minimum supported compute capability is SM 7.0 (Volta/V100), with newer architectures (Ampere, Hopper, Blackwell) unlocking additional features like BFloat16, FP8 quantization, and FlashAttention-3. The TurboMind backend compiles CUDA kernels targeting specific architectures at build time, while the PyTorch backend leverages Triton JIT compilation.
Usage
Use this environment for any LMDeploy inference workflow including offline batch inference, API server deployment, AWQ/SmoothQuant quantization, and VLM pipelines. All five documented workflows require CUDA GPU access. The PyTorch backend also supports alternative devices (Ascend, MACA, Cambricon), but the TurboMind backend is CUDA-only.
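The backend/device matrix described above can be summarized in a small lookup (a hypothetical helper for illustration, not part of LMDeploy's API):

```python
# Hypothetical helper mirroring the backend/device support described above.
SUPPORTED_BACKENDS = {
    'cuda': ('turbomind', 'pytorch'),  # TurboMind is CUDA-only
    'ascend': ('pytorch',),            # Huawei NPU via dlinfer
    'maca': ('pytorch',),              # MetaX
    'camb': ('pytorch',),              # Cambricon MLU
}

def backends_for(device: str):
    """Return the inference backends usable on a given device type."""
    try:
        return SUPPORTED_BACKENDS[device]
    except KeyError:
        raise ValueError(f'unsupported device type: {device}')
```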
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+ recommended) | Windows supported for PyTorch backend only (TurboMind builds disabled) |
| Hardware | NVIDIA GPU with Compute Capability >= 7.0 | V100 (SM70), RTX 2080 (SM75), A100 (SM80), RTX 3090 (SM86), RTX 4090 (SM89), H100 (SM90), RTX 5090 (SM120) |
| VRAM | Minimum 8GB | Model-dependent; 7B models need ~14GB FP16, quantized models need less |
| CUDA Toolkit | >= 11.0 | BFloat16 requires CUDA >= 11 and SM >= 80; FP8 requires SM >= 90 |
| Disk | 20GB+ SSD | For model weights and CUDA toolkit |
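The VRAM row can be sanity-checked with simple arithmetic: weight memory is roughly parameter count times bytes per parameter. A back-of-the-envelope sketch (ignores KV cache and activation overhead):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal GB (weights only)."""
    return n_params * bytes_per_param / 1e9

# A 7B model in FP16 (2 bytes/param) needs about 14 GB for weights alone;
# 4-bit quantization (0.5 bytes/param) cuts that to about 3.5 GB.
fp16_gb = weight_memory_gb(7e9, 2)    # 14.0
int4_gb = weight_memory_gb(7e9, 0.5)  # 3.5
```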
Dependencies
System Packages
- NVIDIA GPU Driver (compatible with CUDA toolkit version)
- CUDA Toolkit >= 11.0 (for TurboMind build)
- `nvidia-nccl` (multi-GPU tensor parallelism)
- `nvidia-cuda-runtime` (CUDA runtime libraries)
- `nvidia-cublas` (matrix operations)
- `nvidia-curand` (random number generation)
Python Packages
- `torch` >= 2.0.0, <= 2.8.0
- `torchvision` >= 0.15.0, <= 0.23.0
- `triton` >= 3.0.0, <= 3.4.0 (Linux x86_64 only, required for PyTorch backend kernels)
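The version bounds above can be enforced with a small startup guard (an illustrative snippet using the bounds listed here; it compares plain dotted numeric versions and does not handle suffixes like `+cu121`):

```python
def in_range(version: str, low: str, high: str) -> bool:
    """True if low <= version <= high, comparing dotted numeric versions."""
    def key(v: str):
        return tuple(int(p) for p in v.split('.')[:3])
    return key(low) <= key(version) <= key(high)

# torch must satisfy >= 2.0.0, <= 2.8.0 per the table above
assert in_range('2.4.1', '2.0.0', '2.8.0')
assert not in_range('2.9.0', '2.0.0', '2.8.0')
```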
Credentials
No credentials required for the core CUDA runtime. Model downloading may require:
- `HF_TOKEN`: HuggingFace API token for gated models (set via `huggingface-cli login`).
- `LMDEPLOY_USE_MODELSCOPE`: Set to `'True'` to download models from ModelScope instead of HuggingFace.
- `LMDEPLOY_USE_OPENMIND_HUB`: Set to `'True'` to download from OpenMind Hub.
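The two hub switches act as boolean flags with HuggingFace as the default. A resolution sketch using the documented variable names (the helper itself, and the precedence when both flags are set, are assumptions for illustration):

```python
import os

def resolve_model_hub(env=None) -> str:
    """Pick the model download hub from the documented environment switches.

    Illustrative only: checks the two flags, falls back to HuggingFace.
    """
    env = os.environ if env is None else env
    if env.get('LMDEPLOY_USE_MODELSCOPE') == 'True':
        return 'modelscope'
    if env.get('LMDEPLOY_USE_OPENMIND_HUB') == 'True':
        return 'openmind_hub'
    return 'huggingface'
```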
Quick Install
# Install LMDeploy with CUDA support (pre-built wheel)
pip install lmdeploy
# Or install with all optional dependencies
pip install 'lmdeploy[all]'
# For building from source with TurboMind
pip install -r requirements/build.txt
pip install -r requirements/runtime_cuda.txt
Code Evidence
CUDA availability check from `lmdeploy/pytorch/check_env/cuda.py:12-22`:
def check(self):
    """check."""
    import torch
    if not torch.cuda.is_available():
        self.log_and_exit(mod_name='CUDA', message='cuda is not available.')
    if self.model_format == 'fp8':
        props = torch.cuda.get_device_properties(0)
        if props.major < 9:
            self.log_and_exit(mod_name='CUDA', message='model_format=fp8 requires sm>=9.0.')
BFloat16 support detection from `lmdeploy/utils.py:389-405`:
def is_bf16_supported(device_type: str = 'cuda'):
    if device_type == 'cuda':
        import torch
        device = torch.cuda.current_device()
        cuda_version = torch.version.cuda
        if (cuda_version is not None and int(cuda_version.split('.')[0]) >= 11
                and torch.cuda.get_device_properties(device).major >= 8):
            return True
        else:
            return False
CUDA architecture targets from `CMakeLists.txt:226-253`:
if (NOT CMAKE_CUDA_ARCHITECTURES)
    set(CMAKE_CUDA_ARCHITECTURES "")
    if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_LESS "13.0")
        list(APPEND CMAKE_CUDA_ARCHITECTURES 70-real 75-real) # V100, 2080
    endif()
    if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "11")
        list(APPEND CMAKE_CUDA_ARCHITECTURES 80-real) # A100
    endif ()
    if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "12.0")
        list(APPEND CMAKE_CUDA_ARCHITECTURES 90a-real) # H100
    endif ()
    if (${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL "12.8")
        list(APPEND CMAKE_CUDA_ARCHITECTURES 120a-real) # 5090
    endif ()
endif ()
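The selection logic above reduces to a few nvcc version thresholds, mirrored here in Python for quick inspection (illustrative only, not part of the build system):

```python
def default_cuda_archs(nvcc_version: str):
    """Mirror of the CMake defaults: arch list chosen per nvcc version."""
    v = tuple(int(p) for p in nvcc_version.split('.')[:2])
    archs = []
    if v < (13, 0):
        archs += ['70-real', '75-real']  # V100, 2080 (dropped from CUDA 13)
    if v >= (11, 0):
        archs += ['80-real']             # A100
    if v >= (12, 0):
        archs += ['90a-real']            # H100
    if v >= (12, 8):
        archs += ['120a-real']           # 5090
    return archs
```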
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `cuda is not available.` | No NVIDIA GPU detected or CUDA drivers not installed | Install NVIDIA drivers and CUDA toolkit; verify with `nvidia-smi` |
| `model_format=fp8 requires sm>=9.0.` | Attempting FP8 quantization on pre-Hopper GPU | Use H100/H200 (SM90+) or switch to INT4/INT8 quantization |
| `RuntimeError: CUDA out of memory` | Insufficient GPU VRAM for model + KV cache | Reduce `cache_max_entry_count` (e.g., 0.2), use quantization, or increase tensor parallelism (`tp`) |
| `Fallback to pytorch engine because turbomind engine is not installed correctly` | TurboMind C++ extension not built/installed | Reinstall lmdeploy from PyPI or build from source with CUDA |
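For the out-of-memory case, the suggested mitigations map directly onto engine-config fields. A configuration sketch (the field names are real LMDeploy config options; the values and model name are illustrative):

```python
from lmdeploy import pipeline, TurbomindEngineConfig

# Shrink the KV cache's share of free VRAM and shard the model across
# two GPUs; both values here are illustrative starting points.
engine_config = TurbomindEngineConfig(
    cache_max_entry_count=0.2,  # fraction of free VRAM reserved for KV cache
    tp=2,                       # tensor parallelism across 2 GPUs
)
pipe = pipeline('internlm/internlm2_5-7b-chat', backend_config=engine_config)
```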
Compatibility Notes
- Windows: TurboMind backend is disabled on Windows (`DISABLE_TURBOMIND`); only the PyTorch backend is available. Multi-GPU is also disabled on Windows.
- aarch64 (ARM/Jetson): Supported with SM72 and SM87 architectures. Triton is not available on ARM.
- MSVC (Windows build): SM80 and SM90a architectures are excluded from MSVC builds.
- FP8 Models: Require Hopper or newer GPUs (SM >= 9.0) and CUDA >= 12.0.
- FlashAttention-3: Requires SM 9.0 and CUDA >= 12.3 (`flash_attn_interface` package).
- BFloat16: Requires CUDA >= 11 and compute capability >= 8.0 (Ampere+).
- Alternative Devices: PyTorch backend supports `ascend` (Huawei NPU), `maca` (MetaX), `camb` (Cambricon MLU) via the `dlinfer` framework.
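The capability gates in the notes above can be collapsed into a single pure function for planning a deployment (an illustrative summary of the stated thresholds, not LMDeploy code):

```python
from typing import Optional

def cuda_features(sm: float, cuda: Optional[float]) -> set:
    """Feature set implied by the compatibility notes above (illustrative)."""
    feats = set()
    if cuda is None:
        return feats
    if sm >= 7.0:
        feats.add('turbomind')            # minimum supported capability
    if sm >= 8.0 and cuda >= 11.0:
        feats.add('bf16')                 # Ampere+ on CUDA 11+
    if sm >= 9.0 and cuda >= 12.0:
        feats.add('fp8')                  # Hopper+ on CUDA 12+
    if sm >= 9.0 and cuda >= 12.3:
        feats.add('flash_attention_3')    # Hopper on CUDA 12.3+
    return feats
```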
Related Pages
- Implementation:InternLM_Lmdeploy_TurbomindEngineConfig
- Implementation:InternLM_Lmdeploy_PytorchEngineConfig
- Implementation:InternLM_Lmdeploy_Pipeline_Factory
- Implementation:InternLM_Lmdeploy_Autoget_Backend
- Implementation:InternLM_Lmdeploy_Auto_Awq
- Implementation:InternLM_Lmdeploy_Smooth_Quant
- Implementation:InternLM_Lmdeploy_Pipeline_Factory_AWQ
- Implementation:InternLM_Lmdeploy_Pipeline_Factory_Pytorch