Environment:LMCache LMCache CUDA GPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
NVIDIA CUDA GPU runtime environment with CUDA 12.x, supporting compute capabilities 7.0 through 10.0 (V100 to B200).
Description
This environment provides the GPU-accelerated runtime required by LMCache for KV cache operations. LMCache compiles C++/CUDA extensions (memory kernels, arithmetic coding, positional encoding kernels) that require an NVIDIA GPU with CUDA 12.x. The build system supports CUDA compute capabilities 7.0 (V100), 7.5 (T4), 8.0 (A100/A30), 8.6 (A40/A10), 8.9 (L4/L40/L40S), 9.0 (H100/H200), and 10.0 (B200/GB200). An AMD ROCm (HIP) build path is also available via the `BUILD_WITH_HIP` environment variable. When CUDA is not available, LMCache falls back to non-CUDA equivalents with reduced functionality (no pinned memory, no NUMA awareness).
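The compute-capability gating described above can be illustrated with a tiny helper (hypothetical, not part of LMCache) that checks whether a given capability falls inside the default `TORCH_CUDA_ARCH_LIST` the build targets:

```python
# Default architecture list from LMCache's build configuration (pyproject.toml).
DEFAULT_ARCH_LIST = "7.0;7.5;8.0;8.6;8.9;9.0;10.0"

def arch_supported(capability: str, arch_list: str = DEFAULT_ARCH_LIST) -> bool:
    """Return True if a compute capability string (e.g. '8.0') is a build target."""
    return capability in arch_list.split(";")
```

For example, `arch_supported("8.0")` holds for A100, while `arch_supported("6.1")` (Pascal) does not, matching the compute capability >= 7.0 floor.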
Usage
Use this environment for any LMCache deployment that requires GPU-accelerated KV cache operations. This includes all workflows: KV Cache Offloading (GPU-to-CPU transfers), Disaggregated Prefill (RDMA-based KV transfer), P2P KV Cache Sharing (cross-instance GPU transfers), and CacheBlend KV Reuse (RoPE kernel operations). CPU-only mode is available but limited to local storage backends without GPU memory management.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (POSIX) | `pyproject.toml` classifies as `Operating System :: POSIX :: Linux` |
| Hardware | NVIDIA GPU (compute capability >= 7.0) | V100, T4, A100, A40, L4, L40S, H100, H200, B200 |
| CUDA Toolkit | CUDA 12.x | `cupy-cuda12x` required; build image uses CUDA 12.8 |
| Alternative | AMD ROCm (HIP) | Set `BUILD_WITH_HIP=1` and `ROCM_PATH` (default `/opt/rocm`) |
| Alternative | Intel XPU | Supported via vLLM platform detection; separate XPU connectors |
Dependencies
System Packages
- CUDA Toolkit 12.x (for NVIDIA builds)
- ROCm (for AMD HIP builds, optional)
- C++ compiler with C++11 ABI support
Python Packages
- `torch` (build pins `torch==2.8.0`; runtime is flexible)
- `cupy-cuda12x` (CUDA 12.x array library)
- `nvidia-ml-py` (GPU monitoring via pynvml, CUDA extras)
- `ray` >= 2.9 (CUDA extras)
- `xformers` (CUDA extras, optimized transformers)
- `nvtx` (optional, NVIDIA profiling markers with fallback)
- `numpy` <= 2.2.6 (constrained by numba/NIXL compatibility)
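The `numpy` ceiling above is the kind of pin worth verifying in an existing environment before building. A hypothetical checker (the helper name and padding behavior are illustrative, not LMCache code):

```python
import re
from importlib.metadata import PackageNotFoundError, version

def numpy_within_constraint(ceiling=(2, 2, 6)) -> bool:
    """Check the installed numpy against the <= 2.2.6 ceiling noted above."""
    try:
        raw = version("numpy")
    except PackageNotFoundError:
        return False  # numpy is absent; it still needs to be installed
    parts = []
    for piece in raw.split(".")[:3]:
        m = re.match(r"\d+", piece)
        parts.append(int(m.group()) if m else 0)
    while len(parts) < 3:  # pad short versions like "2.2" for tuple comparison
        parts.append(0)
    return tuple(parts) <= ceiling
```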
Build Environment Variables
The following build-time environment variables control the CUDA/HIP build:
- `NO_CUDA_EXT`: Set to `"1"` to skip building CUDA extensions entirely (sdist mode).
- `BUILD_WITH_HIP`: Set to `"1"` to build ROCm/HIP extensions instead of CUDA.
- `ENABLE_CXX11_ABI`: Set to `"0"` to disable C++11 ABI (default `"1"`).
- `ROCM_PATH`: Path to ROCm installation (default `/opt/rocm`).
- `TORCH_CUDA_ARCH_LIST`: CUDA compute capabilities to target (default `"7.0;7.5;8.0;8.6;8.9;9.0;10.0"`).
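Read together, the flags default to a CUDA build with the C++11 ABI enabled and the full seven-architecture target list. A sketch of that default handling (mirroring the descriptions above, not LMCache's exact `setup.py` logic):

```python
import os

def read_build_flags(env=os.environ):
    """Resolve the build flags documented above, applying their defaults."""
    return {
        "no_cuda_ext": env.get("NO_CUDA_EXT", "0") == "1",
        "build_with_hip": env.get("BUILD_WITH_HIP", "0") == "1",
        "cxx11_abi": env.get("ENABLE_CXX11_ABI", "1") == "1",
        "rocm_path": env.get("ROCM_PATH", "/opt/rocm"),
        "arch_list": env.get("TORCH_CUDA_ARCH_LIST", "7.0;7.5;8.0;8.6;8.9;9.0;10.0"),
    }
```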
Quick Install
```bash
# Standard CUDA install (requires CUDA 12.x toolkit)
pip install lmcache

# Install from source (recommended for version flexibility)
pip install -e . --no-build-isolation

# For AMD ROCm builds
BUILD_WITH_HIP=1 pip install -e . --no-build-isolation

# Skip CUDA extensions (CPU-only)
NO_CUDA_EXT=1 pip install -e .
```
Code Evidence
CUDA availability check with C extension fallback from `lmcache/v1/memory_management.py:26-31`:
```python
if torch.cuda.is_available():
    # First Party
    import lmcache.c_ops as lmc_ops
else:
    # First Party
    import lmcache.non_cuda_equivalents as lmc_ops
```
Build system GPU architecture targeting from `pyproject.toml:158`:
```toml
# see https://developer.nvidia.com/cuda-gpus for compute capabilities
# 7.0: V100
# 7.5: T4
# 8.0: A100, A30
# 8.6: A40, A10, A16, A2
# 8.9: L4, L40, L40S
# 9.0: GH200, H200, H100
# 10.0: GB200, B200
environment = {TORCH_CUDA_ARCH_LIST = "7.0;7.5;8.0;8.6;8.9;9.0;10.0"}
```
HIP/ROCm build path from `setup.py:19,133`:
```python
BUILD_WITH_HIP = os.environ.get("BUILD_WITH_HIP", "0") == "1"
# ...
define_macros = [("__HIP_PLATFORM_HCC__", "1"), ("USE_ROCM", "1")]
```
XPU device detection from `lmcache/v1/gpu_connector/__init__.py:88-97`:
```python
if dev_name == "cuda":
    if config.use_gpu_connector_v3:
        return VLLMPagedMemGPUConnectorV3.from_metadata(metadata, use_gpu, device)
    else:
        return VLLMPagedMemGPUConnectorV2.from_metadata(metadata, use_gpu, device)
elif dev_name == "xpu":
    from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2

    return VLLMPagedMemXPUConnectorV2.from_metadata(metadata, use_gpu, device)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: lmcache.c_ops` | CUDA extensions not compiled | Rebuild with CUDA toolkit installed: `pip install -e . --no-build-isolation` |
| `RuntimeError: No supported connector found for the current platform` | Unsupported GPU platform | Ensure NVIDIA CUDA, AMD ROCm, or Intel XPU runtime is available |
| `CUDA out of memory` | Insufficient GPU VRAM | Reduce `max_local_cpu_size` or use CPU-only backends |
| `torch.cuda.is_available()` returns `False` | No CUDA runtime detected | Install an NVIDIA driver with CUDA 12.x support and verify the GPU is visible with `nvidia-smi` |
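A quick way to distinguish the first failure mode (missing compiled extensions) from a missing install is to probe which ops module is importable. A hypothetical diagnostic sketch (`probe_ops_module` and `diagnose` are illustrative names, not LMCache APIs):

```python
import importlib.util

def probe_ops_module(name: str) -> bool:
    """Return True if the named module could be imported, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False  # parent package (lmcache) is not installed at all

def diagnose() -> str:
    if probe_ops_module("lmcache.c_ops"):
        return "c_ops available (CUDA extensions compiled)"
    if probe_ops_module("lmcache.non_cuda_equivalents"):
        return "fallback only (rebuild with the CUDA toolkit for full functionality)"
    return "lmcache not installed"
```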
Compatibility Notes
- NVIDIA CUDA: Primary supported platform. Requires compute capability >= 7.0.
- AMD ROCm (HIP): Supported via `BUILD_WITH_HIP=1`. Uses hipify to convert CUDA sources. Set `CXX=hipcc` during build.
- Intel XPU: Supported through vLLM platform detection. Uses separate `VLLMPagedMemXPUConnectorV2` connector.
- CPU-only: Falls back to `non_cuda_equivalents` module. No pinned memory, no NUMA awareness, no GPU memory management.
- Build isolation: Recommended to use `--no-build-isolation` to avoid torch version conflicts with serving engines.
Related Pages
- Implementation:LMCache_LMCache_LMCacheEngine_Store
- Implementation:LMCache_LMCache_LMCacheEngine_Retrieve
- Implementation:LMCache_LMCache_LMCacheConnectorV1Impl_Init
- Implementation:LMCache_LMCache_PDBackend_Batched_Submit_Put_Task
- Implementation:LMCache_LMCache_P2PBackend_Batched_Get_Non_Blocking
- Implementation:LMCache_LMCache_LMCBlender_Blend