Environment:LMCache LMCache CUDA GPU Runtime
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, GPU_Computing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
NVIDIA CUDA GPU runtime environment with CUDA 12.x, supporting compute capabilities 7.0 through 10.0 (V100 to B200).
Description
This environment provides the GPU-accelerated runtime required by LMCache for KV cache operations. LMCache compiles C++/CUDA extensions (memory kernels, arithmetic coding, positional encoding kernels) that require an NVIDIA GPU with CUDA 12.x. The build system supports CUDA compute capabilities 7.0 (V100), 7.5 (T4), 8.0 (A100/A30), 8.6 (A40/A10), 8.9 (L4/L40/L40S), 9.0 (H100/H200), and 10.0 (B200/GB200). An AMD ROCm (HIP) build path is also available via the `BUILD_WITH_HIP` environment variable. When CUDA is not available, LMCache falls back to non-CUDA equivalents with reduced functionality (no pinned memory, no NUMA awareness).
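The compute-capability gating described above can be illustrated with a tiny helper (hypothetical, not part of LMCache) that checks whether a given capability falls inside the default `TORCH_CUDA_ARCH_LIST` the build targets:

```python
# Default architecture list from LMCache's build configuration (pyproject.toml).
DEFAULT_ARCH_LIST = "7.0;7.5;8.0;8.6;8.9;9.0;10.0"

def arch_supported(capability: str, arch_list: str = DEFAULT_ARCH_LIST) -> bool:
    """Return True if a compute capability string (e.g. '8.0') is a build target."""
    return capability in arch_list.split(";")
```

For example, `arch_supported("8.0")` holds for A100, while `arch_supported("6.1")` (Pascal) does not, matching the compute capability >= 7.0 floor.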
Usage
Use this environment for any LMCache deployment that requires GPU-accelerated KV cache operations. This includes all workflows: KV Cache Offloading (GPU-to-CPU transfers), Disaggregated Prefill (RDMA-based KV transfer), P2P KV Cache Sharing (cross-instance GPU transfers), and CacheBlend KV Reuse (RoPE kernel operations). CPU-only mode is available but limited to local storage backends without GPU memory management.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (POSIX) | `pyproject.toml` classifies as `Operating System :: POSIX :: Linux` |
| Hardware | NVIDIA GPU (compute capability >= 7.0) | V100, T4, A100, A40, L4, L40S, H100, H200, B200 |
| CUDA Toolkit | CUDA 12.x | `cupy-cuda12x` required; build image uses CUDA 12.8 |
| Alternative | AMD ROCm (HIP) | Set `BUILD_WITH_HIP=1` and `ROCM_PATH` (default `/opt/rocm`) |
| Alternative | Intel XPU | Supported via vLLM platform detection; separate XPU connectors |
Dependencies
System Packages
- CUDA Toolkit 12.x (for NVIDIA builds)
- ROCm (for AMD HIP builds, optional)
- C++ compiler with C++11 ABI support
Python Packages
- `torch` (build pins `torch==2.8.0`; runtime is flexible)
- `cupy-cuda12x` (CUDA 12.x array library)
- `nvidia-ml-py` (GPU monitoring via pynvml, CUDA extras)
- `ray` >= 2.9 (CUDA extras)
- `xformers` (CUDA extras, optimized transformers)
- `nvtx` (optional, NVIDIA profiling markers with fallback)
- `numpy` <= 2.2.6 (constrained by numba/NIXL compatibility)
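The `numpy` ceiling above is the kind of pin worth verifying in an existing environment before building. A hypothetical checker (the helper name and padding behavior are illustrative, not LMCache code):

```python
import re
from importlib.metadata import PackageNotFoundError, version

def numpy_within_constraint(ceiling=(2, 2, 6)) -> bool:
    """Check the installed numpy against the <= 2.2.6 ceiling noted above."""
    try:
        raw = version("numpy")
    except PackageNotFoundError:
        return False  # numpy is absent; it still needs to be installed
    parts = []
    for piece in raw.split(".")[:3]:
        m = re.match(r"\d+", piece)
        parts.append(int(m.group()) if m else 0)
    while len(parts) < 3:  # pad short versions like "2.2" for tuple comparison
        parts.append(0)
    return tuple(parts) <= ceiling
```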
Build Environment Variables
The following build-time environment variables control the CUDA/HIP build:
- `NO_CUDA_EXT`: Set to `"1"` to skip building CUDA extensions entirely (sdist mode).
- `BUILD_WITH_HIP`: Set to `"1"` to build ROCm/HIP extensions instead of CUDA.
- `ENABLE_CXX11_ABI`: Set to `"0"` to disable C++11 ABI (default `"1"`).
- `ROCM_PATH`: Path to ROCm installation (default `/opt/rocm`).
- `TORCH_CUDA_ARCH_LIST`: CUDA compute capabilities to target (default `"7.0;7.5;8.0;8.6;8.9;9.0;10.0"`).
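Read together, the flags default to a CUDA build with the C++11 ABI enabled and the full seven-architecture target list. A sketch of that default handling (mirroring the descriptions above, not LMCache's exact `setup.py` logic):

```python
import os

def read_build_flags(env=os.environ):
    """Resolve the build flags documented above, applying their defaults."""
    return {
        "no_cuda_ext": env.get("NO_CUDA_EXT", "0") == "1",
        "build_with_hip": env.get("BUILD_WITH_HIP", "0") == "1",
        "cxx11_abi": env.get("ENABLE_CXX11_ABI", "1") == "1",
        "rocm_path": env.get("ROCM_PATH", "/opt/rocm"),
        "arch_list": env.get("TORCH_CUDA_ARCH_LIST", "7.0;7.5;8.0;8.6;8.9;9.0;10.0"),
    }
```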
Quick Install
```bash
# Standard CUDA install (requires CUDA 12.x toolkit)
pip install lmcache

# Install from source (recommended for version flexibility)
pip install -e . --no-build-isolation

# For AMD ROCm builds
BUILD_WITH_HIP=1 pip install -e . --no-build-isolation

# Skip CUDA extensions (CPU-only)
NO_CUDA_EXT=1 pip install -e .
```
Code Evidence
CUDA availability check with C extension fallback from `lmcache/v1/memory_management.py:26-31`:
```python
if torch.cuda.is_available():
    # First Party
    import lmcache.c_ops as lmc_ops
else:
    # First Party
    import lmcache.non_cuda_equivalents as lmc_ops
```
Build system GPU architecture targeting from `pyproject.toml:158`:
```toml
# see https://developer.nvidia.com/cuda-gpus for compute capabilities
# 7.0: V100
# 7.5: T4
# 8.0: A100, A30
# 8.6: A40, A10, A16, A2
# 8.9: L4, L40, L40S
# 9.0: GH200, H200, H100
# 10.0: GB200, B200
environment = {TORCH_CUDA_ARCH_LIST = "7.0;7.5;8.0;8.6;8.9;9.0;10.0"}
```
HIP/ROCm build path from `setup.py:19,133`:
```python
BUILD_WITH_HIP = os.environ.get("BUILD_WITH_HIP", "0") == "1"
# ...
define_macros = [("__HIP_PLATFORM_HCC__", "1"), ("USE_ROCM", "1")]
```
XPU device detection from `lmcache/v1/gpu_connector/__init__.py:88-97`:
```python
if dev_name == "cuda":
    if config.use_gpu_connector_v3:
        return VLLMPagedMemGPUConnectorV3.from_metadata(metadata, use_gpu, device)
    else:
        return VLLMPagedMemGPUConnectorV2.from_metadata(metadata, use_gpu, device)
elif dev_name == "xpu":
    from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2

    return VLLMPagedMemXPUConnectorV2.from_metadata(metadata, use_gpu, device)
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: lmcache.c_ops` | CUDA extensions not compiled | Rebuild with CUDA toolkit installed: `pip install -e . --no-build-isolation` |
| `RuntimeError: No supported connector found for the current platform` | Unsupported GPU platform | Ensure NVIDIA CUDA, AMD ROCm, or Intel XPU runtime is available |
| `CUDA out of memory` | Insufficient GPU VRAM | Reduce `max_local_cpu_size` or use CPU-only backends |
| `torch.cuda.is_available()` returns `False` | No CUDA runtime detected | Install an NVIDIA driver with CUDA 12.x support and verify the GPU is visible with `nvidia-smi` |
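A quick way to distinguish the first failure mode (missing compiled extensions) from a missing install is to probe which ops module is importable. A hypothetical diagnostic sketch (`probe_ops_module` and `diagnose` are illustrative names, not LMCache APIs):

```python
import importlib.util

def probe_ops_module(name: str) -> bool:
    """Return True if the named module could be imported, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        return False  # parent package (lmcache) is not installed at all

def diagnose() -> str:
    if probe_ops_module("lmcache.c_ops"):
        return "c_ops available (CUDA extensions compiled)"
    if probe_ops_module("lmcache.non_cuda_equivalents"):
        return "fallback only (rebuild with the CUDA toolkit for full functionality)"
    return "lmcache not installed"
```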
Compatibility Notes
- NVIDIA CUDA: Primary supported platform. Requires compute capability >= 7.0.
- AMD ROCm (HIP): Supported via `BUILD_WITH_HIP=1`. Uses hipify to convert CUDA sources. Set `CXX=hipcc` during build.
- Intel XPU: Supported through vLLM platform detection. Uses separate `VLLMPagedMemXPUConnectorV2` connector.
- CPU-only: Falls back to `non_cuda_equivalents` module. No pinned memory, no NUMA awareness, no GPU memory management.
- Build isolation: Recommended to use `--no-build-isolation` to avoid torch version conflicts with serving engines.
Related Pages
- Implementation:LMCache_LMCache_LMCacheEngine_Store
- Implementation:LMCache_LMCache_LMCacheEngine_Retrieve
- Implementation:LMCache_LMCache_LMCacheConnectorV1Impl_Init
- Implementation:LMCache_LMCache_PDBackend_Batched_Submit_Put_Task
- Implementation:LMCache_LMCache_P2PBackend_Batched_Get_Non_Blocking
- Implementation:LMCache_LMCache_LMCBlender_Blend