
Environment: LMCache vLLM Serving Engine

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, LLM_Serving
Last Updated: 2026-02-09 00:00 GMT

Overview

A vLLM serving-engine integration environment providing the `KVConnectorBase_V1` interface that LMCache hooks into for transparent KV cache management.

Description

LMCache integrates with vLLM as a KV cache connector via the `KVConnectorBase_V1` interface. The integration adapter (`LMCacheConnectorV1Impl` in `vllm_v1_adapter.py`) handles version compatibility across different vLLM releases. LMCache dynamically dispatches between connector implementations based on the detected vLLM version and handles API differences between vLLM releases (e.g., `torch_utils` module location changes). The vLLM integration supports all four LMCache workflows and is the primary deployment mode. SGLang is also supported as an alternative serving engine.
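The version-based dispatch described above can be sketched in plain Python. The threshold version and the returned names below are illustrative assumptions, not LMCache's actual dispatch table:

```python
def parse_vllm_version(version_str: str) -> tuple:
    """Parse an 'X.Y.Z'-style version string into a comparable tuple.
    Pre-release suffixes are handled only crudely; this is a sketch."""
    parts = []
    for piece in version_str.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def select_connector_impl(vllm_version: str) -> str:
    # Hypothetical threshold: route newer vLLM releases to the V1 adapter
    # and everything else to a fallback path.
    if parse_vllm_version(vllm_version) >= (0, 6, 0):
        return "LMCacheConnectorV1Impl"
    return "legacy-adapter"
```

The real adapter additionally papers over API differences (such as the `torch_utils` module move) with try/except imports, shown under Code Evidence below.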

Usage

Use this environment when deploying LMCache with vLLM as the serving engine. The vLLM connector is activated by setting the `LMCACHE_CONFIG_FILE` environment variable and launching vLLM with `--kv-connector LMCacheConnectorV1`. This is the standard deployment path for all production use cases, including KV cache offloading, disaggregated prefill, P2P sharing, and CacheBlend.
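A minimal sketch of the YAML file that `LMCACHE_CONFIG_FILE` points to. The key names follow common LMCache examples but should be verified against the installed LMCache version:

```yaml
# Minimal LMCache config sketch -- verify key names against your
# installed LMCache version before relying on them.
chunk_size: 256          # tokens per KV cache chunk
local_cpu: true          # enable CPU-RAM offloading
max_local_cpu_size: 5.0  # CPU cache budget in GiB
```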

System Requirements

| Category | Requirement               | Notes                                                    |
|----------|---------------------------|----------------------------------------------------------|
| vLLM     | Compatible version        | LMCache handles multiple vLLM API versions dynamically   |
| GPU      | NVIDIA CUDA or Intel XPU  | vLLM platform detection selects the appropriate backend  |
| Python   | >= 3.10                   | Aligned with LMCache's Python requirement                |

Dependencies

Python Packages

  • `vllm` (provides `KVConnectorBase_V1`, `current_platform`)
  • `torch` (version determined by vLLM installation)
  • `lmcache` (installed as KV connector plugin)

Environment Variables

  • `LMCACHE_CONFIG_FILE`: Path to LMCache YAML configuration (required for vLLM integration).
  • `LMCACHE_FORCE_SKIP_SAVE`: Set to skip all cache save operations (optional runtime override).
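A sketch of how these two variables would typically be consumed at startup. The helper below is illustrative, not part of LMCache's public API:

```python
import os

def load_lmcache_env():
    """Read the LMCache-related environment variables described above.

    Illustrative helper, not LMCache code: the required config path must
    be present, while the skip-save override counts as enabled whenever
    it is set to any non-empty value.
    """
    config_file = os.environ.get("LMCACHE_CONFIG_FILE")
    if config_file is None:
        raise RuntimeError(
            "LMCACHE_CONFIG_FILE must point to an LMCache YAML config"
        )
    skip_save = bool(os.environ.get("LMCACHE_FORCE_SKIP_SAVE"))
    return config_file, skip_save
```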

Quick Install

# Install vLLM (determines torch version)
pip install vllm

# Install LMCache from source (run inside a cloned LMCache checkout;
# --no-build-isolation preserves vLLM's torch)
pip install -e . --no-build-isolation

# Launch vLLM with LMCache connector
LMCACHE_CONFIG_FILE=config.yaml vllm serve model_name --kv-connector LMCacheConnectorV1

Code Evidence

vLLM version import from `lmcache/integration/vllm/vllm_v1_adapter.py:23`:

from vllm.version import __version__ as VLLM_VERSION

vLLM API version compatibility from `lmcache/integration/vllm/utils.py:174-179`:

try:
    from vllm.utils.torch_utils import ...
except ImportError:
    from vllm.utils import ...
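This new-path-first, old-path-fallback shim is a general compatibility pattern. Since the exact vLLM symbols are elided in the excerpt above, the standalone illustration below uses a deliberately nonexistent module and a stdlib fallback so it runs without vLLM installed:

```python
# Compatibility shim pattern: try the "new" import location first and
# fall back to a known-good one. The new location here is intentionally
# fake so the except branch fires, mirroring what happens on an older
# vLLM release.
try:
    from vllm_new_location import helper  # hypothetical new path; fails here
except ImportError:
    from math import sqrt as helper       # stable fallback location
```

The same structure lets LMCache support both pre- and post-`torch_utils` vLLM releases with a single code path.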

Platform detection for device selection from `lmcache/integration/vllm/utils.py:293-310`:

def get_torch_device():
    # Detects CUDA vs XPU via vLLM's current_platform
    if current_platform.is_cuda():
        torch_dev = torch.cuda
    elif current_platform.is_xpu():
        torch_dev = torch.xpu
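Stripped of the torch and vLLM dependencies, the selection logic reduces to a two-way branch. The `Platform` class below is a hypothetical stand-in for vLLM's `current_platform` object:

```python
class Platform:
    """Hypothetical stand-in for vLLM's current_platform object."""
    def __init__(self, kind: str):
        self.kind = kind
    def is_cuda(self) -> bool:
        return self.kind == "cuda"
    def is_xpu(self) -> bool:
        return self.kind == "xpu"

def get_device_namespace(platform: Platform) -> str:
    # Mirrors the CUDA-vs-XPU branch in utils.py, returning the name of
    # the torch device namespace that would be selected.
    if platform.is_cuda():
        return "torch.cuda"
    if platform.is_xpu():
        return "torch.xpu"
    raise RuntimeError("unsupported platform")
```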

Tensor parallel assumption from `lmcache/integration/vllm/utils.py:318-343`:

"""
Current assumption (TODO: add custom logic in the future):
- Tensor Parallel is intra-node
- Pipeline Parallel is inter-node
"""

Common Errors

| Error Message                           | Cause                                   | Solution                                                      |
|-----------------------------------------|-----------------------------------------|---------------------------------------------------------------|
| `ImportError: vllm.utils.torch_utils`   | Older vLLM version with a different API | LMCache handles this automatically via a fallback import      |
| `ImportError: vllm.platforms`           | vLLM version lacks the platform module  | LMCache falls back gracefully; GPU detection may be limited   |
| KV connector not recognized             | vLLM version too old                    | Update vLLM to a version supporting `KVConnectorBase_V1`      |
| `LMCACHE_CONFIG_FILE` not set           | Missing configuration                   | Set the environment variable to a valid YAML config path      |

Compatibility Notes

  • vLLM API versions: LMCache dynamically handles multiple vLLM versions via try/except import patterns. No specific vLLM version is pinned.
  • SGLang alternative: LMCache also integrates with SGLang via `lmcache/integration/sglang/sglang_adapter.py` using a similar config mechanism.
  • Torch version: LMCache intentionally does not pin torch at runtime to avoid conflicts with the serving engine's torch version.
  • Multi-process mode: vLLM multi-process deployments use `vllm_multi_process_adapter.py` for cross-process KV cache coordination.
