Environment: LMCache vLLM Serving Engine
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, LLM_Serving |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
vLLM serving engine integration environment providing the KVConnectorBase_V1 interface that LMCache hooks into for transparent KV cache management.
Description
LMCache integrates with vLLM as a KV cache connector via the `KVConnectorBase_V1` interface. The integration adapter (`LMCacheConnectorV1Impl` in `vllm_v1_adapter.py`) handles version compatibility across different vLLM releases. LMCache dynamically dispatches between connector implementations based on the detected vLLM version and handles API differences between vLLM releases (e.g., `torch_utils` module location changes). The vLLM integration supports all four LMCache workflows and is the primary deployment mode. SGLang is also supported as an alternative serving engine.
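The version-based dispatch described above can be sketched as follows. This is an illustrative, hypothetical helper (`select_adapter` and the version cutoff are assumptions, not LMCache's actual logic), showing the general shape of choosing a connector implementation from a detected vLLM version string:

```python
# Illustrative sketch only: dispatching to a connector adapter based on
# the detected vLLM version. The adapter names and the 0.8.0 cutoff are
# hypothetical; LMCache's real dispatch lives in vllm_v1_adapter.py.

def parse_version(version: str) -> tuple:
    """Turn a 'major.minor.patch' string into a comparable tuple."""
    return tuple(int(part) for part in version.split(".")[:3])

def select_adapter(vllm_version: str) -> str:
    """Pick an adapter module name for the detected vLLM version."""
    if parse_version(vllm_version) >= (0, 8, 0):
        return "vllm_v1_adapter"  # newer KVConnectorBase_V1-style API
    return "legacy_adapter"       # older vLLM releases
```

The same comparison could be driven by `vllm.version.__version__` at import time, which is the value LMCache reads (see Code Evidence below).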
Usage
Use this environment when deploying LMCache with vLLM as the serving engine. The vLLM connector is activated by setting the `LMCACHE_CONFIG_FILE` environment variable and launching vLLM with `--kv-connector LMCacheConnectorV1`. This is the standard deployment path for all production use cases, including KV cache offloading, disaggregated prefill, P2P sharing, and CacheBlend.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| vLLM | Compatible version | LMCache handles multiple vLLM API versions dynamically |
| GPU | NVIDIA CUDA or Intel XPU | vLLM platform detection selects appropriate backend |
| Python | >= 3.10 | Aligned with LMCache Python requirements |
Dependencies
Python Packages
- `vllm` (provides `KVConnectorBase_V1`, `current_platform`)
- `torch` (version determined by vLLM installation)
- `lmcache` (installed as KV connector plugin)
Environment Variables
- `LMCACHE_CONFIG_FILE`: Path to LMCache YAML configuration (required for vLLM integration).
- `LMCACHE_FORCE_SKIP_SAVE`: Set to skip all cache save operations (optional runtime override).
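How the two variables interact at startup can be sketched like this. The helper name `resolve_runtime_settings` is hypothetical (the real loading happens inside LMCache's config module); the sketch only illustrates that the config path is mandatory while the skip-save flag is an optional override:

```python
# Hypothetical sketch of startup env-var handling; not LMCache's actual code.

def resolve_runtime_settings(env: dict) -> dict:
    """Resolve LMCache runtime settings from an environment mapping."""
    config_path = env.get("LMCACHE_CONFIG_FILE")
    if config_path is None:
        # The vLLM integration requires an explicit YAML config.
        raise RuntimeError("LMCACHE_CONFIG_FILE must point to a YAML config")
    return {
        "config_path": config_path,
        # Any non-empty value disables all cache save operations.
        "skip_save": bool(env.get("LMCACHE_FORCE_SKIP_SAVE")),
    }
```

In a real deployment you would pass `os.environ` rather than a plain dict.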
Quick Install
```bash
# Install vLLM (determines torch version)
pip install vllm

# Install LMCache (from source, preserving vLLM's torch)
pip install -e . --no-build-isolation

# Launch vLLM with LMCache connector
LMCACHE_CONFIG_FILE=config.yaml vllm serve model_name --kv-connector LMCacheConnectorV1
```
Code Evidence
vLLM version import from `lmcache/integration/vllm/vllm_v1_adapter.py:23`:
```python
from vllm.version import __version__ as VLLM_VERSION
```
vLLM API version compatibility from `lmcache/integration/vllm/utils.py:174-179`:
```python
try:
    from vllm.utils.torch_utils import ...
except ImportError:
    from vllm.utils import ...
```
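The same try/except fallback pattern can be demonstrated with a real, self-contained example from the standard library, where a name moved between modules across Python versions (this is an illustration of the pattern, not LMCache code):

```python
# Same compatibility pattern, using a real stdlib relocation: `Mapping`
# moved from `collections` to `collections.abc` (the old alias was
# removed in Python 3.10), so code supporting both eras imports like this.
try:
    from collections.abc import Mapping  # current location (Python >= 3.3)
except ImportError:
    from collections import Mapping      # legacy location on very old Pythons
```

LMCache applies the same idiom to vLLM's `torch_utils` relocation, preferring the new path and falling back to the old one.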
Platform detection for device selection from `lmcache/integration/vllm/utils.py:293-310`:
```python
def get_torch_device():
    # Detects CUDA vs XPU via vLLM's current_platform
    if current_platform.is_cuda():
        torch_dev = torch.cuda
    elif current_platform.is_xpu():
        torch_dev = torch.xpu
```
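The dispatch above can be exercised without vLLM or a GPU by stubbing the platform object. `FakePlatform` and `device_name` below are illustrative stand-ins (the real code queries vLLM's `current_platform` and returns a torch device namespace); the fail-fast branch for unknown platforms is an assumption of this sketch:

```python
# Stand-in for vLLM's current_platform, for demonstration only.
class FakePlatform:
    def __init__(self, kind: str):
        self.kind = kind

    def is_cuda(self) -> bool:
        return self.kind == "cuda"

    def is_xpu(self) -> bool:
        return self.kind == "xpu"

def device_name(platform) -> str:
    """Map a platform probe to a device name (sketch of get_torch_device)."""
    if platform.is_cuda():
        return "cuda"  # would select the torch.cuda namespace
    if platform.is_xpu():
        return "xpu"   # would select the torch.xpu namespace
    raise RuntimeError("unsupported platform")  # sketch-only fail-fast
```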
Tensor parallel assumption from `lmcache/integration/vllm/utils.py:318-343`:
"""
Current assumption (TODO: add custom logic in the future):
- Tensor Parallel is intra-node
- Pipeline Parallel is inter-node
"""
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: vllm.utils.torch_utils` | Older vLLM version with different API | LMCache handles this automatically via fallback import |
| `ImportError: vllm.platforms` | vLLM version lacks platform module | LMCache falls back gracefully; GPU detection may be limited |
| KV connector not recognized | vLLM version too old | Update vLLM to a version supporting `KVConnectorBase_V1` |
| `LMCACHE_CONFIG_FILE` not set | Missing configuration | Set the environment variable to a valid YAML config path |
Compatibility Notes
- vLLM API versions: LMCache dynamically handles multiple vLLM versions via try/except import patterns. No specific vLLM version is pinned.
- SGLang alternative: LMCache also integrates with SGLang via `lmcache/integration/sglang/sglang_adapter.py` using a similar config mechanism.
- Torch version: LMCache intentionally does not pin torch at runtime to avoid conflicts with the serving engine's torch version.
- Multi-process mode: vLLM multi-process deployments use `vllm_multi_process_adapter.py` for cross-process KV cache coordination.