Principle:LMCache LMCache VLLM KV Connector Integration
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Serving |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
An integration pattern that bridges an external LLM serving engine with a KV cache management system through a standardized connector interface.
Description
vLLM KV Connector Integration is the pattern of embedding LMCache into vLLM's inference pipeline via the KVConnectorBase_V1 interface. vLLM defines a connector API (KVTransferConfig, KVConnectorRole) that external systems can implement to intercept KV cache operations during inference. LMCache provides LMCacheConnectorV1Dynamic as the entry point that delegates to LMCacheConnectorV1Impl, which initializes the full LMCache stack (manager, cache engine, storage backends, GPU connectors).
This enables transparent KV cache reuse: by plugging into vLLM's connector API, LMCache can store the KV caches of completed requests and retrieve them for new requests that share a prefix, without modifying vLLM's core inference logic.
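To make the idea of prefix reuse concrete, the following is a minimal sketch of chunk-level prefix matching. It is not LMCache's implementation: `PrefixStore`, `chunk_keys`, the chunk size of 4, and the SHA-256 chained key are all hypothetical choices made for illustration. Each key depends on every token up to the end of its chunk, so two requests share a key only when they share that whole prefix.

```python
import hashlib
from typing import Dict, List, Sequence

CHUNK_SIZE = 4  # hypothetical value, kept tiny for illustration


def chunk_keys(tokens: Sequence[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key covers all tokens before it."""
    keys: List[str] = []
    running = hashlib.sha256()
    full_length = len(tokens) - len(tokens) % chunk_size  # drop the partial tail
    for start in range(0, full_length, chunk_size):
        chunk = tokens[start:start + chunk_size]
        running.update((",".join(map(str, chunk)) + ";").encode())
        keys.append(running.hexdigest())
    return keys


class PrefixStore:
    """Toy stand-in for an external KV cache keyed by prefix chunks."""

    def __init__(self) -> None:
        self._kv: Dict[str, str] = {}

    def store(self, tokens: Sequence[int]) -> None:
        # A real system would store KV tensors; a placeholder string is used here.
        for index, key in enumerate(chunk_keys(tokens)):
            self._kv.setdefault(key, f"kv-chunk-{index}")

    def matched_tokens(self, tokens: Sequence[int]) -> int:
        """Count leading tokens whose KV cache is already stored."""
        hits = 0
        for key in chunk_keys(tokens):
            if key not in self._kv:
                break
            hits += 1
        return hits * CHUNK_SIZE


store = PrefixStore()
store.store(list(range(1, 11)))  # 10 tokens: two full chunks are cached
reused = store.matched_tokens([1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102])
print(reused)  # the first two chunks match, the third does not
```

Matching whole chunks rather than single tokens keeps the number of lookups small, at the cost of not reusing a trailing partial chunk.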
Usage
Use this principle when deploying LMCache with vLLM. Specify the connector at launch by passing a JSON value to --kv-transfer-config with kv_connector set to "LMCacheConnectorV1". The connector handles both scheduler-side operations (token matching, request tracking) and worker-side operations (KV cache load/save, GPU memory management).
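As a hedged illustration of the launch arguments, the sketch below builds the JSON value for --kv-transfer-config and a shell-safe command line. The `kv_connector` value comes from the text above. The `kv_role` value of `"kv_both"`, the `vllm serve` entry point, and the model name `my-org/my-model` are assumptions that are not stated in this document, so check them against the vLLM and LMCache versions you deploy.

```python
import json
import shlex

# Connector settings passed to vLLM as a JSON string.
kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",  # named in the Usage text
    "kv_role": "kv_both",  # assumption: this instance both saves and loads KV
}


def build_launch_command(model: str) -> str:
    """Return a shell-safe command line; the model name is a placeholder."""
    config_json = json.dumps(kv_transfer_config)
    parts = ["vllm", "serve", model, "--kv-transfer-config", config_json]
    return " ".join(shlex.quote(part) for part in parts)


command = build_launch_command("my-org/my-model")  # hypothetical model name
print(command)
```

Quoting with `shlex.quote` matters because the JSON value contains spaces and double quotes that a shell would otherwise split or strip.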
Theoretical Basis
The connector follows a dual-role architecture:
- Scheduler role: Runs in the scheduler process. Handles token matching via get_num_new_matched_tokens, builds connector metadata, and tracks unfinished requests.
- Worker role: Runs on each GPU worker. Handles the actual KV cache transfer: start_load_kv (retrieve from cache to GPU), save_kv_layer (store from GPU to cache), and wait_for_layer_load (synchronization barrier).
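The split between the two roles can be sketched with a toy connector. The method names are taken from the list above, but everything else is an assumption made for illustration: `Role` stands in for vLLM's KVConnectorRole, `ToyConnector` is hypothetical, the signatures are simplified and do not match the real interface, and a plain dictionary plays the part of both the external cache and GPU memory.

```python
from enum import Enum
from typing import Dict, Sequence, Tuple

Cache = Dict[Tuple[int, ...], Dict[int, str]]  # token prefix -> layer -> KV


class Role(Enum):  # hypothetical stand-in for vLLM's KVConnectorRole
    SCHEDULER = "scheduler"
    WORKER = "worker"


class ToyConnector:
    """One class, two roles; each instance only serves the methods of its role."""

    def __init__(self, role: Role, cache: Cache) -> None:
        self.role = role
        self.cache = cache                     # shared external KV cache
        self.gpu_layers: Dict[int, str] = {}   # stand-in for GPU KV memory
        self._pending: Dict[int, str] = {}

    def _require(self, role: Role) -> None:
        if self.role is not role:
            raise RuntimeError(f"{role.value} method called on a {self.role.value}")

    # Scheduler side: decide how many tokens can be served from the cache.
    def get_num_new_matched_tokens(self, tokens: Sequence[int], num_computed: int) -> int:
        self._require(Role.SCHEDULER)
        best = 0
        for prefix in self.cache:
            if tuple(tokens[:len(prefix)]) == prefix:
                best = max(best, len(prefix))
        return max(0, best - num_computed)

    # Worker side: move KV data between the cache and the GPU.
    def start_load_kv(self, tokens: Sequence[int]) -> None:
        self._require(Role.WORKER)
        self._pending = dict(self.cache.get(tuple(tokens), {}))

    def wait_for_layer_load(self, layer: int) -> None:
        self._require(Role.WORKER)
        self.gpu_layers[layer] = self._pending[layer]  # layer is now usable

    def save_kv_layer(self, tokens: Sequence[int], layer: int, kv: str) -> None:
        self._require(Role.WORKER)
        self.cache.setdefault(tuple(tokens), {})[layer] = kv


cache: Cache = {}
scheduler = ToyConnector(Role.SCHEDULER, cache)
worker = ToyConnector(Role.WORKER, cache)

prompt = [1, 2, 3]
first_hits = scheduler.get_num_new_matched_tokens(prompt, 0)  # cache is empty
for layer in range(2):
    worker.save_kv_layer(prompt, layer, f"kv-L{layer}")

second_hits = scheduler.get_num_new_matched_tokens([1, 2, 3, 4], 0)
worker.start_load_kv(prompt)
for layer in range(2):
    worker.wait_for_layer_load(layer)
```

In this sketch the scheduler instance never touches KV data and the worker instance never makes matching decisions, which mirrors the division of responsibility described above.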
Initialization proceeds in the following order:
- LMCacheConnectorV1Dynamic.__init__ creates LMCacheConnectorV1Impl
- LMCacheConnectorV1Impl loads LMCacheEngineConfig
- LMCacheManager is created (initializes cache engine, storage backends)
- Connector state is initialized (blender if enabled, layer tracking, chunk settings)
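The ordering above can be illustrated with stand-in classes that record each step as it happens. None of these classes are LMCache's: `ToyEngineConfig`, `ToyManager`, `ToyConnectorImpl`, and `ToyConnectorDynamic` are hypothetical, as are the field names and default values. Only the delegation pattern and the order of steps follow the list above.

```python
from dataclasses import dataclass
from typing import List, Optional

events: List[str] = []  # records the order in which steps run


@dataclass
class ToyEngineConfig:  # stand-in for LMCacheEngineConfig
    chunk_size: int = 256          # hypothetical default
    enable_blending: bool = False  # hypothetical flag

    @classmethod
    def load(cls) -> "ToyEngineConfig":
        events.append("config loaded")
        return cls()


class ToyManager:  # stand-in for LMCacheManager
    def __init__(self, config: ToyEngineConfig) -> None:
        self.cache_engine = {"chunk_size": config.chunk_size}
        self.storage_backends = ["cpu"]  # placeholder backend list
        events.append("manager created")


class ToyConnectorImpl:  # stand-in for LMCacheConnectorV1Impl
    def __init__(self) -> None:
        self.config = ToyEngineConfig.load()
        self.manager = ToyManager(self.config)
        # Connector state: optional blender, layer tracking, chunk settings.
        self.blender: Optional[object] = object() if self.config.enable_blending else None
        self.current_layer = 0
        self.chunk_size = self.config.chunk_size
        events.append("connector state initialized")


class ToyConnectorDynamic:  # stand-in for LMCacheConnectorV1Dynamic
    def __init__(self) -> None:
        events.append("entry point constructed")
        self._impl = ToyConnectorImpl()  # all real work is delegated

    @property
    def chunk_size(self) -> int:
        return self._impl.chunk_size


connector = ToyConnectorDynamic()
print(events)
```

Keeping the entry point as a thin wrapper means the serving engine depends only on the connector interface, while the implementation class owns the configuration and the cache stack.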