Workflow:LMCache LMCache P2P KV Cache Sharing
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, KV_Cache, Distributed_Systems |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
End-to-end process for sharing KV caches between multiple vLLM instances using LMCache's peer-to-peer architecture with NIXL transfers and a centralized controller for discovery.
Description
This workflow enables KV cache sharing across multiple vLLM serving instances, so that a prompt processed by one instance can have its cached KV tensors reused by another instance serving the same prompt. It separates the control path from the data path: a centralized LMCache controller handles discovery (which instance holds which cached chunks), while the KV data itself flows directly between instances, peer to peer, via NIXL. This is particularly valuable for multi-instance deployments behind a load balancer, where the same prompt may be routed to different instances across requests.
Usage
Use this workflow when you have multiple vLLM instances serving the same model and want to share cached KV data between them to reduce redundant computation. It is best suited to load-balanced deployments where repeated prompts (e.g., common system prompts, popular document contexts) may be routed to any instance. It requires at least two GPUs and a working NIXL installation.
Execution Steps
Step 1: Launch the LMCache Controller
Start the LMCache controller API server, which provides centralized coordination for the P2P cache sharing cluster. The controller maintains a registry of all connected vLLM worker instances, tracks which cache chunks exist on each worker, and directs cache lookup/transfer operations. Configure the controller's HTTP port (for API access) and ZMQ port (for worker communication).
Key considerations:
- The controller is a lightweight service that manages metadata only, not KV data
- Workers register with the controller via heartbeat messages
- The controller supports full-sync commands for consistent state across workers
- A web UI dashboard is available for monitoring cluster status
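To make the metadata-only role concrete, the following is a minimal in-memory sketch of a registry that tracks worker liveness and chunk ownership. It is not LMCache's controller: `ToyController` and its methods (`heartbeat`, `admit`, `evict`, `lookup`) are hypothetical names chosen for illustration, and the sketch assumes that the first heartbeat doubles as registration.

```python
import time
from collections import defaultdict


class ToyController:
    """Metadata-only registry: tracks which worker holds which chunk hashes."""

    def __init__(self, heartbeat_timeout_s=10.0):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_seen = {}                     # worker_id -> time of last heartbeat
        self.chunk_holders = defaultdict(set)   # chunk_hash -> {worker_id, ...}

    def heartbeat(self, worker_id, now=None):
        # Assumption of this sketch: the first heartbeat doubles as registration.
        self.last_seen[worker_id] = time.monotonic() if now is None else now

    def admit(self, worker_id, chunk_hash):
        # A worker reports that it now stores a chunk.
        self.chunk_holders[chunk_hash].add(worker_id)

    def evict(self, worker_id, chunk_hash):
        # A worker reports that it dropped a chunk.
        self.chunk_holders[chunk_hash].discard(worker_id)

    def live_workers(self, now=None):
        now = time.monotonic() if now is None else now
        return {w for w, t in self.last_seen.items()
                if now - t <= self.heartbeat_timeout_s}

    def lookup(self, chunk_hash, now=None):
        # Only report holders that are still sending heartbeats.
        return self.chunk_holders.get(chunk_hash, set()) & self.live_workers(now)


ctrl = ToyController(heartbeat_timeout_s=10.0)
ctrl.heartbeat("worker-0", now=0.0)
ctrl.heartbeat("worker-1", now=0.0)
ctrl.admit("worker-0", "abc123")
print(ctrl.lookup("abc123", now=5.0))    # worker-0 is still live
ctrl.heartbeat("worker-1", now=12.0)
print(ctrl.lookup("abc123", now=15.0))   # worker-0 has timed out, so no holders
```

Note that the registry stores only identifiers and hashes; no KV tensors pass through it, which is what keeps the controller lightweight.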
Step 2: Configure LMCache for P2P Mode
Create an LMCache configuration file for each vLLM instance that specifies P2P sharing mode. Each config includes the controller's address, the NIXL ports used for direct peer transfers, and a unique worker identity. The configuration enables the P2P storage backend, which handles both local caching and cross-instance retrieval via the controller's chunk registry.
Key considerations:
- Each instance needs unique NIXL init and lookup port pairs
- The controller address must be reachable from all instances
- Set PYTHONHASHSEED=0 for consistent cross-process token hashing
- Local CPU caching can be enabled alongside P2P for a multi-tier setup
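The per-instance uniqueness requirements above can be checked mechanically before anything is launched. The sketch below generates one config dictionary per instance and validates it. Every key name (`instance_id`, `controller_address`, `p2p_init_port`, `p2p_lookup_port`, `local_cpu_cache`) is an illustrative placeholder rather than LMCache's actual schema, and the base port is arbitrary; consult the LMCache configuration reference for the real field names.

```python
import os


def build_p2p_configs(num_instances, controller_host, base_port=8200):
    """Return one config dict per instance with non-overlapping port pairs.

    All key names are illustrative placeholders, not LMCache's schema.
    """
    configs = []
    for i in range(num_instances):
        configs.append({
            "instance_id": f"lmcache_instance_{i}",     # unique worker identity
            "controller_address": controller_host,      # shared by all instances
            "p2p_init_port": base_port + 2 * i,         # unique per instance
            "p2p_lookup_port": base_port + 2 * i + 1,   # unique per instance
            "local_cpu_cache": True,                    # optional local tier
        })
    return configs


def assert_ports_unique(configs):
    """Raise if any two instances would bind the same port."""
    ports = [c[k] for c in configs for k in ("p2p_init_port", "p2p_lookup_port")]
    if len(ports) != len(set(ports)):
        raise ValueError("port collision between instances")


def hash_seed_is_pinned(environ=os.environ):
    # Python randomizes str/bytes hashing per process unless PYTHONHASHSEED is fixed,
    # so every instance should see the same fixed value.
    return environ.get("PYTHONHASHSEED") == "0"


configs = build_p2p_configs(2, "localhost:9000")
assert_ports_unique(configs)
for cfg in configs:
    print(cfg)
```

Deriving both ports from the instance index makes collisions impossible by construction; the explicit check is still useful when configs are edited by hand.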
Step 3: Launch vLLM Instances with LMCache
Start multiple vLLM instances, each on its own GPU, with the LMCache connector and P2P configuration. Each instance initializes its LMCache manager, registers with the controller as a worker, and begins serving requests. The LMCache connector transparently intercepts KV cache operations within the vLLM engine.
Key considerations:
- Each instance operates independently as a full vLLM server
- Worker registration with the controller happens automatically at startup
- Instances can be added or removed from the cluster dynamically
- Health monitoring detects and handles worker failures
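As a rough illustration of one-instance-per-GPU launching, the sketch below only builds the command line and environment for each instance; it does not start anything. Several specifics are assumptions that should be verified against the installed vLLM and LMCache versions: the `LMCACHE_CONFIG_FILE` environment variable, the `LMCacheConnectorV1` connector name, and the `kv_both` role. The model name, HTTP ports, and config file names are placeholders.

```python
import json
import os


def build_launch_spec(gpu_index, http_port, config_path, model="<model-name>"):
    """Return (argv, env) for one vLLM instance; nothing is executed here."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)   # pin the instance to one GPU
    env["PYTHONHASHSEED"] = "0"                    # consistent token hashing
    env["LMCACHE_CONFIG_FILE"] = config_path       # assumed env var name
    # Assumed connector settings; check them against your vLLM/LMCache versions.
    kv_cfg = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
    argv = ["vllm", "serve", model,
            "--port", str(http_port),
            "--kv-transfer-config", json.dumps(kv_cfg)]
    return argv, env


# Two instances, one per GPU, each with its own HTTP port and config file.
specs = [build_launch_spec(i, 8010 + i, f"p2p_instance_{i}.yaml") for i in range(2)]
for argv, env in specs:
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

Each `(argv, env)` pair can then be handed to `subprocess.Popen(argv, env=env)` once the printed commands have been reviewed.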
Step 4: Process Requests with Cross-Instance Cache Reuse
Send inference requests to any instance in the cluster. When a request arrives, LMCache first checks local storage, then queries the controller for matching chunks on remote peers. If a peer has the needed KV cache, it is transferred directly via NIXL. If no cache exists, the request is processed normally and the resulting KV cache is stored locally and registered with the controller for future reuse by any instance.
Key considerations:
- The first instance to process a prompt populates the cache for the entire cluster
- Subsequent requests to any instance benefit from the cached data
- NIXL transfers move KV data directly between GPU memories of different instances
- Cache lookup adds little latency relative to the recomputation it avoids on a hit
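The lookup order described in this step (local storage, then the controller's registry, then normal computation) can be sketched as a toy simulation. Nothing here is LMCache code: `chunk_hashes`, `ToyInstance`, the chunk size of 4, the SHA-256 prefix chaining, and the plain dictionary standing in for the controller are all assumptions made for illustration, and a dictionary copy stands in for the NIXL transfer.

```python
import hashlib


def chunk_hashes(token_ids, chunk_size=4):
    """Prefix-chained hashes over full chunks, so equal prefixes give equal keys."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % chunk_size   # ignore the partial tail
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        digest = hashlib.sha256(prev + repr(chunk).encode()).digest()
        hashes.append(digest.hex())
        prev = digest
    return hashes


class ToyInstance:
    def __init__(self, name, registry, peers):
        self.name = name
        self.local = {}           # chunk_hash -> KV payload (a string stand-in)
        self.registry = registry  # chunk_hash -> name of an instance holding it
        self.peers = peers        # name -> ToyInstance

    def serve(self, token_ids):
        """Return where each chunk's KV came from: 'local', 'peer', or 'computed'."""
        sources = []
        for h in chunk_hashes(token_ids):
            if h in self.local:
                sources.append("local")
            elif h in self.registry and self.registry[h] != self.name:
                # Stand-in for a direct peer-to-peer transfer.
                self.local[h] = self.peers[self.registry[h]].local[h]
                sources.append("peer")
            else:
                self.local[h] = f"kv-for-{h[:8]}"   # stand-in for prefill compute
                self.registry[h] = self.name        # publish for other instances
                sources.append("computed")
        return sources


registry, peers = {}, {}
a = ToyInstance("A", registry, peers)
b = ToyInstance("B", registry, peers)
peers.update({"A": a, "B": b})

prompt = list(range(10))   # two full chunks of 4; the 2-token tail is not cached
print(a.serve(prompt))     # ['computed', 'computed']
print(b.serve(prompt))     # ['peer', 'peer']
print(b.serve(prompt))     # ['local', 'local']
```

Chaining each chunk's hash to its predecessor means a chunk key identifies the whole prefix up to that point, so two prompts share cache entries only for the leading chunks they actually have in common.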
Step 5: Monitor and Manage the Cluster
Use the controller's API and web dashboard to monitor cluster health, inspect cache statistics, and perform management operations. Available operations include looking up cache locations, moving caches between instances, pinning important caches to prevent eviction, clearing caches, and compressing cached data.
Key considerations:
- The controller exposes REST APIs for all management operations
- Cache statistics help identify hot prefixes and optimize allocation
- Pin operations ensure critical caches survive eviction pressure
- The web UI provides real-time visualization of cluster state
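To illustrate how pinning interacts with eviction, here is a small LRU store in which pinned keys are skipped when space is reclaimed. It is a conceptual sketch only: `PinnableLRU` is a hypothetical class, LMCache's actual eviction policy and pin semantics may differ, and the choice to allow temporary overflow when every other entry is pinned is an assumption of this example.

```python
from collections import OrderedDict


class PinnableLRU:
    """LRU store in which pinned keys are skipped during eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # key -> value, least recently used first
        self.pinned = set()

    def pin(self, key):
        if key in self.items:
            self.pinned.add(key)

    def unpin(self, key):
        self.pinned.discard(key)

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)   # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        while len(self.items) > self.capacity:
            # Oldest entry that is neither pinned nor the one just inserted.
            victim = next((k for k in self.items
                           if k not in self.pinned and k != key), None)
            if victim is None:
                break                 # everything else is pinned; allow overflow
            del self.items[victim]

    def clear(self):
        self.items.clear()
        self.pinned.clear()


cache = PinnableLRU(capacity=2)
cache.put("system-prompt", "kv0")
cache.pin("system-prompt")
cache.put("doc-a", "kv1")
cache.put("doc-b", "kv2")    # evicts doc-a, not the pinned entry
print(list(cache.items))     # ['system-prompt', 'doc-b']
```

Without the pin, `system-prompt` would have been the least recently used entry and the first to be evicted.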