Workflow:LMCache LMCache P2P KV Cache Sharing

Knowledge Sources	LMCache LMCache Docs P2P Sharing Docs Share KV Cache Quickstart
Domains	LLM_Serving, KV_Cache, Distributed_Systems
Last Updated	2026-02-09 00:00 GMT

Overview

End-to-end process for sharing KV caches between multiple vLLM instances using LMCache's peer-to-peer architecture with NIXL transfers and a centralized controller for discovery.

Description

This workflow enables KV cache sharing across multiple vLLM serving instances, so that KV tensors computed by one instance for a prompt can be reused by another instance that receives the same prompt. Control and data paths are separated: a centralized LMCache controller handles discovery (which instance holds which cached chunks), while the KV data itself flows directly between instances, peer to peer, via NIXL. This is particularly valuable for multi-instance deployments behind a load balancer, where the same prompt may be routed to different instances across requests.

Usage

Execute this workflow when you have multiple vLLM instances serving the same model and want to share cached KV data between them to reduce redundant computation. It is ideal for load-balanced deployments where repeated prompts (e.g., common system prompts, popular document contexts) may be routed to any instance. It requires at least two GPUs and a working NIXL installation.

Execution Steps

Step 1: Launch the LMCache Controller

Start the LMCache controller API server, which provides centralized coordination for the P2P cache sharing cluster. The controller maintains a registry of all connected vLLM worker instances, tracks which cache chunks exist on each worker, and directs cache lookup/transfer operations. Configure the controller's HTTP port (for API access) and ZMQ port (for worker communication).

Key considerations:

The controller is a lightweight service that manages metadata only, not KV data
Workers register with the controller via heartbeat messages
The controller supports full-sync commands for consistent state across workers
A web UI dashboard is available for monitoring cluster status
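The controller's role as a metadata-only registry can be sketched as follows. This is a toy in-memory model, not the LMCache controller: the class and method names (ToyController, heartbeat, admit, full_sync, lookup) and the timeout value are hypothetical and chosen only for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class WorkerInfo:
    """Metadata kept for one registered worker (no KV tensors)."""
    address: str                 # where peers can reach this worker
    last_heartbeat: float        # timestamp of the most recent heartbeat
    chunk_keys: Set[str] = field(default_factory=set)


class ToyController:
    """Hypothetical registry: tracks which worker holds which chunk key."""

    def __init__(self, heartbeat_timeout_s: float = 10.0) -> None:
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.workers: Dict[str, WorkerInfo] = {}

    def heartbeat(self, worker_id: str, address: str,
                  now: Optional[float] = None) -> None:
        """Register a new worker or refresh an existing one."""
        now = time.monotonic() if now is None else now
        info = self.workers.get(worker_id)
        if info is None:
            self.workers[worker_id] = WorkerInfo(address, now)
        else:
            info.address, info.last_heartbeat = address, now

    def admit(self, worker_id: str, chunk_key: str) -> None:
        """Record that a worker now holds a chunk."""
        self.workers[worker_id].chunk_keys.add(chunk_key)

    def full_sync(self, worker_id: str, chunk_keys: Set[str]) -> None:
        """Replace the controller's view of a worker with the worker's own."""
        self.workers[worker_id].chunk_keys = set(chunk_keys)

    def lookup(self, chunk_key: str, now: Optional[float] = None) -> List[str]:
        """Return IDs of live workers that hold the chunk, sorted by ID."""
        now = time.monotonic() if now is None else now
        return sorted(
            wid for wid, info in self.workers.items()
            if chunk_key in info.chunk_keys
            and now - info.last_heartbeat <= self.heartbeat_timeout_s
        )
```

In this sketch a worker that stops sending heartbeats simply drops out of lookup results once the timeout passes, which is one simple way to keep stale locations from being handed to peers.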

Step 2: Configure LMCache for P2P Mode

Create an LMCache configuration file for each vLLM instance that enables P2P sharing mode. Each config includes the controller's address, the NIXL ports used for direct peer transfers, and a unique worker identity. The configuration enables the P2P storage backend, which handles both local caching and cross-instance retrieval via the controller's chunk registry.

Key considerations:

Each instance needs unique NIXL init and lookup port pairs
The controller address must be reachable from all instances
Set PYTHONHASHSEED=0 for consistent cross-process token hashing
Local CPU caching can be enabled alongside P2P for a multi-tier setup
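One way to avoid port collisions is to generate the per-instance settings from a single base port. The sketch below assumes all instances run on one host; the dictionary keys (worker_id, controller_address, nixl_init_port, nixl_lookup_port, local_cpu) are illustrative placeholders, not the LMCache configuration schema, and should be mapped to the key names in the LMCache P2P documentation.

```python
from typing import Dict, List


def make_p2p_configs(num_instances: int, controller_address: str,
                     base_port: int = 8200,
                     enable_local_cpu: bool = True) -> List[Dict]:
    """Build one settings dict per instance with a distinct port pair."""
    configs = []
    for i in range(num_instances):
        configs.append({
            "worker_id": f"lmcache_instance_{i}",     # unique identity
            "controller_address": controller_address,  # same for all instances
            "nixl_init_port": base_port + 2 * i,       # hypothetical key name
            "nixl_lookup_port": base_port + 2 * i + 1,  # hypothetical key name
            "local_cpu": enable_local_cpu,             # optional local tier
        })
    return configs


def check_unique_ports(configs: List[Dict]) -> None:
    """Raise ValueError if two instances would bind the same port."""
    seen: Dict[int, str] = {}
    for cfg in configs:
        for key in ("nixl_init_port", "nixl_lookup_port"):
            port = cfg[key]
            if port in seen:
                raise ValueError(
                    f"port {port} used by {seen[port]} and {cfg['worker_id']}"
                )
            seen[port] = cfg["worker_id"]
```

Running check_unique_ports before writing the files catches copy-paste mistakes early; on multi-host deployments the check only matters among instances that share a host.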

Step 3: Launch vLLM Instances with LMCache

Start multiple vLLM instances, each on its own GPU, with the LMCache connector and P2P configuration. Each instance initializes its LMCache manager, registers with the controller as a worker, and begins serving requests. The LMCache connector transparently intercepts KV cache operations within the vLLM engine.

Key considerations:

Each instance operates independently as a full vLLM server
Worker registration with the controller happens automatically at startup
Instances can be added or removed from the cluster dynamically
Health monitoring detects and handles worker failures
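The launch step can be sketched as a small helper that assembles the environment and command line for each instance without running anything. The environment variable LMCACHE_CONFIG_FILE, the connector name LMCacheConnectorV1, and the kv_role value are assumptions based on commonly published LMCache and vLLM examples and should be checked against the versions in use; the model name and file paths are placeholders.

```python
import json
import shlex
from typing import Dict, List, Tuple


def build_launch_spec(gpu_index: int, http_port: int, config_path: str,
                      model: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment, argv) for one vLLM instance."""
    env = {
        "CUDA_VISIBLE_DEVICES": str(gpu_index),  # one GPU per instance
        "PYTHONHASHSEED": "0",                   # consistent token hashing
        "LMCACHE_CONFIG_FILE": config_path,      # assumed variable name
    }
    # Assumed connector settings; verify against your vLLM/LMCache versions.
    kv_transfer = {"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
    argv = [
        "vllm", "serve", model,
        "--port", str(http_port),
        "--kv-transfer-config", json.dumps(kv_transfer),
    ]
    return env, argv


def as_shell_line(env: Dict[str, str], argv: List[str]) -> str:
    """Render the spec as a single copy-pasteable shell command."""
    prefix = " ".join(f"{k}={shlex.quote(v)}" for k, v in env.items())
    return f"{prefix} {shlex.join(argv)}"
```

To start the process from Python instead of a shell, the same pair can be passed to subprocess.Popen(argv, env={**os.environ, **env}).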

Step 4: Process Requests with Cross-Instance Cache Reuse

Send inference requests to any instance in the cluster. When a request arrives, LMCache first checks local storage, then queries the controller for matching chunks on remote peers. If a peer has the needed KV cache, it is transferred directly via NIXL. If no cache exists, the request is processed normally and the resulting KV cache is stored locally and registered with the controller for future reuse by any instance.

Key considerations:

The first instance to process a prompt populates the cache for the entire cluster
Subsequent requests to any instance benefit from the cached data
NIXL transfers move KV data directly between GPU memories of different instances
Cache lookup adds little latency relative to the prefill computation it avoids
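The lookup order described in this step (local storage, then the controller, then recomputation) can be modeled with a toy simulation. Everything here is a stand-in: a shared dict plays the controller's chunk registry, a direct dictionary copy plays the NIXL transfer, and the chunk size and SHA-256 prefix-chained keys are assumptions for illustration rather than LMCache's actual hashing scheme.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # tokens per chunk; deliberately tiny for the example


def chunk_keys(token_ids: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Key each full chunk by a hash of itself plus the previous chunk's key."""
    keys, prev = [], b""
    for start in range(0, len(token_ids) - chunk_size + 1, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        digest = hashlib.sha256(prev + repr(chunk).encode()).hexdigest()
        keys.append(digest)
        prev = digest.encode()
    return keys


class ToyInstance:
    """Hypothetical serving instance with a local cache and peer access."""

    def __init__(self, name: str, registry: Dict[str, str],
                 peers: Dict[str, "ToyInstance"]) -> None:
        self.name = name
        self.local: Dict[str, str] = {}  # chunk key -> stand-in for KV tensors
        self.registry = registry         # stand-in for the controller
        self.peers = peers               # stand-in for NIXL connections
        peers[name] = self

    def serve(self, token_ids: List[int]) -> List[str]:
        """Return, for each chunk, where its KV data came from."""
        sources = []
        for key in chunk_keys(token_ids):
            if key in self.local:
                sources.append("local")
            elif key in self.registry:
                owner = self.registry[key]
                # Stand-in for a direct peer-to-peer transfer.
                self.local[key] = self.peers[owner].local[key]
                sources.append(f"peer:{owner}")
            else:
                # Stand-in for normal prefill, then registration for reuse.
                self.local[key] = f"kv({key[:8]})"
                self.registry[key] = self.name
                sources.append("computed")
        return sources
```

With two instances sharing one registry, the first request for a prompt reports every chunk as computed, the same prompt on the other instance reports peer transfers, and a later repeat on that instance is served locally. A prompt that shares only its first chunk reuses that chunk and computes the rest.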

Step 5: Monitor and Manage the Cluster

Use the controller's API and web dashboard to monitor cluster health, inspect cache statistics, and perform management operations. Available operations include looking up cache locations, moving caches between instances, pinning important caches to prevent eviction, clearing caches, and compressing cached data.

Key considerations:

The controller exposes REST APIs for all management operations
Cache statistics help identify hot prefixes and optimize allocation
Pin operations ensure critical caches survive eviction pressure
The web UI provides real-time visualization of cluster state
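How pinning interacts with eviction can be illustrated with a small pin-aware LRU cache. This is a generic sketch under the assumption of least-recently-used eviction; the source does not state which eviction policy LMCache applies, and the class name PinnableLRU is hypothetical.

```python
from collections import OrderedDict
from typing import Any, Optional


class PinnableLRU:
    """LRU cache in which pinned keys are skipped during eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: "OrderedDict[str, Any]" = OrderedDict()
        self.pinned = set()

    def put(self, key: str, value: Any) -> None:
        """Insert or refresh an entry, then evict if over capacity."""
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        self._evict()

    def get(self, key: str) -> Optional[Any]:
        """Return the value and mark it as recently used, or None on a miss."""
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)
        return self.entries[key]

    def pin(self, key: str) -> None:
        """Protect an existing entry from eviction."""
        if key not in self.entries:
            raise KeyError(key)
        self.pinned.add(key)

    def unpin(self, key: str) -> None:
        """Make an entry evictable again."""
        self.pinned.discard(key)

    def _evict(self) -> None:
        # Drop the least recently used unpinned entries until within capacity.
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:
                break  # everything is pinned; tolerate the overflow
            del self.entries[victim]
```

In this model a pinned system-prompt chunk survives even when it is the oldest entry, at the cost of leaving less room for everything else, which is why pinning is best reserved for a small set of hot prefixes.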
