Workflow:LMCache LMCache KV Cache Offloading
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, KV_Cache, Inference_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
End-to-end process for offloading KV caches from GPU memory to CPU RAM or local disk within a single vLLM instance to reduce Time-To-First-Token (TTFT) for requests with shared prefixes.
Description
This workflow demonstrates the fundamental LMCache use case: KV cache reuse within a single vLLM serving instance. When multiple requests share common prompt prefixes (e.g., system prompts, document contexts in RAG), LMCache stores the computed KV cache tensors in CPU memory or on local disk after the first request. Subsequent requests with the same prefix retrieve the cached tensors instead of recomputing them, which cuts both TTFT and the GPU compute spent on prefill. The process covers environment configuration, vLLM engine initialization with the LMCache connector, prompt construction with shared prefixes, and cache-accelerated generation.
Usage
Execute this workflow when you have a vLLM-based LLM serving deployment and want to reduce TTFT for workloads with repeated prompt prefixes, such as multi-round question-answering, RAG pipelines with shared document contexts, or chat sessions with long system prompts. This is the starting point for all LMCache deployments and requires only a single GPU.
Execution Steps
Step 1: Configure LMCache Environment
Set up LMCache parameters via environment variables or a YAML configuration file. Key parameters include the chunk size (number of tokens per cache chunk, typically 256), the storage backend selection (CPU RAM or local disk), and memory limits for each storage tier. For disk-based storage, specify a file URI path. These settings control how KV cache tensors are chunked, stored, and retrieved.
Key considerations:
- Chunk size affects cache granularity -- smaller chunks enable finer-grained reuse but increase metadata overhead
- CPU backend is faster but limited by available system RAM
- Disk backend supports larger capacity but has higher latency
- Memory limits prevent unbounded growth of cached data
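A minimal sketch of the environment-variable route. It assumes the `LMCACHE_*` variable names that appear in LMCache's published examples (`LMCACHE_CHUNK_SIZE`, `LMCACHE_LOCAL_CPU`, `LMCACHE_MAX_LOCAL_CPU_SIZE`, `LMCACHE_LOCAL_DISK`, `LMCACHE_MAX_LOCAL_DISK_SIZE`); check them against the installed version. The helper `lmcache_env`, the path `/tmp/lmcache_disk`, and the size values are illustrative only.

```python
import os
from typing import Dict, Optional


def lmcache_env(
    chunk_size: int = 256,
    cpu_gb: float = 5.0,
    disk_path: Optional[str] = None,
    disk_gb: float = 0.0,
) -> Dict[str, str]:
    """Build LMCache settings as environment-variable strings.

    Variable names are assumed from LMCache's examples; sizes are in GB.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    env = {
        "LMCACHE_CHUNK_SIZE": str(chunk_size),
        "LMCACHE_LOCAL_CPU": "True",
        "LMCACHE_MAX_LOCAL_CPU_SIZE": str(cpu_gb),
    }
    if disk_path is not None:
        # The disk tier is addressed with a file URI.
        env["LMCACHE_LOCAL_DISK"] = "file://" + disk_path
        env["LMCACHE_MAX_LOCAL_DISK_SIZE"] = str(disk_gb)
    return env


# Hypothetical values: 5 GB of CPU RAM plus a 10 GB disk tier.
settings = lmcache_env(cpu_gb=5.0, disk_path="/tmp/lmcache_disk", disk_gb=10.0)
os.environ.update(settings)
```

The variables should be set before the vLLM engine is created, since the connector reads its configuration at startup.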
Step 2: Initialize vLLM with LMCache Connector
Create a vLLM LLM engine instance with the LMCache KV transfer connector enabled. This involves constructing a KVTransferConfig that specifies the LMCacheConnectorV1 as the connector class and sets the role to "kv_both" (meaning the instance both stores and retrieves KV caches). Configure model parameters including maximum sequence length and GPU memory utilization based on available hardware.
Key considerations:
- Use LMCacheConnectorV1 for vLLM v1 (the current recommended version)
- The "kv_both" role enables bidirectional cache operations within a single instance
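- The GPU memory utilization setting is the fraction of GPU memory vLLM may claim for model weights, activations, and its KV cache; keep it below 1.0 to leave headroom for other allocations on the device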
- The connector integrates transparently with vLLM's inference pipeline
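The initialization described above might look like the following sketch. It assumes vLLM's `LLM` class and `vllm.config.KVTransferConfig` with the `kv_connector` and `kv_role` fields; the model name is a placeholder, and the defaults of 8000 tokens and 0.8 utilization are example values, not recommendations. The vLLM imports are deferred into `build_llm` so the settings can be inspected on a machine without a GPU.

```python
from typing import Any, Dict


def engine_kwargs(model: str, max_model_len: int = 8000,
                  gpu_memory_utilization: float = 0.8) -> Dict[str, Any]:
    """Collect the engine arguments without importing vLLM."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return {
        "model": model,
        "max_model_len": max_model_len,
        "gpu_memory_utilization": gpu_memory_utilization,
        "kv_transfer": {
            "kv_connector": "LMCacheConnectorV1",
            "kv_role": "kv_both",  # this instance both stores and retrieves
        },
    }


def build_llm(kwargs: Dict[str, Any]):
    """Create the vLLM engine; requires vLLM, LMCache, and a GPU."""
    # Imported lazily so the configuration is usable without vLLM installed.
    from vllm import LLM
    from vllm.config import KVTransferConfig

    params = dict(kwargs)
    ktc = KVTransferConfig(**params.pop("kv_transfer"))
    return LLM(kv_transfer_config=ktc, **params)


# Placeholder model name; substitute the model being served.
kwargs = engine_kwargs("mistralai/Mistral-7B-Instruct-v0.2")
# llm = build_llm(kwargs)  # run on a GPU host with vLLM and LMCache installed
```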
Step 3: Process Initial Request and Populate Cache
Send the first inference request containing the shared prefix text. During generation, LMCache intercepts the computed KV cache tensors from the GPU, chunks them according to the configured chunk size, and stores them asynchronously to the configured storage backend (CPU RAM or disk). The prefix token hashes serve as cache keys for later retrieval.
Key considerations:
- The first request runs at normal speed with no cache benefit
- KV cache storage happens asynchronously to minimize impact on generation latency
- Cache keys are prefix hashes: each chunk's key is derived from its own tokens and every token before it, so identical prompt prefixes produce the same keys regardless of what follows them
- Cache population logs show the number of tokens stored for verification
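To illustrate how chunking and prefix hashing interact, here is a simplified sketch. It is not LMCache's actual implementation: the function `chunk_keys`, the use of SHA-256, and the choice to drop a trailing partial chunk are assumptions made for illustration.

```python
import hashlib
from typing import List, Sequence


def chunk_keys(token_ids: Sequence[int], chunk_size: int = 256) -> List[str]:
    """Return one key per full chunk; each key covers the whole prefix so far."""
    keys: List[str] = []
    prev = ""  # hash of the preceding prefix, empty for the first chunk
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        # Chaining the previous hash makes the key depend on all earlier tokens.
        payload = prev + "|" + ",".join(map(str, chunk))
        prev = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        keys.append(prev)
    return keys


# Two prompts that share an 8-token prefix and differ in the last chunk.
shared = list(range(8))
a = chunk_keys(shared + [100, 101, 102, 103], chunk_size=4)
b = chunk_keys(shared + [200, 201, 202, 203], chunk_size=4)
```

Here `a[:2] == b[:2]` because the first two chunks share a prefix, while `a[2] != b[2]`. The same four tokens placed after a different prefix yield a different key, which matches the fact that a chunk's KV tensors depend on everything before it.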
Step 4: Serve Subsequent Requests with Cache Hits
When new requests arrive with the same prefix, LMCache's lookup client checks the token database for matching cache entries before prefill begins. Matched prefix chunks are retrieved from storage and loaded directly into the GPU KV cache, skipping recomputation. Only the novel suffix tokens require full prefill, resulting in significantly reduced TTFT.
Key considerations:
- Cache hits are logged showing how many tokens were retrieved versus total tokens
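- The speedup grows with the fraction of the prompt that matches cached prefixes; it is not strictly proportional, because retrieved chunks must still be copied from CPU RAM or disk to the GPU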
- Cache retrieval supports layerwise operations for memory efficiency
- Multiple requests can benefit from the same cached prefix simultaneously
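The lookup step can be pictured with the toy model below, which walks a prompt chunk by chunk and stops at the first miss. The dictionary `store` stands in for the storage backend, and `populate`, `split_prompt`, and the hashing scheme are hypothetical names and choices, not LMCache's API.

```python
import hashlib
from typing import Dict, Sequence, Tuple


def _next_key(prev: str, chunk: Sequence[int]) -> str:
    """Chain the previous prefix hash with the current chunk's tokens."""
    payload = prev + "|" + ",".join(map(str, chunk))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def populate(token_ids: Sequence[int], store: Dict[str, bytes],
             chunk_size: int = 256) -> None:
    """Record every full chunk of a served prompt in the store."""
    prev = ""
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        prev = _next_key(prev, token_ids[start:start + chunk_size])
        store[prev] = b""  # placeholder for the chunk's KV tensors


def split_prompt(token_ids: Sequence[int], store: Dict[str, bytes],
                 chunk_size: int = 256) -> Tuple[int, int]:
    """Return (tokens served from cache, tokens left to prefill)."""
    hit, prev = 0, ""
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        prev = _next_key(prev, token_ids[start:start + chunk_size])
        if prev not in store:
            break  # a miss ends the reusable prefix
        hit = start + chunk_size
    return hit, len(token_ids) - hit


prefix = list(range(1000, 1008))
store: Dict[str, bytes] = {}
populate(prefix + [1, 2], store, chunk_size=4)              # first request
warm = split_prompt(prefix + [7, 7, 7], store, chunk_size=4)  # second request
cold = split_prompt(prefix + [7, 7, 7], {}, chunk_size=4)     # empty cache
```

With the warm store, 8 of the 11 tokens come from cache and only 3 need prefill (`warm == (8, 3)`); with an empty store all 11 are recomputed (`cold == (0, 11)`).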
Step 5: Clean Up Resources
After serving is complete, destroy the LMCache engine instance to release CPU/disk resources and clean up any temporary storage. This ensures proper resource management in production deployments.
Key considerations:
- Call LMCacheEngineBuilder.destroy() with the engine's instance name for explicit cleanup
- Context managers in the example code handle cleanup automatically on exit
- Disk-based caches can optionally be preserved across restarts for persistent caching
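One way to guarantee cleanup is a small context manager, sketched below. The wrapper `managed_engine` is hypothetical, and the import paths inside `destroy_lmcache` (`lmcache.v1.cache_engine.LMCacheEngineBuilder` and `lmcache.integration.vllm.utils.ENGINE_NAME`) are assumed from LMCache's vLLM examples and may differ between versions.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


@contextmanager
def managed_engine(build: Callable[[], T],
                   destroy: Callable[[], None]) -> Iterator[T]:
    """Yield an engine and always run the cleanup callback afterwards."""
    engine = build()
    try:
        yield engine
    finally:
        destroy()  # runs on normal exit and on exceptions


def destroy_lmcache() -> None:
    """Cleanup callback; import paths are assumptions, verify before use."""
    from lmcache.integration.vllm.utils import ENGINE_NAME
    from lmcache.v1.cache_engine import LMCacheEngineBuilder

    LMCacheEngineBuilder.destroy(ENGINE_NAME)


# Stand-in callbacks so the pattern can be exercised without a GPU.
events: List[str] = []
with managed_engine(lambda: "engine", lambda: events.append("destroyed")) as eng:
    events.append("serving with " + eng)
```

In a real deployment, `build` would construct the vLLM engine and `destroy` would be `destroy_lmcache`.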