Principle: LMCache KV Cache Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Caching, Inference_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A prefix-matching retrieval strategy that loads cached KV tensors from storage into GPU memory to skip redundant computation during inference.
Description
KV Cache Retrieval is the inverse of storage: given a new request's token sequence, find which prefix chunks are already cached, load their KV tensors from storage, and inject them into the serving engine's GPU KV buffer. This directly reduces Time-To-First-Token (TTFT) by avoiding redundant prefill computation for cached prefix tokens.
The retrieval flow (see the code sketch after this list):
- Token database processes the new request's tokens to identify chunk keys
- Storage manager checks which chunks exist in local CPU, disk, or remote backends
- Found memory objects are loaded and reordered for GPU injection
- GPU connector writes the KV tensors into vLLM's paged GPU KV buffer
- A boolean mask is returned indicating which token positions were loaded from cache
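A minimal sketch of this flow is below. It assumes illustrative `token_database`, `storage_manager`, and `gpu_connector` objects with the interfaces implied by the steps above; the names and method signatures are assumptions, not the exact LMCache API.

```python
import torch

def retrieve_kv(tokens, token_database, storage_manager, gpu_connector):
    """Sketch of retrieval: load the longest cached prefix into the GPU KV buffer."""
    ret_mask = torch.zeros(len(tokens), dtype=torch.bool)
    hits = []

    # Map the token sequence to chunk keys and probe local CPU / disk / remote backends.
    for start, end, key in token_database.process_tokens(tokens):
        memory_obj = storage_manager.get(key)
        if memory_obj is None:
            break                                # first miss ends the usable prefix
        hits.append((start, end, memory_obj))
        ret_mask[start:end] = True

    # Inject the retrieved KV tensors into the engine's paged GPU KV buffer.
    for start, end, memory_obj in hits:
        gpu_connector.to_gpu(memory_obj, start, end)

    # The mask tells the engine which token positions were loaded from cache.
    return ret_mask
```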
Usage
Use this principle to reduce TTFT for requests that share common prefixes (system prompts, few-shot examples, repeated context). It is triggered automatically by the vLLM connector before each inference forward pass via start_load_kv.
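For intuition, the toy calculation below shows how much prefill work a shared system prompt lets the cache skip. The whitespace tokenizer and token counts are illustrative assumptions, not measurements of any real deployment.

```python
def tokenize(text):
    # Toy whitespace tokenizer, used only for this illustration.
    return text.split()

SYSTEM_PROMPT = "You are a careful assistant. Answer concisely. " * 200  # long shared prefix

req_a = tokenize(SYSTEM_PROMPT) + tokenize("Summarize document A.")
req_b = tokenize(SYSTEM_PROMPT) + tokenize("Summarize document B.")

shared = len(tokenize(SYSTEM_PROMPT))
print(f"prefill work skipped for req_b if the prefix is cached: {shared / len(req_b):.1%}")
# On a cold cache, start_load_kv returns an all-False mask and the full prefill runs.
# Once req_a's KV chunks are stored, req_b's call loads the shared-prefix chunks and
# prefill only computes the short suffix, which is where the TTFT reduction comes from.
```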
Theoretical Basis
Retrieval uses longest prefix matching:
# Pseudocode for prefix-based retrieval
ret_mask = zeros(len(tokens), dtype=bool)
reordered_chunks = []
for start, end, key in token_database.process_tokens(tokens):
    memory_obj = storage_manager.get(key)
    if memory_obj is not None:
        ret_mask[start:end] = True
        reordered_chunks.append((key, memory_obj, start, end))
    else:
        break  # Prefix chain broken - stop at the first miss
The prefix-chain property means a chunk's cached KV is usable only when every preceding chunk is also cached: each chunk's KV tensors were computed with all earlier tokens in its attention context, so the scan stops at the first miss rather than loading later chunks.
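As a toy demonstration (the chunk size and key scheme here are made up for the sketch), a miss on the middle chunk stops retrieval even though a later chunk is present in the store:

```python
CHUNK = 4
tokens = list(range(12))                               # chunks: [0,4), [4,8), [8,12)
store = {("chunk", 0): "kv0", ("chunk", 2): "kv2"}     # middle chunk ("chunk", 1) missing

def chunk_keys(tokens):
    # Illustrative key scheme: one key per fixed-size chunk.
    for i in range(0, len(tokens), CHUNK):
        yield i, min(i + CHUNK, len(tokens)), ("chunk", i // CHUNK)

ret_mask = [False] * len(tokens)
for start, end, key in chunk_keys(tokens):
    if key not in store:
        break                                          # stop at the first miss
    ret_mask[start:end] = [True] * (end - start)

print(ret_mask)   # True only for tokens 0-3; ("chunk", 2) is ignored despite being stored
```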