Principle: LMCache KV Cache Retrieval
| Knowledge Sources | |
|---|---|
| Domains | Caching, Inference_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
A prefix-matching retrieval strategy that loads cached KV tensors from storage into GPU memory to skip redundant computation during inference.
Description
KV Cache Retrieval is the inverse of storage: given a new request's token sequence, find which prefix chunks are already cached, load their KV tensors from storage, and inject them into the serving engine's GPU KV buffer. This directly reduces Time-To-First-Token (TTFT) by avoiding redundant prefill computation for cached prefix tokens.
The retrieval flow (see the code sketch after this list):
- Token database processes the new request's tokens to identify chunk keys
- Storage manager checks which chunks exist in local CPU, disk, or remote backends
- Found memory objects are loaded and reordered for GPU injection
- GPU connector writes the KV tensors into vLLM's paged GPU KV buffer
- A boolean mask is returned indicating which token positions were loaded from cache
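A minimal sketch of this flow is below. It assumes illustrative `token_database`, `storage_manager`, and `gpu_connector` objects with the interfaces implied by the steps above; the names and method signatures are assumptions, not the exact LMCache API.

```python
import torch

def retrieve_kv(tokens, token_database, storage_manager, gpu_connector):
    """Sketch of retrieval: load the longest cached prefix into the GPU KV buffer."""
    ret_mask = torch.zeros(len(tokens), dtype=torch.bool)
    hits = []

    # Map the token sequence to chunk keys and probe local CPU / disk / remote backends.
    for start, end, key in token_database.process_tokens(tokens):
        memory_obj = storage_manager.get(key)
        if memory_obj is None:
            break                                # first miss ends the usable prefix
        hits.append((start, end, memory_obj))
        ret_mask[start:end] = True

    # Inject the retrieved KV tensors into the engine's paged GPU KV buffer.
    for start, end, memory_obj in hits:
        gpu_connector.to_gpu(memory_obj, start, end)

    # The mask tells the engine which token positions were loaded from cache.
    return ret_mask
```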
Usage
Use this principle to reduce TTFT for requests that share common prefixes (system prompts, few-shot examples, repeated context). It is triggered automatically by the vLLM connector before each inference forward pass via start_load_kv.
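For intuition, the toy calculation below shows how much prefill work a shared system prompt lets the cache skip. The whitespace tokenizer and token counts are illustrative assumptions, not measurements of any real deployment.

```python
def tokenize(text):
    # Toy whitespace tokenizer, used only for this illustration.
    return text.split()

SYSTEM_PROMPT = "You are a careful assistant. Answer concisely. " * 200  # long shared prefix

req_a = tokenize(SYSTEM_PROMPT) + tokenize("Summarize document A.")
req_b = tokenize(SYSTEM_PROMPT) + tokenize("Summarize document B.")

shared = len(tokenize(SYSTEM_PROMPT))
print(f"prefill work skipped for req_b if the prefix is cached: {shared / len(req_b):.1%}")
# On a cold cache, start_load_kv returns an all-False mask and the full prefill runs.
# Once req_a's KV chunks are stored, req_b's call loads the shared-prefix chunks and
# prefill only computes the short suffix, which is where the TTFT reduction comes from.
```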
Theoretical Basis
Retrieval uses longest prefix matching:
# Pseudocode for prefix-based retrieval
ret_mask = zeros(len(tokens), dtype=bool)
reordered_chunks = []
for start, end, key in token_database.process_tokens(tokens):
    memory_obj = storage_manager.get(key)
    if memory_obj is not None:
        ret_mask[start:end] = True
        reordered_chunks.append((key, memory_obj, start, end))
    else:
        break  # Prefix chain broken - stop at the first miss
The prefix-chain property means a chunk's cached KV is usable only when every preceding chunk is also cached: each chunk's KV tensors were computed with all earlier tokens in its attention context, so the scan stops at the first miss rather than loading later chunks.
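As a toy demonstration (the chunk size and key scheme here are made up for the sketch), a miss on the middle chunk stops retrieval even though a later chunk is present in the store:

```python
CHUNK = 4
tokens = list(range(12))                               # chunks: [0,4), [4,8), [8,12)
store = {("chunk", 0): "kv0", ("chunk", 2): "kv2"}     # middle chunk ("chunk", 1) missing

def chunk_keys(tokens):
    # Illustrative key scheme: one key per fixed-size chunk.
    for i in range(0, len(tokens), CHUNK):
        yield i, min(i + CHUNK, len(tokens)), ("chunk", i // CHUNK)

ret_mask = [False] * len(tokens)
for start, end, key in chunk_keys(tokens):
    if key not in store:
        break                                          # stop at the first miss
    ret_mask[start:end] = [True] * (end - start)

print(ret_mask)   # True only for tokens 0-3; ("chunk", 2) is ignored despite being stored
```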