Principle:LMCache KV Cache Retrieval

From Leeroopedia


Knowledge Sources
Domains Caching, Inference_Optimization
Last Updated 2026-02-09 00:00 GMT

Overview

A prefix-matching retrieval strategy that loads cached KV tensors from storage into GPU memory to skip redundant computation during inference.

Description

KV Cache Retrieval is the inverse of storage: given a new request's token sequence, find which prefix chunks are already cached, load their KV tensors from storage, and inject them into the serving engine's GPU KV buffer. This directly reduces Time-To-First-Token (TTFT) by avoiding redundant prefill computation for cached prefix tokens.
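The TTFT benefit can be estimated with a back-of-the-envelope calculation. The sketch below is illustrative, not LMCache code; it assumes prefill cost is roughly proportional to the number of tokens that still must be computed, and ignores the time spent loading cached tensors from storage.

```python
def prefill_savings(total_tokens: int, cached_prefix_tokens: int) -> float:
    """Fraction of prefill compute skipped when a prefix hit occurs.

    Assumes prefill cost scales linearly with uncached tokens; real
    savings also depend on storage load bandwidth.
    """
    if total_tokens == 0:
        return 0.0
    cached = min(cached_prefix_tokens, total_tokens)
    return cached / total_tokens

# Example: a 4096-token prompt whose first 3072 tokens (system prompt
# plus few-shot examples) are cached skips 75% of prefill compute.
print(prefill_savings(4096, 3072))  # 0.75
```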

The retrieval flow:

  1. Token database processes the new request's tokens to identify chunk keys
  2. Storage manager checks which chunks exist in local CPU, disk, or remote backends
  3. Found memory objects are loaded and reordered for GPU injection
  4. GPU connector writes the KV tensors into vLLM's paged GPU KV buffer
  5. A boolean mask is returned indicating which token positions were loaded from cache
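
The five steps above can be sketched as follows. This is an illustrative Python sketch, not the real LMCache/vLLM API: `CHUNK`, `retrieve`, `store`, and `gpu_buffer` are hypothetical stand-ins, and chunk keys are modeled as hashes over the whole prefix.

```python
# Illustrative sketch of the retrieval flow; all names are hypothetical
# stand-ins, not the real LMCache/vLLM APIs.
CHUNK = 4  # tokens per chunk (the actual chunk size is configurable)

def retrieve(tokens, store, gpu_buffer):
    """Return a boolean mask of token positions loaded from cache."""
    mask = [False] * len(tokens)
    loaded = []
    # Steps 1-2: enumerate prefix-chunk keys and probe the storage tiers.
    for start in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        end = start + CHUNK
        key = hash(tuple(tokens[:end]))    # key covers the whole prefix
        kv = store.get(key)                # local CPU / disk / remote lookup
        if kv is None:
            break                          # prefix chain broken: stop
        loaded.append((start, end, kv))    # Step 3: collect in prefix order
        mask[start:end] = [True] * CHUNK   # Step 5: mark cached positions
    # Step 4: write the loaded KV tensors into the paged GPU KV buffer.
    for start, end, kv in loaded:
        gpu_buffer[start:end] = kv
    return mask

# Usage: a 12-token prompt with the first two chunks cached.
prompt = list(range(12))
store = {hash(tuple(prompt[:4])): ["kv"] * 4,
         hash(tuple(prompt[:8])): ["kv"] * 4}
gpu = [None] * 12
mask = retrieve(prompt, store, gpu)  # first 8 positions hit, last 4 miss
```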

Usage

Use this principle to reduce TTFT for requests that share common prefixes (system prompts, few-shot examples, repeated context). It is triggered automatically by the vLLM connector before each inference forward pass via start_load_kv.

Theoretical Basis

Retrieval uses longest prefix matching:

# Pseudocode for prefix-based retrieval
ret_mask = zeros(len(tokens), dtype=bool)
reordered_chunks = []
for start, end, key in token_database.process_tokens(tokens):
    memory_obj = storage_manager.get(key)
    if memory_obj is not None:
        ret_mask[start:end] = True
        reordered_chunks.append((key, memory_obj, start, end))
    else:
        break  # Prefix chain broken - stop at the first miss

The prefix-chain property ensures that chunks are only valid if all preceding chunks are also cached.
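One way to see why the chain breaks at the first miss is a chained keying scheme, where each chunk's key incorporates the hash of everything before it. The sketch below is hypothetical (LMCache's actual key derivation may differ); it shows that once two prompts diverge, every subsequent key differs, so later chunks can never match in isolation.

```python
import hashlib

def chain_keys(tokens, chunk=4):
    """Derive one key per chunk, each hashing the full prefix so far.

    Illustrative only: key i is meaningful only if chunks 0..i-1 were
    hashed identically, which encodes the prefix-chain property.
    """
    keys, h = [], hashlib.sha256()
    for start in range(0, len(tokens) - len(tokens) % chunk, chunk):
        h.update(str(tokens[start:start + chunk]).encode("utf-8"))
        keys.append(h.copy().hexdigest()[:12])
    return keys

a = chain_keys([1, 2, 3, 4, 5, 6, 7, 8])
b = chain_keys([1, 2, 3, 4, 9, 9, 9, 9])  # same first chunk, then diverges
# First chunk keys match; every key after the divergence differs.
assert a[0] == b[0] and a[1] != b[1]
```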

Related Pages

Implemented By

Uses Heuristic
