
Heuristic:LMCache Prefix Based Retrieval Pattern

From Leeroopedia



Knowledge Sources
Domains Caching, LLMs, Architecture
Last Updated 2026-02-09 00:00 GMT

Overview

Architectural pattern where KV cache retrieval always follows prefix order (tokens 0-t1 from backend 1, t1-t2 from backend 2), with suffix chunks expected to be evicted more frequently than prefix chunks.

Description

LMCache's multi-tier storage system assumes that retrieval is always prefix-based: earlier token chunks (prefixes) are retrieved from faster backends (GPU, CPU) while later chunks (suffixes) may come from slower backends (disk, remote). This ordering reflects the observation that prefix tokens (system prompts, shared context) have higher reuse probability than suffix tokens (user-specific queries). The single-worker-per-engine assumption simplifies synchronization by ensuring no concurrent lookups race on the same token database, and the suffix-to-prefix ordering in the local CPU backend dictionary maintains optimal eviction order.
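The prefix-ordered walk across tiers can be sketched as follows. This is an illustrative toy, not LMCache's actual API: the `Backend` class, `retrieve_prefix` function, and chunk-index keying are all assumptions made for the example.

```python
# Hypothetical sketch of prefix-ordered multi-tier retrieval: walk the
# chunk sequence front to back, taking each chunk from the fastest
# backend that still holds it, and stop at the first total miss.
from typing import Optional


class Backend:
    """Toy backend: maps chunk index -> cached KV payload."""

    def __init__(self, name: str, chunks: dict):
        self.name = name
        self.chunks = chunks

    def get(self, idx: int) -> Optional[str]:
        return self.chunks.get(idx)


def retrieve_prefix(backends: list, num_chunks: int):
    """Return (chunk_idx, tier, payload) hits in prefix order.

    A miss in every tier ends the prefix: later chunks are never
    fetched, matching the strictly prefix-based retrieval pattern.
    """
    out = []
    for idx in range(num_chunks):
        for b in backends:  # ordered fastest (GPU/CPU) -> slowest (disk/remote)
            payload = b.get(idx)
            if payload is not None:
                out.append((idx, b.name, payload))
                break
        else:
            break  # no tier holds this chunk -> the usable prefix ends here
    return out


fast = Backend("cpu", {0: "kv0", 1: "kv1"})
slow = Backend("disk", {2: "kv2"})  # suffix chunk demoted to a slower tier
hits = retrieve_prefix([fast, slow], num_chunks=4)
# chunks 0-1 come from cpu, chunk 2 from disk; chunk 3 misses, ending the prefix
```

Note how the ordering assumption does the work: because retrieval never skips ahead, one miss cleanly terminates the lookup, and slower tiers are only consulted for the tail of the prefix.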

Usage

This heuristic informs cache tier placement and eviction strategy. When configuring multi-tier storage (CPU + disk + remote), place frequently shared system prompts as prefix tokens for best cache hit rates. The LRU eviction policy works well with this pattern because prefix chunks are accessed more recently during each request. For CacheBlend (non-prefix reuse), this assumption is explicitly relaxed via the segment-based token database.
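The "shared context as prefix" guidance reduces to how prompts are assembled. A minimal sketch, with made-up prompt strings and a hypothetical `build_prompt` helper:

```python
# Illustrative only: put the shared system prompt and few-shot examples
# first, so every request shares an identical token prefix whose KV
# chunks can be cached and reused across requests.
SYSTEM = "You are a helpful assistant.\n"
FEW_SHOT = "Q: What is 2+2? A: 4\n"


def build_prompt(user_query: str) -> str:
    # Shared, reusable context first; the user-specific part last.
    return SYSTEM + FEW_SHOT + user_query


a = build_prompt("Summarize this report.")
b = build_prompt("Translate this to French.")
shared = len(SYSTEM + FEW_SHOT)
assert a[:shared] == b[:shared]  # identical prefix -> cacheable KV chunks
```

Placing the varying query last means only the suffix chunks differ per request, which is exactly the region this pattern expects to evict first.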

The Insight (Rule of Thumb)

  • Action: Design token sequences so shared context appears as prefixes (earliest tokens).
  • Value: Prefix chunks get highest priority in local storage; suffix chunks are evicted first.
  • Trade-off: Non-prefix cache reuse (e.g., shared suffixes across requests) is less efficient under this model. CacheBlend addresses this limitation.
  • Architecture: One worker per cache engine simplifies synchronization; no concurrent lookup races.

Reasoning

LLM serving workloads naturally have prefix-heavy patterns: system prompts, few-shot examples, and shared context are token prefixes reused across many requests, while the user-specific query at the end varies. By assuming prefix-based retrieval, the storage manager can optimize the common case: read prefix chunks from fast local CPU storage and fall through to slower backends only for rarely-reused suffix chunks. The suffix-first eviction strategy (maintained via dictionary ordering in `local_cpu_backend.py`) ensures the most reusable chunks survive longest.
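The suffix-first eviction idea can be sketched with an `OrderedDict`. This mirrors the intent of the comment in `local_cpu_backend.py` but is assumed mechanics, not that file's actual code: on a hit, re-insert the request's chunks in reverse so the dict ends up in suffix-to-prefix order, and evict from the front.

```python
# Sketch: keep the chunk dict in suffix -> prefix order so that
# front-of-dict eviction drops suffix chunks before prefix chunks.
from collections import OrderedDict

cache = OrderedDict((f"chunk{i}", f"kv{i}") for i in range(3))


def touch(keys: list) -> None:
    """Record a prefix-ordered lookup hit (keys = [chunk0, chunk1, ...]).

    Re-inserting in reverse leaves suffix chunks at the front of the
    dict and prefix chunks at the back, i.e. evicted last.
    """
    for k in reversed(keys):
        cache.move_to_end(k)


touch(["chunk0", "chunk1", "chunk2"])
victim, _ = cache.popitem(last=False)  # evict from the front
# victim is "chunk2": the suffix chunk goes first, prefix chunks survive
```

Because only one worker performs lookups per cache engine, no concurrent `touch` calls can interleave and scramble this ordering, which is precisely why the single-worker assumption matters.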

Code Evidence

Prefix-based retrieval assumption from `lmcache/v1/storage_backend/storage_manager.py:669-673`:

# NOTE(Jiayi): Currently, the retrieval pattern is always
# prefix-based. That is, we retrieve 0-t1 tokens from backend 1
# and retrieve t1-t2 tokens from backend 2, etc. The assumption
# here is that the suffix chunks are more likely to be evicted
# than the prefix chunks.

Single worker assumption from `lmcache/v1/storage_backend/local_cpu_backend.py:84-86`:

# to help maintain suffix -> prefix order in the dict
# assumption: only one request is looked up at a time
# (only one worker per cache engine)

Tensor parallel intra-node assumption from `lmcache/integration/vllm/utils.py:318-322`:

"""
Current assumption (TODO: add custom logic in the future):
- Tensor Parallel is intra-node
- Pipeline Parallel is inter-node
"""

PD backend chunk ordering from `lmcache/v1/storage_backend/pd_backend.py:358-364`:

"""
We have the following assumptions:
- The first N-1 memory objects are full chunks, each with
  `full_chunk_size` tokens.
- The last memory object can be a partial chunk, which has
  `last_chunk_toks` tokens.
"""
