
Implementation:LMCache LMCacheEngine Retrieve

From Leeroopedia


Knowledge Sources
Domains Caching, Inference_Optimization
Last Updated 2026-02-09 00:00 GMT

Overview

A concrete tool, provided by the LMCacheEngine class, for retrieving cached KV tensors and loading them into GPU memory.

Description

The LMCacheEngine.retrieve method processes an incoming request's tokens, looks up matching chunks in the storage backends, and loads the corresponding KV tensors into the serving engine's GPU paged KV buffer. It returns a boolean mask indicating which token positions were successfully loaded from cache (and thus do not need prefill computation).

Usage

This method is called by the vLLM connector before each inference forward pass. It is the primary mechanism for reducing time to first token (TTFT) through KV cache reuse.
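The returned mask tells the connector how much prefill work it can skip. With chunk-based prefix caching, hits typically form a contiguous prefix of the request; the sketch below shows that connector-side bookkeeping in pure Python (plain lists stand in for the torch.Tensor mask, and the contiguous-prefix assumption is ours, not a guarantee of the API):

```python
def num_cached_prefix(ret_mask):
    """Count leading True values: positions whose KV was loaded from cache.

    Assumes hits form a contiguous prefix, as with chunk-based
    prefix caching; prefill must resume at the first False.
    """
    n = 0
    for hit in ret_mask:
        if not hit:
            break
        n += 1
    return n


tokens = [101, 7592, 2088, 2003, 1037, 3231]
ret_mask = [True, True, True, True, False, False]  # example retrieve() result

n_cached = num_cached_prefix(ret_mask)
to_prefill = tokens[n_cached:]  # only these positions need prefill compute
print(f"Skipping prefill for {n_cached} of {len(tokens)} tokens")
```

In the real connector the mask is a CPU `torch.Tensor`, so the prefix length would be computed with tensor ops rather than a Python loop.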

Code Reference

Source Location

  • Repository: LMCache
  • File: lmcache/v1/cache_engine.py
  • Lines: L707-L849

Signature

@torch.inference_mode()
def retrieve(
    self,
    tokens: Union[torch.Tensor, list[int]],
    mask: Optional[torch.Tensor] = None,
    **kwargs,
) -> torch.Tensor:
    """Retrieve KV caches and load them into GPU.

    Args:
        tokens: The tokens of the corresponding KV caches.
        mask: Optional boolean mask (FFFFFTTTTTTT format).
        **kwargs: KV cache specific info (paged KV buffer, page tables).

    Returns:
        Boolean mask indicating which tokens were loaded from cache (on CPU).
    """

Import

from lmcache.v1.cache_engine import LMCacheEngine, LMCacheEngineBuilder

I/O Contract

Inputs

Name | Type | Required | Description
tokens | Union[torch.Tensor, list[int]] | Yes | Token IDs for the new request
mask | Optional[torch.Tensor] | No | Boolean mask for selective retrieval
**kwargs | dict | Yes | Paged KV buffer and page tables from the serving engine

Outputs

Name | Type | Description
return | torch.Tensor | Boolean mask (on CPU) indicating which tokens were loaded from cache
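The optional `mask` input uses the FFFFFTTTTTTT layout mentioned in the docstring. One plausible reading (an assumption on our part, not confirmed by the source): leading False positions are skipped, e.g. because their KV is already resident in the GPU buffer, and trailing True positions are candidates for retrieval. A hypothetical helper that builds such a mask with plain lists:

```python
def build_retrieve_mask(total_len, already_loaded):
    # Hypothetical helper: leading positions already present in the GPU
    # KV buffer are marked False (skip), the rest True (retrieve) --
    # the "FFFFFTTTTTTT" layout from the retrieve() docstring.
    assert 0 <= already_loaded <= total_len
    return [False] * already_loaded + [True] * (total_len - already_loaded)


mask = build_retrieve_mask(8, 3)
print(mask)  # [False, False, False, True, True, True, True, True]
```

A real call would pass this as `torch.tensor(mask, dtype=torch.bool)`.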

Usage Examples

Check Cache Hit Ratio

import torch
from lmcache.v1.cache_engine import LMCacheEngineBuilder

engine = LMCacheEngineBuilder.get("lmcache")

# Token IDs for the incoming request.
tokens = torch.tensor([1, 2, 3, 4], dtype=torch.long)

# kv_caches and slot_mapping come from the serving engine's
# paged KV buffer and page tables.
ret_mask = engine.retrieve(
    tokens=tokens,
    kv_caches=kv_buffer,
    slot_mapping=slot_mapping,
)

cached_tokens = ret_mask.sum().item()
total_tokens = len(tokens)
print(f"Cache hit: {cached_tokens}/{total_tokens} tokens")
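The hit-ratio arithmetic above can be exercised without a running engine. A pure-Python check of the same computation, with a plain list standing in for the returned CPU tensor:

```python
ret_mask = [True] * 6 + [False] * 2   # e.g. 6 of 8 positions served from cache
cached_tokens = sum(ret_mask)          # True counts as 1
total_tokens = len(ret_mask)
print(f"Cache hit: {cached_tokens}/{total_tokens} tokens "
      f"({cached_tokens / total_tokens:.0%})")
# Cache hit: 6/8 tokens (75%)
```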

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
