Implementation: LMCache LMCacheEngine Retrieve
| Knowledge Sources | |
|---|---|
| Domains | Caching, Inference_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for retrieving cached KV tensors and loading them into GPU memory, provided by the LMCacheEngine class.
Description
The LMCacheEngine.retrieve method processes an incoming request's tokens, looks up matching chunks in the storage backends, and loads the corresponding KV tensors into the serving engine's GPU paged KV buffer. It returns a boolean mask indicating which token positions were successfully loaded from cache (and thus do not need prefill computation).
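The returned mask makes the remaining prefill work easy to compute. A minimal sketch, assuming a hypothetical `retrieve` result (no LMCache API calls, just the mask convention described above):

```python
import torch

# Hypothetical retrieve() result for an 8-token request: the first 6
# positions were found in cache, the last 2 were not.
ret_mask = torch.tensor([True, True, True, True, True, True, False, False])

num_cached = int(ret_mask.sum().item())         # tokens loaded from cache
num_to_prefill = ret_mask.numel() - num_cached  # tokens the engine must compute
```

Here `num_cached` is 6 and `num_to_prefill` is 2, so only a quarter of the request still needs prefill computation.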
Usage
This method is called by the vLLM connector before each inference forward pass. It is the primary mechanism for reducing time-to-first-token (TTFT) through KV cache reuse.
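Conceptually, the connector uses the result to narrow the prefill pass to the uncached suffix. A rough sketch of that step, assuming prefix caching (so cache hits form a contiguous prefix); the helper name is illustrative, not a vLLM or LMCache API:

```python
import torch

def prefill_range(ret_mask: torch.Tensor) -> slice:
    # With prefix caching, the cached region is a contiguous prefix of the
    # request, so only the remaining suffix needs the prefill forward pass.
    num_cached = int(ret_mask.sum().item())
    return slice(num_cached, ret_mask.numel())

ret_mask = torch.tensor([True] * 5 + [False] * 3)  # hypothetical retrieve() output
rng = prefill_range(ret_mask)  # positions 5..8 still need prefill
```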
Code Reference
Source Location
- Repository: LMCache
- File: lmcache/v1/cache_engine.py
- Lines: L707-L849
Signature
```python
@torch.inference_mode()
def retrieve(
    self,
    tokens: Union[torch.Tensor, list[int]],
    mask: Optional[torch.Tensor] = None,
    **kwargs,
) -> torch.Tensor:
    """Retrieve KV caches and load them into GPU.

    Args:
        tokens: The tokens of the corresponding KV caches.
        mask: Optional boolean mask (FFFFFTTTTTTT format).
        **kwargs: KV cache specific info (paged KV buffer, page tables).

    Returns:
        Boolean mask indicating which tokens were loaded from cache (on CPU).
    """
```
Import
```python
from lmcache.v1.cache_engine import LMCacheEngine, LMCacheEngineBuilder
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokens | Union[torch.Tensor, list[int]] | Yes | Token IDs for the new request |
| mask | Optional[torch.Tensor] | No | Boolean mask for selective retrieval |
| **kwargs | dict | Yes | Paged KV buffer and page tables from serving engine |
Outputs
| Name | Type | Description |
|---|---|---|
| return | torch.Tensor | Boolean mask (CPU) indicating tokens loaded from cache |
Usage Examples
Check Cache Hit Ratio
```python
import torch
from lmcache.v1.cache_engine import LMCacheEngineBuilder

# Look up an engine that was previously built under the instance ID "lmcache".
engine = LMCacheEngineBuilder.get("lmcache")

# Token IDs of the incoming request.
tokens = torch.tensor([1, 2, 3, 4], dtype=torch.long)

# kv_buffer and slot_mapping come from the serving engine's paged KV cache.
ret_mask = engine.retrieve(
    tokens=tokens,
    kv_caches=kv_buffer,
    slot_mapping=slot_mapping,
)

cached_tokens = ret_mask.sum().item()
total_tokens = len(tokens)
print(f"Cache hit: {cached_tokens}/{total_tokens} tokens")
```