Implementation: LMCache LMCacheEngine Retrieve
| Knowledge Sources | |
|---|---|
| Domains | Caching, Inference_Optimization |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Concrete tool for retrieving cached KV tensors and loading them into GPU memory, provided by the LMCacheEngine class.
Description
The LMCacheEngine.retrieve method processes an incoming request's tokens, looks up matching chunks in the storage backends, and loads the corresponding KV tensors into the serving engine's GPU paged KV buffer. It returns a boolean mask indicating which token positions were successfully loaded from cache (and thus do not need prefill computation).
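The returned mask makes the remaining prefill work easy to compute. A minimal sketch, assuming a hypothetical `retrieve` result (no LMCache API calls, just the mask convention described above):

```python
import torch

# Hypothetical retrieve() result for an 8-token request: the first 6
# positions were found in cache, the last 2 were not.
ret_mask = torch.tensor([True, True, True, True, True, True, False, False])

num_cached = int(ret_mask.sum().item())         # tokens loaded from cache
num_to_prefill = ret_mask.numel() - num_cached  # tokens the engine must compute
```

Here `num_cached` is 6 and `num_to_prefill` is 2, so only a quarter of the request still needs prefill computation.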
Usage
This method is called by the vLLM connector before each inference forward pass. It is the primary mechanism for reducing time-to-first-token (TTFT) through KV cache reuse.
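Conceptually, the connector uses the result to narrow the prefill pass to the uncached suffix. A rough sketch of that step, assuming prefix caching (so cache hits form a contiguous prefix); the helper name is illustrative, not a vLLM or LMCache API:

```python
import torch

def prefill_range(ret_mask: torch.Tensor) -> slice:
    # With prefix caching, the cached region is a contiguous prefix of the
    # request, so only the remaining suffix needs the prefill forward pass.
    num_cached = int(ret_mask.sum().item())
    return slice(num_cached, ret_mask.numel())

ret_mask = torch.tensor([True] * 5 + [False] * 3)  # hypothetical retrieve() output
rng = prefill_range(ret_mask)  # positions 5..8 still need prefill
```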
Code Reference
Source Location
- Repository: LMCache
- File: lmcache/v1/cache_engine.py
- Lines: L707-L849
Signature
```python
@torch.inference_mode()
def retrieve(
    self,
    tokens: Union[torch.Tensor, list[int]],
    mask: Optional[torch.Tensor] = None,
    **kwargs,
) -> torch.Tensor:
    """Retrieve KV caches and load them into GPU.

    Args:
        tokens: The tokens of the corresponding KV caches.
        mask: Optional boolean mask (FFFFFTTTTTTT format).
        **kwargs: KV cache specific info (paged KV buffer, page tables).

    Returns:
        Boolean mask indicating which tokens were loaded from cache (on CPU).
    """
```
Import
```python
from lmcache.v1.cache_engine import LMCacheEngine, LMCacheEngineBuilder
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokens | Union[torch.Tensor, list[int]] | Yes | Token IDs for the new request |
| mask | Optional[torch.Tensor] | No | Boolean mask for selective retrieval |
| **kwargs | dict | Yes | Paged KV buffer and page tables from serving engine |
Outputs
| Name | Type | Description |
|---|---|---|
| return | torch.Tensor | Boolean mask (CPU) indicating tokens loaded from cache |
Usage Examples
Check Cache Hit Ratio
```python
import torch
from lmcache.v1.cache_engine import LMCacheEngineBuilder

# Look up an engine that was previously built under the instance ID "lmcache".
engine = LMCacheEngineBuilder.get("lmcache")

# Token IDs of the incoming request.
tokens = torch.tensor([1, 2, 3, 4], dtype=torch.long)

# kv_buffer and slot_mapping come from the serving engine's paged KV cache.
ret_mask = engine.retrieve(
    tokens=tokens,
    kv_caches=kv_buffer,
    slot_mapping=slot_mapping,
)

cached_tokens = ret_mask.sum().item()
total_tokens = len(tokens)
print(f"Cache hit: {cached_tokens}/{total_tokens} tokens")
```