Implementation: LMCache LMCBlender Blend
| Knowledge Sources | Details |
|---|---|
| Domains | Deep_Learning, Attention_Mechanisms |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The LMCBlender class provides a concrete tool for blending cached and recomputed KV values, recovering correct RoPE positions for reused cache segments.
Description
The LMCBlender.blend method orchestrates the CacheBlend algorithm: it retrieves each segment's cached KV from the engine, then for each layer calls blend_layer, which invokes process_qkv. process_qkv applies FusedRope.fused_encode to re-rotate K to its new positions, computes divergence at designated check layers, and selectively recomputes only the most divergent positions.
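The selective-recomputation step can be sketched as follows. This is a toy illustration, not the LMCache implementation: the function name, the plain-list divergence scores, and the 15% default ratio are all assumptions; the real code operates on GPU tensors.

```python
# Toy sketch of CacheBlend-style selective recomputation: given per-token
# divergence between cached and freshly computed KV at a check layer,
# recompute only the top fraction of positions (names/ratio hypothetical).

def select_recompute_positions(divergence, ratio=0.15):
    """Return indices of the most divergent tokens, largest divergence first."""
    n = max(1, int(len(divergence) * ratio))
    return sorted(range(len(divergence)), key=lambda i: -divergence[i])[:n]

# Example: tokens 7 and 2 diverge most, so only they are recomputed.
div = [0.01, 0.02, 0.90, 0.03, 0.01, 0.02, 0.04, 1.20, 0.02, 0.01]
print(select_recompute_positions(div, ratio=0.2))  # → [7, 2]
```

Recomputing only a small, most-divergent subset is what lets CacheBlend reuse out-of-position segment caches at a fraction of full-prefill cost.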
Usage
Called by the vLLM connector during the blend code path. The blender intercepts the retrieve operation to apply RoPE correction and selective recomputation.
Code Reference
Source Location
- Repository: LMCache
- File: lmcache/v1/compute/blend/blender.py
- Lines: L24-L168
Signature
```python
class LMCBlender:
    def __init__(
        self,
        cache_engine: LMCacheEngine,
        gpu_connector: GPUConnectorInterface,
        config: LMCacheEngineConfig,
    ):
        """Initialize blender with cache engine and config."""

    def blend(
        self,
        tokens: Union[torch.Tensor, list[int]],
        mask: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> None:
        """Run CacheBlend: retrieve cached KV, correct RoPE, selective recompute.

        Args:
            tokens: Input token IDs with segments separated by blend_special_str
            mask: Optional retrieval mask
            **kwargs: KV cache buffers and page tables
        """

    def process_qkv(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        residual: torch.Tensor,
        layer_id: int,
        attn_output: torch.Tensor,
        attn_metadata: Any,
    ) -> None:
        """Per-layer QKV processing with RoPE correction and divergence check."""
```
Import
```python
from lmcache.v1.compute.blend.blender import LMCBlender
```
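The RoPE correction that process_qkv relies on exploits a property of rotary embeddings: rotating an already-encoded key by the position delta equals encoding the raw key at the new position. The sketch below demonstrates this on a single 2-dim feature pair in pure Python; the function names are illustrative, while the real FusedRope.fused_encode performs the fused rotation across heads on the GPU.

```python
import math

# Sketch of RoPE position correction on one 2-dim feature pair (frequency
# fixed at 1 for simplicity). Hypothetical helpers, not the LMCache API.

def rope(pair, pos):
    """Apply a rotary embedding rotation at position `pos` to an (x, y) pair."""
    c, s = math.cos(pos), math.sin(pos)
    x, y = pair
    return (x * c - y * s, x * s + y * c)

def correct_rope(encoded_pair, old_pos, new_pos):
    """Re-rotate a cached (already-encoded) pair by the position delta."""
    return rope(encoded_pair, new_pos - old_pos)

k = (0.3, -1.2)
cached = rope(k, pos=5)                 # key cached while at position 5
fixed = correct_rope(cached, 5, 42)     # segment reused at position 42
direct = rope(k, pos=42)                # fresh encoding at position 42
assert all(abs(a - b) < 1e-9 for a, b in zip(fixed, direct))
```

Because rotations compose additively, the cached K never needs to be de-encoded back to its raw form; a single delta rotation suffices.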
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokens | Union[torch.Tensor, list[int]] | Yes | Token IDs with separator-delimited segments |
| mask | Optional[torch.Tensor] | No | Retrieval mask |
| **kwargs | dict | Yes | KV buffers, page tables, attention metadata |
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | Blended KV cache written to GPU buffers with corrected positions |
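The None return with in-place GPU writes can be pictured as the merge below: positions selected for recomputation receive fresh values, while all others keep the (RoPE-corrected) cached values. The function and buffer layout here are assumptions for illustration only.

```python
# Sketch of the blended write-back (hypothetical helper): selected positions
# get recomputed values, the rest keep corrected cached values, and the
# buffer is mutated in place, mirroring blend() returning None.

def write_blended(buffer, cached, recomputed, selected):
    """Fill `buffer` in place from cached values, overriding selected positions."""
    for i in range(len(buffer)):
        buffer[i] = recomputed[i] if i in selected else cached[i]

buf = [None] * 5
write_blended(buf, ["c0", "c1", "c2", "c3", "c4"],
              ["r0", "r1", "r2", "r3", "r4"], selected={1, 3})
print(buf)  # → ['c0', 'r1', 'c2', 'r3', 'c4']
```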
Usage Examples
CacheBlend Inference Flow
```python
# Conceptual flow (handled internally by the vLLM connector):

# 1. First request stores each segment's KV cache independently.
#    prompt = sys_prompt + sep + chunk1 + sep + chunk2 + question
engine.store(tokens)

# 2. Second request reorders the segments; blend() retrieves the cached
#    segment KV, corrects RoPE positions, and selectively recomputes.
#    prompt = sys_prompt + sep + chunk2 + sep + chunk1 + question
blender.blend(tokens)
```
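The separator-delimited token layout above can be illustrated with a small splitter. The separator token ID used here is a placeholder, not the ID that blend_special_str actually tokenizes to:

```python
# Hypothetical illustration of separator-delimited segments: blend() receives
# token IDs in which a special separator token marks segment boundaries, so
# each chunk's KV can be looked up independently. Separator ID is assumed.

def split_segments(tokens, sep_id):
    """Split a flat token list into segments on the separator token."""
    segments, current = [], []
    for t in tokens:
        if t == sep_id:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(t)
    if current:
        segments.append(current)
    return segments

tokens = [1, 2, 3, 99999, 7, 8, 99999, 4, 5]
print(split_segments(tokens, 99999))  # → [[1, 2, 3], [7, 8], [4, 5]]
```

Segment-wise lookup is why the second request's reordered chunks still hit the cache: each segment's KV was stored under its own key, independent of position.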