Implementation: LMCache LMCBlender Blend
| Knowledge Sources | Details |
|---|---|
| Domains | Deep_Learning, Attention_Mechanisms |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
The LMCBlender class provides a concrete tool for blending cached and recomputed KV values, recovering correct RoPE positions for reused cache segments.
Description
The LMCBlender.blend method orchestrates the CacheBlend algorithm: it retrieves each segment's cached KV from the engine, then for each layer calls blend_layer, which invokes process_qkv. process_qkv applies FusedRope.fused_encode to re-rotate K to its new positions, computes divergence at designated check layers, and selectively recomputes only the most divergent positions.
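The selective-recomputation step can be sketched as follows. This is a toy illustration, not the LMCache implementation: the function name, the plain-list divergence scores, and the 15% default ratio are all assumptions; the real code operates on GPU tensors.

```python
# Toy sketch of CacheBlend-style selective recomputation: given per-token
# divergence between cached and freshly computed KV at a check layer,
# recompute only the top fraction of positions (names/ratio hypothetical).

def select_recompute_positions(divergence, ratio=0.15):
    """Return indices of the most divergent tokens, largest divergence first."""
    n = max(1, int(len(divergence) * ratio))
    return sorted(range(len(divergence)), key=lambda i: -divergence[i])[:n]

# Example: tokens 7 and 2 diverge most, so only they are recomputed.
div = [0.01, 0.02, 0.90, 0.03, 0.01, 0.02, 0.04, 1.20, 0.02, 0.01]
print(select_recompute_positions(div, ratio=0.2))  # → [7, 2]
```

Recomputing only a small, most-divergent subset is what lets CacheBlend reuse out-of-position segment caches at a fraction of full-prefill cost.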
Usage
Called by the vLLM connector during the blend code path. The blender intercepts the retrieve operation to apply RoPE correction and selective recomputation.
Code Reference
Source Location
- Repository: LMCache
- File: lmcache/v1/compute/blend/blender.py
- Lines: L24-L168
Signature
```python
class LMCBlender:
    def __init__(
        self,
        cache_engine: LMCacheEngine,
        gpu_connector: GPUConnectorInterface,
        config: LMCacheEngineConfig,
    ):
        """Initialize blender with cache engine and config."""

    def blend(
        self,
        tokens: Union[torch.Tensor, list[int]],
        mask: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> None:
        """Run CacheBlend: retrieve cached KV, correct RoPE, selective recompute.

        Args:
            tokens: Input token IDs with segments separated by blend_special_str
            mask: Optional retrieval mask
            **kwargs: KV cache buffers and page tables
        """

    def process_qkv(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        residual: torch.Tensor,
        layer_id: int,
        attn_output: torch.Tensor,
        attn_metadata: Any,
    ) -> None:
        """Per-layer QKV processing with RoPE correction and divergence check."""
```
Import
```python
from lmcache.v1.compute.blend.blender import LMCBlender
```
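The RoPE correction that process_qkv relies on exploits a property of rotary embeddings: rotating an already-encoded key by the position delta equals encoding the raw key at the new position. The sketch below demonstrates this on a single 2-dim feature pair in pure Python; the function names are illustrative, while the real FusedRope.fused_encode performs the fused rotation across heads on the GPU.

```python
import math

# Sketch of RoPE position correction on one 2-dim feature pair (frequency
# fixed at 1 for simplicity). Hypothetical helpers, not the LMCache API.

def rope(pair, pos):
    """Apply a rotary embedding rotation at position `pos` to an (x, y) pair."""
    c, s = math.cos(pos), math.sin(pos)
    x, y = pair
    return (x * c - y * s, x * s + y * c)

def correct_rope(encoded_pair, old_pos, new_pos):
    """Re-rotate a cached (already-encoded) pair by the position delta."""
    return rope(encoded_pair, new_pos - old_pos)

k = (0.3, -1.2)
cached = rope(k, pos=5)                 # key cached while at position 5
fixed = correct_rope(cached, 5, 42)     # segment reused at position 42
direct = rope(k, pos=42)                # fresh encoding at position 42
assert all(abs(a - b) < 1e-9 for a, b in zip(fixed, direct))
```

Because rotations compose additively, the cached K never needs to be de-encoded back to its raw form; a single delta rotation suffices.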
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokens | Union[torch.Tensor, list[int]] | Yes | Token IDs with separator-delimited segments |
| mask | Optional[torch.Tensor] | No | Retrieval mask |
| **kwargs | dict | Yes | KV buffers, page tables, attention metadata |
Outputs
| Name | Type | Description |
|---|---|---|
| (none) | None | Blended KV cache written to GPU buffers with corrected positions |
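The None return with in-place GPU writes can be pictured as the merge below: positions selected for recomputation receive fresh values, while all others keep the (RoPE-corrected) cached values. The function and buffer layout here are assumptions for illustration only.

```python
# Sketch of the blended write-back (hypothetical helper): selected positions
# get recomputed values, the rest keep corrected cached values, and the
# buffer is mutated in place, mirroring blend() returning None.

def write_blended(buffer, cached, recomputed, selected):
    """Fill `buffer` in place from cached values, overriding selected positions."""
    for i in range(len(buffer)):
        buffer[i] = recomputed[i] if i in selected else cached[i]

buf = [None] * 5
write_blended(buf, ["c0", "c1", "c2", "c3", "c4"],
              ["r0", "r1", "r2", "r3", "r4"], selected={1, 3})
print(buf)  # → ['c0', 'r1', 'c2', 'r3', 'c4']
```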
Usage Examples
CacheBlend Inference Flow
```python
# Conceptual flow (handled internally by the vLLM connector):

# 1. First request stores each segment's KV cache independently.
#    prompt = sys_prompt + sep + chunk1 + sep + chunk2 + question
engine.store(tokens)

# 2. Second request reorders the segments; blend() retrieves the cached
#    segment KV, corrects RoPE positions, and selectively recomputes.
#    prompt = sys_prompt + sep + chunk2 + sep + chunk1 + question
blender.blend(tokens)
```
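The separator-delimited token layout above can be illustrated with a small splitter. The separator token ID used here is a placeholder, not the ID that blend_special_str actually tokenizes to:

```python
# Hypothetical illustration of separator-delimited segments: blend() receives
# token IDs in which a special separator token marks segment boundaries, so
# each chunk's KV can be looked up independently. Separator ID is assumed.

def split_segments(tokens, sep_id):
    """Split a flat token list into segments on the separator token."""
    segments, current = [], []
    for t in tokens:
        if t == sep_id:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(t)
    if current:
        segments.append(current)
    return segments

tokens = [1, 2, 3, 99999, 7, 8, 99999, 4, 5]
print(split_segments(tokens, 99999))  # → [[1, 2, 3], [7, 8], [4, 5]]
```

Segment-wise lookup is why the second request's reordered chunks still hit the cache: each segment's KV was stored under its own key, independent of position.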