Implementation: LMCache LMCBlender.blend

From Leeroopedia


Knowledge Sources

  • Domains: Deep_Learning, Attention_Mechanisms
  • Last Updated: 2026-02-09 00:00 GMT

Overview

A concrete tool for blending cached and recomputed KV values with RoPE position recovery, provided by the LMCBlender class.

Description

The LMCBlender.blend method orchestrates the CacheBlend algorithm: it retrieves each segment's cached KV from the cache engine, then for each layer calls blend_layer, which invokes process_qkv. process_qkv applies FusedRope.fused_encode to correct the K positions, computes divergence at designated check layers, and selectively recomputes only the most divergent token positions.
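The selective-recomputation step above can be sketched in isolation. The helper names (`kv_divergence`, `select_divergent`) and the fixed recompute ratio are assumptions for this illustration, not LMCache's actual API; the idea is simply to score each cached position by how far its key deviates from a freshly computed one and recompute the top fraction.

```python
# Illustrative sketch of CacheBlend-style selective recomputation.
# Helper names and the recompute ratio are assumptions for this
# example; they are not LMCache's actual API.

def kv_divergence(cached_k, fresh_k):
    """Per-position squared L2 deviation between cached and freshly
    computed key vectors (lists of equal-length float lists)."""
    return [
        sum((c - f) ** 2 for c, f in zip(ck, fk))
        for ck, fk in zip(cached_k, fresh_k)
    ]

def select_divergent(cached_k, fresh_k, ratio=0.25):
    """Indices of the top `ratio` fraction of positions by divergence,
    i.e. the positions that would be recomputed."""
    scores = kv_divergence(cached_k, fresh_k)
    n_recompute = max(1, int(len(scores) * ratio))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_recompute])

# Four cached positions; position 2 diverges most from the fresh values.
cached = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [1.0, 1.0]]
fresh = [[1.0, 0.1], [0.0, 1.0], [0.0, 0.0], [1.1, 1.0]]
print(select_divergent(cached, fresh))  # -> [2]
```

In the real implementation this scoring happens at the check layers inside process_qkv, so only a small fraction of positions pay the cost of full recomputation.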

Usage

Called by the vLLM connector during the blend code path. The blender intercepts the retrieve operation to apply RoPE correction and selective recomputation.

Code Reference

Source Location

  • Repository: LMCache
  • File: lmcache/v1/compute/blend/blender.py
  • Lines: L24-L168

Signature

class LMCBlender:
    def __init__(
        self,
        cache_engine: LMCacheEngine,
        gpu_connector: GPUConnectorInterface,
        config: LMCacheEngineConfig,
    ):
        """Initialize blender with cache engine and config."""

    def blend(
        self,
        tokens: Union[torch.Tensor, list[int]],
        mask: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> None:
        """Run CacheBlend: retrieve cached KV, correct RoPE, selective recompute.

        Args:
            tokens: Input token IDs with segments separated by blend_special_str
            mask: Optional retrieval mask
            **kwargs: KV cache buffers and page tables
        """

    def process_qkv(
        self,
        q: torch.Tensor,
        k: torch.Tensor,
        v: torch.Tensor,
        residual: torch.Tensor,
        layer_id: int,
        attn_output: torch.Tensor,
        attn_metadata: Any,
    ) -> None:
        """Per-layer QKV processing with RoPE correction and divergence check."""

Import

from lmcache.v1.compute.blend.blender import LMCBlender

I/O Contract

Inputs

  • tokens (Union[torch.Tensor, list[int]], required): Token IDs with separator-delimited segments
  • mask (Optional[torch.Tensor], optional): Retrieval mask
  • **kwargs (dict, required): KV buffers, page tables, attention metadata

Outputs

  • Returns None: the blended KV cache is written in place to GPU buffers with corrected positions

Usage Examples

CacheBlend Inference Flow

# Conceptual flow (handled internally by connector):
# 1. First request stores segment KV caches
#    prompt = sys_prompt + sep + chunk1 + sep + chunk2 + question
#    engine.store(tokens)  # Each segment cached independently

# 2. Second request with reordered segments
#    prompt = sys_prompt + sep + chunk2 + sep + chunk1 + question
#    blender.blend(tokens)  # Retrieves cached segments, corrects RoPE
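The separator-delimited segmentation that blend expects can be sketched as plain token-list splitting; this is the role blend_special_str plays once tokenized. The separator ID and helper name below are assumptions for the sketch, not LMCache's API.

```python
# Illustrative splitting of a token sequence into cacheable segments at
# each separator token. The separator ID (999) and the helper name are
# assumptions for this sketch, not LMCache's API.

def split_segments(tokens, sep_id):
    """Split a flat token-ID list into segments at each separator."""
    segments, current = [], []
    for t in tokens:
        if t == sep_id:
            segments.append(current)
            current = []
        else:
            current.append(t)
    segments.append(current)
    return segments

SEP = 999
tokens = [1, 2, SEP, 10, 11, 12, SEP, 20, 21]
print(split_segments(tokens, SEP))  # -> [[1, 2], [10, 11, 12], [20, 21]]
```

Because each segment is cached under its own key, the second request in the flow above can reuse chunk1 and chunk2 caches even though their order changed.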

Related Pages

Implements Principle

Requires Environment
