

Implementation:LMCache LMCache XPU Connector

From Leeroopedia


Knowledge Sources
Domains: GPU Connector, KV Cache Transfer
Last Updated: 2026-02-09 00:00 GMT

Overview

Implements the XPU (Intel GPU) variant of the vLLM paged-memory GPU connector for transferring KV cache data between host memory objects and XPU device KV caches.

Description

VLLMPagedMemXPUConnectorV2 extends VLLMPagedMemGPUConnectorV2 to support KV cache transfers on Intel XPU devices. It handles both standard MHA (Multi-Head Attention) KV caches in KV_2LTD format and MLA (Multi-head Latent Attention) caches in KV_MLA_FMT format.

The to_gpu method copies data from a host MemoryObj into the device-resident paged KV caches, writing each token to the slot given by the slot mapping with index_copy_. The from_gpu method gathers data from the device KV caches into a host MemoryObj with index_select, and forces XPU synchronization when the target buffer is not on the XPU.

The class can be constructed directly or via from_metadata, which extracts the shape parameters from LMCacheMetadata. When use_gpu is True, an intermediate buffer is created on the device for chunk-sized transfers.
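The slot-mapped copy behind to_gpu and from_gpu can be sketched with plain PyTorch on CPU tensors. This is a minimal illustration, not the connector's code: it assumes the host tensor uses a KV_2LTD-style layout of [2, num_layers, num_tokens, hidden_dim] and that each layer's paged cache can be viewed as [2, num_slots, hidden_dim]. The function names are hypothetical. The KV_MLA_FMT path would differ mainly in having a single latent plane per token instead of separate K and V planes, which is not shown here.

```python
import torch
from typing import List


def scatter_to_cache(memory_tensor: torch.Tensor, kvcaches: List[torch.Tensor],
                     slot_mapping: torch.Tensor, start: int, end: int) -> None:
    """Hypothetical sketch of the to_gpu direction (host buffer -> paged cache).

    Assumes memory_tensor is [2, num_layers, end - start, hidden_dim] and each
    kvcaches[i] is a [2, num_slots, hidden_dim] view of one layer's cache.
    """
    slots = slot_mapping[start:end]  # int64 slot ids for this token range
    for layer_id, layer_cache in enumerate(kvcaches):
        # Token t of the source lands at slot slots[t] along dim 1,
        # for both the K plane (index 0) and the V plane (index 1).
        layer_cache.index_copy_(1, slots, memory_tensor[:, layer_id])


def gather_from_cache(memory_tensor: torch.Tensor, kvcaches: List[torch.Tensor],
                      slot_mapping: torch.Tensor, start: int, end: int) -> None:
    """Hypothetical sketch of the from_gpu direction (paged cache -> host buffer)."""
    slots = slot_mapping[start:end]
    for layer_id, layer_cache in enumerate(kvcaches):
        # index_select returns a new [2, end - start, hidden_dim] tensor, which
        # is then copied in place into this layer's slice of the host buffer.
        memory_tensor[:, layer_id] = layer_cache.index_select(1, slots)


if __name__ == "__main__":
    num_layers, num_slots, hidden_dim = 2, 8, 4
    caches = [torch.zeros(2, num_slots, hidden_dim) for _ in range(num_layers)]
    slot_map = torch.tensor([5, 2, 7, 0])  # token t is stored in slot slot_map[t]
    src = torch.randn(2, num_layers, 4, hidden_dim)
    scatter_to_cache(src, caches, slot_map, 0, 4)
    dst = torch.empty_like(src)
    gather_from_cache(dst, caches, slot_map, 0, 4)
    print(torch.equal(src, dst))  # expected: True (lossless round trip)
```

Slots that the mapping never names are left untouched, which is what lets many sequences share one paged cache.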

Usage

Use this connector when running LMCache with Intel XPU devices and vLLM's paged KV cache layout. Instantiate via from_metadata for automatic configuration or directly with explicit dimensions. Call to_gpu during cache loading and from_gpu during cache saving, passing the vLLM kvcaches and slot_mapping as keyword arguments.
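Because to_gpu and from_gpu each take a start/end range into slot_mapping, a caller handling a long sequence typically walks it chunk by chunk. The sketch below computes those ranges, assuming end is exclusive (as in start=0, end=chunk_size in the usage example) and a fixed chunk size; chunk_ranges is a hypothetical helper, not part of LMCache.

```python
from typing import Iterator, Tuple


def chunk_ranges(num_tokens: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) token ranges covering [0, num_tokens).

    Hypothetical helper: the last range may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_tokens, chunk_size):
        yield start, min(start + chunk_size, num_tokens)


if __name__ == "__main__":
    # Each (start, end) pair would be passed to to_gpu/from_gpu together with
    # one MemoryObj holding end - start tokens.
    print(list(chunk_ranges(600, 256)))  # [(0, 256), (256, 512), (512, 600)]
```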

Code Reference

Source Location

Signature

class VLLMPagedMemXPUConnectorV2(VLLMPagedMemGPUConnectorV2):
    def __init__(self, hidden_dim_size: int, num_layers: int,
                 use_gpu: bool = False, **kwargs) -> None: ...
    @classmethod
    def from_metadata(cls, metadata: LMCacheMetadata,
                      use_gpu: bool = False,
                      device: Optional[torch.device] = None) -> "VLLMPagedMemXPUConnectorV2": ...
    def to_gpu(self, memory_obj: MemoryObj, start: int, end: int, **kwargs) -> None: ...
    def from_gpu(self, memory_obj: MemoryObj, start: int, end: int, **kwargs) -> None: ...
    def batched_to_gpu(self, memory_objs, starts, ends, **kwargs) -> None: ...
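from_metadata derives the constructor arguments from LMCacheMetadata. The sketch below shows one plausible derivation. It assumes the metadata exposes a kv_shape tuple ordered (num_layers, 2, chunk_size, num_kv_heads, head_size); that field order, the SketchMetadata class, and connector_kwargs_from_metadata are all assumptions for illustration, so check the actual source before relying on them. Only hidden_dim_size = num_kv_heads * head_size comes from this page's I/O contract.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class SketchMetadata:
    """Hypothetical stand-in for LMCacheMetadata with only the field used here."""
    # Assumed order: (num_layers, 2, chunk_size, num_kv_heads, head_size)
    kv_shape: Tuple[int, int, int, int, int]


def connector_kwargs_from_metadata(metadata: SketchMetadata) -> Dict[str, int]:
    """Derive constructor arguments the way from_metadata plausibly does."""
    num_layers, _kv, chunk_size, num_kv_heads, head_size = metadata.kv_shape
    return {
        "hidden_dim_size": num_kv_heads * head_size,  # per the I/O contract
        "num_layers": num_layers,
        "chunk_size": chunk_size,  # assumed to size the optional device buffer
    }


if __name__ == "__main__":
    meta = SketchMetadata(kv_shape=(32, 2, 256, 8, 128))
    print(connector_kwargs_from_metadata(meta))
    # {'hidden_dim_size': 1024, 'num_layers': 32, 'chunk_size': 256}
```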

Import

from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2

I/O Contract

Inputs

Name | Type | Required | Description
hidden_dim_size | int | Yes | Product of num_kv_heads and head_size
num_layers | int | Yes | Number of transformer layers
use_gpu | bool | No | Whether to create a GPU intermediate buffer (default: False)
memory_obj | MemoryObj | Yes | Host memory object holding the KV data (its tensor must not be None)
start | int | Yes | Start index into slot_mapping for the token range
end | int | Yes | End index into slot_mapping for the token range
kvcaches (kwarg) | List[torch.Tensor] | Yes | vLLM paged KV cache tensors on the device
slot_mapping (kwarg) | torch.Tensor | Yes | Full slot mapping tensor for the token sequence
metadata | LMCacheMetadata | Yes | Model metadata for the from_metadata factory method
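The Inputs table implies a few preconditions on each to_gpu/from_gpu call. A hypothetical pre-flight check, written against the table rather than the LMCache source, might look like the following. The function name, the rule that kvcaches has one entry per layer, and the choice of ValueError are assumptions.

```python
from typing import Optional, Sequence


def check_transfer_args(tensor: Optional[object], kvcaches: Sequence[object],
                        slot_mapping_len: int, num_layers: int,
                        start: int, end: int) -> int:
    """Hypothetical pre-flight check for a to_gpu/from_gpu call.

    Returns the number of tokens in the [start, end) range.
    """
    if tensor is None:
        raise ValueError("memory_obj.tensor must not be None")
    if len(kvcaches) != num_layers:
        raise ValueError(
            f"expected {num_layers} KV cache tensors, got {len(kvcaches)}")
    if not 0 <= start <= end <= slot_mapping_len:
        raise ValueError(
            f"range [{start}, {end}) is outside slot_mapping of length "
            f"{slot_mapping_len}")
    return end - start


if __name__ == "__main__":
    print(check_transfer_args(object(), [None, None], 10, 2, 0, 4))  # 4
```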

Outputs

Name | Type | Description
(none, to_gpu) | None | Data is written in place into the kvcaches tensors
(none, from_gpu) | None | Data is written in place into memory_obj.tensor; the memory object's metadata format (fmt) may be updated
connector | VLLMPagedMemXPUConnectorV2 | New instance returned by the from_metadata factory
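Because from_gpu writes into memory_obj.tensor, the host must not read that buffer until the device work producing it has finished; device-to-host copies may be issued asynchronously, which is why the connector forces XPU synchronization when the target is not on the XPU. A minimal sketch of that guard follows. It assumes a PyTorch build that may or may not include the torch.xpu backend (hence the hasattr check), and the helper names are hypothetical.

```python
import torch


def needs_host_sync(target: torch.Tensor) -> bool:
    """True when results land in a buffer outside the XPU (e.g. host memory)."""
    return target.device.type != "xpu"


def sync_if_needed(target: torch.Tensor) -> bool:
    """Block until queued XPU work finishes if the target is not on the XPU.

    Returns True only if a synchronization was actually issued.
    """
    xpu_ok = hasattr(torch, "xpu") and torch.xpu.is_available()
    if needs_host_sync(target) and xpu_ok:
        torch.xpu.synchronize()  # wait for pending work on the current XPU
        return True
    return False


if __name__ == "__main__":
    host_buf = torch.empty(4)
    print(needs_host_sync(host_buf))  # True: a CPU tensor needs the guard
    sync_if_needed(host_buf)  # no-op on machines without an XPU
```

Skipping the synchronization when the target is already on the XPU is safe because later XPU operations on the same queue observe the copy in order.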

Usage Examples

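import torch

from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2
from lmcache.v1.metadata import LMCacheMetadata

# Assumed to be provided by the surrounding vLLM/LMCache integration:
#   lmcache_metadata: LMCacheMetadata for the served model
#   host_mem_obj:     MemoryObj holding (or receiving) one chunk of KV data
#   vllm_kv_caches:   List[torch.Tensor] of paged KV caches on the XPU
#   slot_map:         torch.Tensor slot mapping for the token sequence
#   chunk_size:       number of tokens in this chunk

# Create from metadata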
connector = VLLMPagedMemXPUConnectorV2.from_metadata(
    metadata=lmcache_metadata,
    use_gpu=True,
    device=torch.device("xpu:0"),
)

# Load KV cache from host to device
connector.to_gpu(
    memory_obj=host_mem_obj,
    start=0,
    end=chunk_size,
    kvcaches=vllm_kv_caches,
    slot_mapping=slot_map,
)

# Save KV cache from device to host
connector.from_gpu(
    memory_obj=host_mem_obj,
    start=0,
    end=chunk_size,
    kvcaches=vllm_kv_caches,
    slot_mapping=slot_map,
)
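The example above assumes slot_map already exists. In vLLM's paged layout, a token's slot index is conventionally its physical block number times the block size plus its offset within that block. The sketch below builds such a mapping from a per-sequence block table under that convention; build_slot_mapping and its inputs are illustrative, not part of LMCache or vLLM.

```python
from typing import List, Sequence


def build_slot_mapping(block_table: Sequence[int], num_tokens: int,
                       block_size: int) -> List[int]:
    """Map each token position to a flat slot index in the paged KV cache.

    Token at position p lives in physical block block_table[p // block_size]
    at offset p % block_size, so its slot is block * block_size + offset.
    """
    if num_tokens > len(block_table) * block_size:
        raise ValueError("block_table has too few blocks for num_tokens")
    return [block_table[p // block_size] * block_size + p % block_size
            for p in range(num_tokens)]


if __name__ == "__main__":
    # Blocks of 4 slots; the sequence occupies physical blocks 7, 2, 9.
    print(build_slot_mapping([7, 2, 9], num_tokens=10, block_size=4))
    # [28, 29, 30, 31, 8, 9, 10, 11, 36, 37]
```

Before passing the result as slot_mapping, it would be wrapped as an int64 tensor, for example torch.tensor(slots, dtype=torch.long, device="xpu:0").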
