Implementation:LMCache LMCache XPU Connector

Knowledge Sources	LMCache
Domains	GPU Connector, KV Cache Transfer
Last Updated	2026-02-09 00:00 GMT

Overview

Implements the XPU (Intel GPU) variant of the vLLM paged-memory GPU connector for transferring KV cache data between host memory objects and XPU device KV caches.

Description

VLLMPagedMemXPUConnectorV2 extends VLLMPagedMemGPUConnectorV2 to support KV cache transfers on Intel XPU devices. It handles both standard MHA (Multi-Head Attention) KV caches in KV_2LTD format and MLA (Multi-head Latent Attention) caches in KV_MLA_FMT format. The to_gpu method copies data from a host MemoryObj into the device-resident paged KV caches using slot-mapped indexing (index_copy_). The from_gpu method gathers data from the device KV caches into a host MemoryObj using index_select, and forces an XPU synchronization when the target buffer is not on XPU. The class can be constructed directly or via the from_metadata class method, which extracts the shape parameters from LMCacheMetadata. When use_gpu is set, a chunk-sized intermediate buffer can be created on the device for transfers.
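The slot-mapped copy described above can be illustrated with plain PyTorch tensors on CPU. This is a minimal sketch, not the connector's actual code: the flattened [num_slots, hidden_dim] cache layout, the tensor names, and the sync_if_leaving_xpu helper are assumptions made for illustration only.

```python
import torch

num_slots, hidden_dim, num_tokens = 16, 4, 3

# Stand-in for one layer of a paged KV cache, flattened to [num_slots, hidden_dim].
paged_cache = torch.zeros(num_slots, hidden_dim)

# Stand-in for the chunk held by a host MemoryObj: one row per token.
host_chunk = torch.arange(
    num_tokens * hidden_dim, dtype=torch.float32
).reshape(num_tokens, hidden_dim)

# slot_mapping[i] is the cache slot that token i occupies.
slot_mapping = torch.tensor([5, 2, 9], dtype=torch.long)

# "to_gpu" direction: scatter the host rows into their slots.
paged_cache.index_copy_(0, slot_mapping, host_chunk)

# "from_gpu" direction: gather the same slots back into a contiguous chunk.
restored = torch.index_select(paged_cache, 0, slot_mapping)


def sync_if_leaving_xpu(target: torch.Tensor) -> bool:
    """Hypothetical helper: synchronize XPU only if the target is off-device."""
    xpu = getattr(torch, "xpu", None)  # torch.xpu is absent in older PyTorch
    if target.device.type != "xpu" and xpu is not None and xpu.is_available():
        xpu.synchronize()
        return True
    return False
```

Because both directions use the same slot_mapping, a round trip through the cache returns the original rows in token order, regardless of where the slots sit in the paged cache.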

Usage

Use this connector when running LMCache with Intel XPU devices and vLLM's paged KV cache layout. Instantiate via from_metadata for automatic configuration or directly with explicit dimensions. Call to_gpu during cache loading and from_gpu during cache saving, passing the vLLM kvcaches and slot_mapping as keyword arguments.
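When a token sequence is longer than one chunk, the caller needs a (start, end) pair per chunk to pass to to_gpu or from_gpu. The helper below is a hypothetical sketch of how such ranges might be generated; chunk_ranges is not part of LMCache, and the half-open [start, end) convention is an assumption.

```python
from typing import Iterator, Tuple


def chunk_ranges(num_tokens: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) token ranges covering num_tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_tokens, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, num_tokens)


ranges = list(chunk_ranges(num_tokens=600, chunk_size=256))
```

Each pair could then be passed as start and end, together with the MemoryObj holding that chunk and the full slot_mapping.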

Code Reference

Source Location

Repository: LMCache
File: lmcache/v1/gpu_connector/xpu_connectors.py
Lines: 1-243

Signature

class VLLMPagedMemXPUConnectorV2(VLLMPagedMemGPUConnectorV2):
    def __init__(self, hidden_dim_size: int, num_layers: int,
                 use_gpu: bool = False, **kwargs) -> None: ...
    @classmethod
    def from_metadata(cls, metadata: LMCacheMetadata,
                      use_gpu: bool = False,
                      device: Optional[torch.device] = None) -> "VLLMPagedMemXPUConnectorV2": ...
    def to_gpu(self, memory_obj: MemoryObj, start: int, end: int, **kwargs) -> None: ...
    def from_gpu(self, memory_obj: MemoryObj, start: int, end: int, **kwargs) -> None: ...
    def batched_to_gpu(self, memory_objs, starts, ends, **kwargs) -> None: ...

Import

from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2

I/O Contract

Inputs

Name	Type	Required	Description
hidden_dim_size	int	Yes	Product of num_kv_heads and head_size
num_layers	int	Yes	Number of transformer layers
use_gpu	bool	No	Whether to create a GPU intermediate buffer (default: False)
memory_obj	MemoryObj	Yes	Host memory object with KV data (tensor must not be None)
start	int	Yes	Start index into the slot_mapping for the token range
end	int	Yes	End index into the slot_mapping for the token range
kvcaches (kwarg)	List[torch.Tensor]	Yes	vLLM paged KV cache tensors on device
slot_mapping (kwarg)	torch.Tensor	Yes	Full slot mapping tensor for the token sequence
metadata	LMCacheMetadata	Yes	Model metadata for from_metadata factory method
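
The arithmetic behind hidden_dim_size, start, end, and slot_mapping can be sketched in a few lines. The concrete numbers, the block_size value, the half-open slicing, and the slot = block * block_size + offset decomposition (the usual paged-cache convention) are assumptions for illustration, not values taken from the source file.

```python
from typing import List, Tuple


def hidden_dim(num_kv_heads: int, head_size: int) -> int:
    """hidden_dim_size is the product of num_kv_heads and head_size."""
    return num_kv_heads * head_size


def slot_to_block(slot: int, block_size: int) -> Tuple[int, int]:
    """Split a flat slot index into (block number, offset within block)."""
    return divmod(slot, block_size)


# Full slot mapping for a six-token sequence (illustrative values).
slot_mapping: List[int] = [32, 33, 34, 35, 80, 81]

# start/end select the tokens of one transfer; end is treated as exclusive here.
start, end = 2, 5
chunk_slots = slot_mapping[start:end]

locations = [slot_to_block(s, block_size=16) for s in chunk_slots]
```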

Outputs

Name	Type	Description
(none - to_gpu)	None	Data is written in-place into the kvcaches tensors
(none - from_gpu)	None	Data is written in-place into the memory_obj.tensor; metadata fmt may be updated
connector	VLLMPagedMemXPUConnectorV2	New instance from from_metadata factory

Usage Examples

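import torch

from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2
from lmcache.v1.metadata import LMCacheMetadata

# lmcache_metadata (an LMCacheMetadata), host_mem_obj, vllm_kv_caches, slot_map,
# and chunk_size are assumed to come from the surrounding LMCache/vLLM setup.

# Create from metadata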
connector = VLLMPagedMemXPUConnectorV2.from_metadata(
    metadata=lmcache_metadata,
    use_gpu=True,
    device=torch.device("xpu:0"),
)

# Load KV cache from host to device
connector.to_gpu(
    memory_obj=host_mem_obj,
    start=0,
    end=chunk_size,
    kvcaches=vllm_kv_caches,
    slot_mapping=slot_map,
)

# Save KV cache from device to host
connector.from_gpu(
    memory_obj=host_mem_obj,
    start=0,
    end=chunk_size,
    kvcaches=vllm_kv_caches,
    slot_mapping=slot_map,
)
