Implementation:LMCache LMCache XPU Connector
| Knowledge Sources | |
|---|---|
| Domains | GPU Connector, KV Cache Transfer |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Implements the XPU (Intel GPU) variant of the vLLM paged-memory GPU connector for transferring KV cache data between host memory objects and XPU device KV caches.
Description
VLLMPagedMemXPUConnectorV2 extends VLLMPagedMemGPUConnectorV2 to support KV cache transfers on Intel XPU devices. It handles both standard MHA (multi-head attention) KV caches in KV_2LTD format and MLA (multi-head latent attention) caches in KV_MLA_FMT format. The to_gpu method copies data from a host MemoryObj into the device-resident paged KV caches, using the slot mapping as the index for index_copy_. The from_gpu method does the reverse: it gathers data from the device KV caches into a host MemoryObj with index_select, and synchronizes the XPU when the target buffer does not reside on an XPU. The class can be constructed directly or through the from_metadata factory, which extracts the shape parameters from an LMCacheMetadata instance. When use_gpu is set, the connector also creates an intermediate buffer on the device for chunk-sized transfers.
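The slot-mapped copy described above can be illustrated with plain PyTorch on CPU tensors. This is a minimal sketch rather than the connector's code: the flat one-row-per-slot cache layout, the tensor names, and the sizes are assumptions made for illustration, and only the use of index_copy_ and index_select comes from the description.

```python
import torch

# Assumed toy layout: one row per slot, flattened across all blocks.
num_slots, hidden = 8, 4
kv_cache = torch.zeros(num_slots, hidden)

# slot_mapping[i] is the cache slot that holds token i of the sequence.
slot_mapping = torch.tensor([5, 2, 7, 0])

# The host chunk covers tokens [start, end) of the sequence (assumed half-open).
start, end = 1, 3
host_chunk = torch.arange(1.0, 1.0 + (end - start) * hidden).reshape(end - start, hidden)

# Slots for the tokens in this chunk: here slots 2 and 7.
slots = slot_mapping[start:end]

# Load direction (to_gpu-like): scatter the host rows into their slots, in place.
kv_cache.index_copy_(0, slots, host_chunk)

# Save direction (from_gpu-like): gather the same slots back into a dense chunk.
restored = kv_cache.index_select(0, slots)
```

Because the same slice of the slot mapping drives both directions, a save that follows a load over the same range recovers the original chunk, and slots outside the range are left untouched.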
Usage
Use this connector when running LMCache with Intel XPU devices and vLLM's paged KV cache layout. Instantiate via from_metadata for automatic configuration or directly with explicit dimensions. Call to_gpu during cache loading and from_gpu during cache saving, passing the vLLM kvcaches and slot_mapping as keyword arguments.
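Since to_gpu and from_gpu each operate on one token range, a caller typically needs a list of (start, end) pairs that cover the sequence. The helper below is a hypothetical sketch of that bookkeeping; chunk_ranges is not part of LMCache, and it assumes that end is exclusive, which the source does not state explicitly.

```python
from typing import Iterator, Tuple


def chunk_ranges(num_tokens: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) token ranges of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, num_tokens, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, num_tokens)


# Ten tokens in chunks of four: two full chunks and one partial chunk.
ranges = list(chunk_ranges(num_tokens=10, chunk_size=4))
```

Each pair would be passed as start and end, together with the memory object that holds that chunk and the kvcaches and slot_mapping keyword arguments.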
Code Reference
Source Location
- Repository: LMCache
- File: lmcache/v1/gpu_connector/xpu_connectors.py
- Lines: 1-243
Signature
class VLLMPagedMemXPUConnectorV2(VLLMPagedMemGPUConnectorV2):
    def __init__(self, hidden_dim_size: int, num_layers: int,
                 use_gpu: bool = False, **kwargs) -> None: ...

    @classmethod
    def from_metadata(cls, metadata: LMCacheMetadata,
                      use_gpu: bool = False,
                      device: Optional[torch.device] = None) -> "VLLMPagedMemXPUConnectorV2": ...

    def to_gpu(self, memory_obj: MemoryObj, start: int, end: int, **kwargs) -> None: ...

    def from_gpu(self, memory_obj: MemoryObj, start: int, end: int, **kwargs) -> None: ...

    def batched_to_gpu(self, memory_objs, starts, ends, **kwargs) -> None: ...
Import
from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_dim_size | int | Yes | Product of num_kv_heads and head_size |
| num_layers | int | Yes | Number of transformer layers |
| use_gpu | bool | No | Whether to create a GPU intermediate buffer (default: False) |
| memory_obj | MemoryObj | Yes | Host memory object with KV data (tensor must not be None) |
| start | int | Yes | Start index into the slot_mapping for the token range |
| end | int | Yes | End index into the slot_mapping for the token range |
| kvcaches (kwarg) | List[torch.Tensor] | Yes | vLLM paged KV cache tensors on device |
| slot_mapping (kwarg) | torch.Tensor | Yes | Full slot mapping tensor for the token sequence |
| metadata | LMCacheMetadata | Yes | Model metadata for from_metadata factory method |
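
Two of the inputs above reduce to simple arithmetic. The sketch below computes hidden_dim_size from the head configuration, as the table defines it, and splits a flat slot id into a block index and an offset. The split assumes the usual vLLM convention that a slot equals block_index * block_size + offset; both function names and the example sizes are hypothetical.

```python
from typing import Tuple


def compute_hidden_dim_size(num_kv_heads: int, head_size: int) -> int:
    """hidden_dim_size as defined in the Inputs table."""
    return num_kv_heads * head_size


def split_slot(slot: int, block_size: int) -> Tuple[int, int]:
    """Return (block_index, offset_in_block) for a flat slot id.

    Assumes slot == block_index * block_size + offset_in_block.
    """
    return divmod(slot, block_size)


# Example values chosen for illustration only.
hidden = compute_hidden_dim_size(num_kv_heads=8, head_size=128)
block_index, offset = split_slot(slot=37, block_size=16)
```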
Outputs
| Name | Type | Description |
|---|---|---|
| (none - to_gpu) | None | Data is written in-place into the kvcaches tensors |
| (none - from_gpu) | None | Data is written in-place into the memory_obj.tensor; metadata fmt may be updated |
| connector | VLLMPagedMemXPUConnectorV2 | New instance from from_metadata factory |
Usage Examples
import torch

from lmcache.v1.gpu_connector.xpu_connectors import VLLMPagedMemXPUConnectorV2
from lmcache.v1.metadata import LMCacheMetadata

# lmcache_metadata, host_mem_obj, vllm_kv_caches, slot_map and chunk_size
# are placeholders that the caller is expected to provide.

# Create from metadata
connector = VLLMPagedMemXPUConnectorV2.from_metadata(
    metadata=lmcache_metadata,
    use_gpu=True,
    device=torch.device("xpu:0"),
)

# Load KV cache from host to device
connector.to_gpu(
    memory_obj=host_mem_obj,
    start=0,
    end=chunk_size,
    kvcaches=vllm_kv_caches,
    slot_mapping=slot_map,
)

# Save KV cache from device to host
connector.from_gpu(
    memory_obj=host_mem_obj,
    start=0,
    end=chunk_size,
    kvcaches=vllm_kv_caches,
    slot_mapping=slot_map,
)