Implementation:LMCache LMCache SGLang Adapter

Knowledge Sources	LMCache
Domains	SGLang Integration, KV Cache Management
Last Updated	2026-02-09 00:00 GMT

Overview

This module provides adapter classes that integrate the LMCache engine with the SGLang inference framework for KV cache loading, storing, and layer-wise retrieval.

Description

The sglang_adapter.py module defines the connector classes (LMCacheConnector and LMCacheLayerwiseConnector) that bridge LMCache's caching engine with SGLang's runtime. The module initializes the LMCache engine using SGLang's model configuration, constructs the appropriate KV shape metadata, and provides worker-side APIs for loading and storing KV cache data. The layerwise variant supports incremental per-layer retrieval and storage with tensor-parallel synchronization.
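As a rough illustration of what "KV shape metadata" might look like, the sketch below derives a per-rank chunk shape from a few model parameters. The tuple layout (layers, K/V pair, chunk tokens, heads per rank, head dimension), the helper name kv_shape_for_rank, and the numeric values are assumptions made for this example; they are not taken from sglang_adapter.py.

```python
from typing import Tuple


def kv_shape_for_rank(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    tp_size: int,
    chunk_size: int,
) -> Tuple[int, int, int, int, int]:
    """Hypothetical helper: shape of one KV chunk held by a single TP rank.

    Assumed layout: (layers, 2 for K and V, tokens per chunk,
    KV heads on this rank, head dimension).
    """
    if tp_size < 1:
        raise ValueError("tp_size must be >= 1")
    # KV heads are split across tensor-parallel ranks; keep at least one
    # head per rank when there are fewer heads than ranks.
    heads_per_rank = max(1, num_kv_heads // tp_size)
    return (num_layers, 2, chunk_size, heads_per_rank, head_dim)


# Illustrative values only: 32 layers, 8 KV heads, head_dim 128, TP=2.
shape = kv_shape_for_rank(
    num_layers=32, num_kv_heads=8, head_dim=128, tp_size=2, chunk_size=256
)
print(shape)
```

The point of the sketch is that the shape depends on both the model configuration and the tensor-parallel size, which is why the connector takes both as constructor arguments.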

Usage

Import and instantiate these connectors within an SGLang worker process to enable KV cache sharing through LMCache. The LMCacheConnector provides bulk load/store, while LMCacheLayerwiseConnector supports per-layer streaming retrieval for pipelined execution.

Code Reference

Source Location

Repository: LMCache
File: lmcache/integration/sglang/sglang_adapter.py
Lines: 1-325

Signature

@dataclass
class StoreMetadata:
    last_node: Any
    token_ids: List[int]
    kv_indices: torch.Tensor
    offset: int

@dataclass
class LoadMetadata:
    token_ids: List[int]
    slot_mapping: torch.Tensor
    offset: int

def init_lmcache_engine(
    model_config: ModelConfig, tp_size: int, local_rank: int,
    global_rank: int, kv_dtype: torch.dtype,
) -> LMCacheEngine: ...

class LMCacheConnector:
    def __init__(
        self, sgl_config: ModelConfig, tp_size: int, rank: int,
        k_pool: List[torch.Tensor], v_pool: List[torch.Tensor],
    ): ...
    def load_kv(self, load_metadata: LoadMetadata) -> int: ...
    def store_kv(self, store_metadata: StoreMetadata) -> None: ...
    def get_kv_events(self) -> Iterable[CacheStoreEvent]: ...
    def chunk_size(self): ...
    def reset(self): ...
    def close(self): ...

class LMCacheLayerwiseConnector(LMCacheConnector):
    def __init__(
        self, sgl_config: ModelConfig, tp_size: int, rank: int,
        k_pool: List[torch.Tensor], v_pool: List[torch.Tensor],
        tp_group: Optional[torch.distributed.ProcessGroup] = None,
    ): ...
    def global_min_tokens(
        self, local_tokens: int, tp_group: dist.ProcessGroup,
        device: torch.device,
    ): ...
    def load_kv_layerwise(self, layer_id: int) -> None: ...
    def start_load_kv(self, load_metadata: LoadMetadata) -> int: ...
    def store_kv(self, store_metadata: StoreMetadata) -> None: ...
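The name global_min_tokens suggests that tensor-parallel ranks agree on the smallest number of tokens any rank could retrieve, so that all ranks load the same prefix. The following is a minimal pure-Python simulation of that idea under this assumption; the function simulate_global_min_tokens is hypothetical and does not use torch.distributed. In a real process group the same result would typically come from an all-reduce with a MIN reduction over a one-element tensor.

```python
from typing import List


def simulate_global_min_tokens(per_rank_tokens: List[int]) -> int:
    """Hypothetical simulation of a MIN all-reduce across TP ranks.

    per_rank_tokens[i] is the number of tokens rank i could retrieve
    locally. Every rank ends up with the same value: the minimum.
    """
    if not per_rank_tokens:
        raise ValueError("need at least one rank")
    if any(n < 0 for n in per_rank_tokens):
        raise ValueError("token counts must be non-negative")
    return min(per_rank_tokens)


# Four ranks with different local hit counts (illustrative numbers).
local_hits = [512, 256, 512, 768]
agreed = simulate_global_min_tokens(local_hits)
print(agreed)
```

Taking the minimum is the conservative choice: a rank that found more cached tokens than its peers discards the excess rather than letting ranks diverge on how much of the prompt is already cached.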

Import

from lmcache.integration.sglang.sglang_adapter import (
    LMCacheConnector,
    LMCacheLayerwiseConnector,
    StoreMetadata,
    LoadMetadata,
    init_lmcache_engine,
)

I/O Contract

Inputs

Name	Type	Required	Description
sgl_config	ModelConfig	Yes	SGLang model configuration containing layer count, head dimensions, etc.
tp_size	int	Yes	Tensor parallel size
rank	int	Yes	Global tensor parallel rank
k_pool	List[torch.Tensor]	Yes	Key cache tensor pool from SGLang
v_pool	List[torch.Tensor]	Yes	Value cache tensor pool from SGLang
tp_group	ProcessGroup	No	Torch distributed process group for tensor parallel synchronization (layerwise only)

Outputs

Name	Type	Description
num_retrieved_tokens	int	Number of tokens successfully retrieved from cache, returned by load_kv and start_load_kv
kv_events	Iterable[CacheStoreEvent]	Events generated during KV cache store operations, returned by get_kv_events
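One way a caller might use the returned token count is to split a request into the part served from cache and the part that still needs a forward pass. The sketch below assumes the count refers to a leading prefix of token_ids, which this page does not state explicitly; split_cached_prefix is a hypothetical helper, not part of the adapter.

```python
from typing import List, Tuple


def split_cached_prefix(
    token_ids: List[int], num_retrieved_tokens: int
) -> Tuple[List[int], List[int]]:
    """Hypothetical helper: split tokens into (cached prefix, remainder).

    Assumes num_retrieved_tokens counts tokens from the start of token_ids.
    """
    if not 0 <= num_retrieved_tokens <= len(token_ids):
        raise ValueError("num_retrieved_tokens out of range")
    cached = token_ids[:num_retrieved_tokens]
    to_compute = token_ids[num_retrieved_tokens:]
    return cached, to_compute


# Illustrative request of five tokens with three served from cache.
cached, to_compute = split_cached_prefix([1, 2, 3, 4, 5], 3)
print(cached, to_compute)
```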

Usage Examples

from lmcache.integration.sglang.sglang_adapter import (
    LMCacheConnector, LoadMetadata, StoreMetadata,
)

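# Initialize connector with SGLang model config and KV pools.
# model_config, k_pool_tensors, v_pool_tensors, slot_tensor, and index_tensor
# are placeholders for objects supplied by the SGLang runtime.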
connector = LMCacheConnector(
    sgl_config=model_config,
    tp_size=1,
    rank=0,
    k_pool=k_pool_tensors,
    v_pool=v_pool_tensors,
)

# Load KV cache for a request
load_meta = LoadMetadata(
    token_ids=[1, 2, 3, 4, 5],
    slot_mapping=slot_tensor,
    offset=0,
)
num_loaded = connector.load_kv(load_meta)

# Store KV cache after forward pass
store_meta = StoreMetadata(
    last_node=None,
    token_ids=[1, 2, 3, 4, 5],
    kv_indices=index_tensor,
    offset=0,
)
connector.store_kv(store_meta)
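For the layerwise variant, the method names in the signature suggest a two-phase flow: start_load_kv once per request, then load_kv_layerwise once per layer. The sketch below illustrates that assumed call order with a stand-in class that only records calls. FakeLayerwiseConnector, FakeLoadMetadata, and run_forward are hypothetical and exist only for this example; the real connector needs an SGLang ModelConfig, KV pools, and an LMCache engine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FakeLoadMetadata:
    # Hypothetical stand-in for LoadMetadata; slot_mapping is a plain list.
    token_ids: List[int]
    slot_mapping: List[int]
    offset: int


@dataclass
class FakeLayerwiseConnector:
    """Hypothetical stand-in mirroring two LMCacheLayerwiseConnector methods."""

    cached_tokens: int
    calls: List[str] = field(default_factory=list)

    def start_load_kv(self, load_metadata: FakeLoadMetadata) -> int:
        self.calls.append("start_load_kv")
        # Pretend the first `cached_tokens` tokens are available in cache.
        return min(self.cached_tokens, len(load_metadata.token_ids))

    def load_kv_layerwise(self, layer_id: int) -> None:
        self.calls.append(f"load_kv_layerwise({layer_id})")


def run_forward(connector, meta, num_layers: int) -> int:
    """Assumed call order: start once, then fetch each layer before using it."""
    num_hit = connector.start_load_kv(meta)
    for layer_id in range(num_layers):
        connector.load_kv_layerwise(layer_id)
        # The attention computation for this layer would run here.
    return num_hit


fake = FakeLayerwiseConnector(cached_tokens=3)
meta = FakeLoadMetadata(token_ids=[1, 2, 3, 4, 5], slot_mapping=[0, 1, 2, 3, 4], offset=0)
hits = run_forward(fake, meta, num_layers=2)
print(hits, fake.calls)
```

Interleaving the per-layer load with the layer's computation is what lets retrieval of later layers overlap with execution of earlier ones.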
