Implementation:LMCache LMCache SGLang Adapter

From Leeroopedia


Knowledge Sources
Domains: SGLang Integration, KV Cache Management
Last Updated: 2026-02-09 00:00 GMT

Overview

This module provides adapter classes that integrate the LMCache engine with the SGLang inference framework for KV cache loading, storing, and layer-wise retrieval.

Description

The sglang_adapter.py module defines the connector classes (LMCacheConnector and LMCacheLayerwiseConnector) that bridge LMCache's caching engine with SGLang's runtime. The module initializes the LMCache engine using SGLang's model configuration, constructs the appropriate KV shape metadata, and provides worker-side APIs for loading and storing KV cache data. The layerwise variant supports incremental per-layer retrieval and storage with tensor-parallel synchronization.
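
As a rough illustration of the "KV shape metadata" step, the sketch below derives a per-rank shape tuple from a few model-configuration values. The ModelShape dataclass, its field names, the kv_shape_for_rank helper, and the (layers, 2, chunk, heads, head_dim) layout are all assumptions made for illustration; the fields that sglang_adapter.py actually reads from SGLang's ModelConfig, and the layout it hands to LMCache, may differ.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ModelShape:
    # Hypothetical stand-in for values an adapter would read from a model config.
    num_layers: int
    num_kv_heads: int  # total KV heads across all tensor-parallel ranks
    head_dim: int


def kv_shape_for_rank(
    shape: ModelShape, tp_size: int, chunk_size: int
) -> Tuple[int, int, int, int, int]:
    """Return an assumed (layers, 2, chunk_tokens, kv_heads_per_rank, head_dim) tuple."""
    if tp_size <= 0:
        raise ValueError("tp_size must be positive")
    # Each rank holds a slice of the KV heads; keep at least one head per rank.
    heads_per_rank = max(1, shape.num_kv_heads // tp_size)
    # The "2" covers the key tensor and the value tensor.
    return (shape.num_layers, 2, chunk_size, heads_per_rank, shape.head_dim)


example = kv_shape_for_rank(
    ModelShape(num_layers=32, num_kv_heads=8, head_dim=128),
    tp_size=2,
    chunk_size=256,
)
print(example)
```

The point of the sketch is only that the shape depends on both the model configuration and the tensor-parallel size, which is why the connector constructors take tp_size alongside sgl_config.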

Usage

Import and instantiate these connectors within an SGLang worker process to enable KV cache sharing through LMCache. LMCacheConnector provides bulk loading and storing through load_kv and store_kv. LMCacheLayerwiseConnector, a subclass, adds start_load_kv and load_kv_layerwise for per-layer streaming retrieval in pipelined execution.

Code Reference

Source Location

Signature

@dataclass
class StoreMetadata:
    last_node: Any
    token_ids: List[int]
    kv_indices: torch.Tensor
    offset: int

@dataclass
class LoadMetadata:
    token_ids: List[int]
    slot_mapping: torch.Tensor
    offset: int

def init_lmcache_engine(
    model_config: ModelConfig, tp_size: int, local_rank: int,
    global_rank: int, kv_dtype: torch.dtype,
) -> LMCacheEngine: ...

class LMCacheConnector:
    def __init__(
        self, sgl_config: ModelConfig, tp_size: int, rank: int,
        k_pool: List[torch.Tensor], v_pool: List[torch.Tensor],
    ): ...
    def load_kv(self, load_metadata: LoadMetadata) -> int: ...
    def store_kv(self, store_metadata: StoreMetadata) -> None: ...
    def get_kv_events(self) -> Iterable[CacheStoreEvent]: ...
    def chunk_size(self): ...
    def reset(self): ...
    def close(self): ...

class LMCacheLayerwiseConnector(LMCacheConnector):
    def __init__(
        self, sgl_config: ModelConfig, tp_size: int, rank: int,
        k_pool: List[torch.Tensor], v_pool: List[torch.Tensor],
        tp_group: Optional[torch.distributed.ProcessGroup] = None,
    ): ...
    def global_min_tokens(
        self, local_tokens: int, tp_group: dist.ProcessGroup,
        device: torch.device,
    ): ...
    def load_kv_layerwise(self, layer_id: int) -> None: ...
    def start_load_kv(self, load_metadata: LoadMetadata) -> int: ...
    def store_kv(self, store_metadata: StoreMetadata) -> None: ...
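
The global_min_tokens signature suggests that tensor-parallel ranks agree on a common token count before layer-wise loading proceeds. The sketch below simulates that agreement in plain Python by taking the minimum over per-rank counts; in a real multi-process run the same reduction would typically be a torch.distributed.all_reduce with ReduceOp.MIN over tp_group. The helper name simulate_global_min_tokens and the per-rank numbers are hypothetical, and whether the adapter uses exactly this reduction is an assumption.

```python
from typing import Sequence


def simulate_global_min_tokens(per_rank_tokens: Sequence[int]) -> int:
    """Return the token count every rank can safely use: the minimum across ranks."""
    if not per_rank_tokens:
        raise ValueError("need at least one rank")
    return min(per_rank_tokens)


# Hypothetical cache-hit counts reported by four tensor-parallel ranks.
local_hits = [512, 512, 256, 512]
agreed = simulate_global_min_tokens(local_hits)
print(agreed)  # 256
```

Taking the minimum keeps the ranks in step: if one rank found only 256 cached tokens, the others cannot reuse more than that without their KV state diverging.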

Import

from lmcache.integration.sglang.sglang_adapter import (
    LMCacheConnector,
    LMCacheLayerwiseConnector,
    StoreMetadata,
    LoadMetadata,
    init_lmcache_engine,
)

I/O Contract

Inputs

Name | Type | Required | Description
sgl_config | ModelConfig | Yes | SGLang model configuration containing layer count, head dimensions, etc.
tp_size | int | Yes | Tensor parallel size
rank | int | Yes | Global tensor parallel rank
k_pool | List[torch.Tensor] | Yes | Key cache tensor pool from SGLang
v_pool | List[torch.Tensor] | Yes | Value cache tensor pool from SGLang
tp_group | ProcessGroup | No | Torch distributed process group for tensor parallel synchronization (layerwise only)

Outputs

Name | Type | Description
num_retrieved_tokens | int | Number of tokens successfully retrieved from cache, returned by load_kv and start_load_kv
get_kv_events() | Iterable[CacheStoreEvent] | Events generated during KV cache store operations
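
One way a caller might use the returned count is to split a request into a prefix served from cache and a suffix that still needs a forward pass. split_by_cache_hit is a hypothetical helper, not part of the adapter, and it assumes the returned count refers to a contiguous prefix of token_ids.

```python
from typing import List, Tuple


def split_by_cache_hit(
    token_ids: List[int], num_retrieved_tokens: int
) -> Tuple[List[int], List[int]]:
    """Split tokens into (served from cache, still to compute)."""
    # Clamp so an unexpected count can never index past the request.
    hit = max(0, min(num_retrieved_tokens, len(token_ids)))
    return token_ids[:hit], token_ids[hit:]


cached, remaining = split_by_cache_hit([1, 2, 3, 4, 5], num_retrieved_tokens=3)
print(cached, remaining)  # [1, 2, 3] [4, 5]
```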

Usage Examples

from lmcache.integration.sglang.sglang_adapter import (
    LMCacheConnector, LoadMetadata, StoreMetadata,
)

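# model_config, k_pool_tensors, v_pool_tensors, slot_tensor, and index_tensor
# are placeholders for objects that the SGLang worker is expected to supply.
# Initialize connector with SGLang model config and KV pools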
connector = LMCacheConnector(
    sgl_config=model_config,
    tp_size=1,
    rank=0,
    k_pool=k_pool_tensors,
    v_pool=v_pool_tensors,
)

# Load KV cache for a request
load_meta = LoadMetadata(
    token_ids=[1, 2, 3, 4, 5],
    slot_mapping=slot_tensor,
    offset=0,
)
num_loaded = connector.load_kv(load_meta)

# Store KV cache after forward pass
store_meta = StoreMetadata(
    last_node=None,
    token_ids=[1, 2, 3, 4, 5],
    kv_indices=index_tensor,
    offset=0,
)
connector.store_kv(store_meta)
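
For the layer-wise variant, the method names in the Signature section suggest a flow of one start_load_kv call followed by one load_kv_layerwise call per layer. The sketch below exercises that call order against a stub. StubLayerwiseConnector and run_layerwise_load are hypothetical, no LMCache or SGLang code is executed, and the real connector's ordering requirements may differ.

```python
from typing import List


class StubLayerwiseConnector:
    """Hypothetical stand-in mirroring two method names from the Signature section."""

    def __init__(self, cached_tokens: int) -> None:
        self.cached_tokens = cached_tokens
        self.calls: List[str] = []

    def start_load_kv(self, token_ids: List[int]) -> int:
        # The real method takes a LoadMetadata; the stub only needs the tokens.
        self.calls.append("start_load_kv")
        return min(self.cached_tokens, len(token_ids))

    def load_kv_layerwise(self, layer_id: int) -> None:
        self.calls.append(f"load_kv_layerwise({layer_id})")


def run_layerwise_load(
    connector: StubLayerwiseConnector, token_ids: List[int], num_layers: int
) -> int:
    """Start retrieval once, then pull each layer's KV before that layer would run."""
    num_retrieved = connector.start_load_kv(token_ids)
    if num_retrieved > 0:
        for layer_id in range(num_layers):
            connector.load_kv_layerwise(layer_id)
            # A real worker would run this layer's attention here.
    return num_retrieved


stub = StubLayerwiseConnector(cached_tokens=4)
retrieved = run_layerwise_load(stub, [1, 2, 3, 4, 5], num_layers=2)
print(retrieved, stub.calls)
```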
