
Implementation:Intel Ipex llm StartRecentKVCache

From Leeroopedia


Knowledge Sources
Domains KV_Cache, Memory_Management, Inference_Optimization
Last Updated 2026-02-09 04:00 GMT

Overview

A concrete tool for managing KV cache memory in streaming LLM inference: it bounds cache growth by retaining only the initial and most recent tokens.

Description

The StartRecentKVCache class implements a streaming KV cache eviction strategy (adapted from MIT HAN Lab's StreamingLLM) that keeps memory bounded during long-sequence inference. It retains a configurable number of initial tokens (start_size) and recent tokens (recent_size), evicting the middle cache entries once the cached sequence grows past start_size + recent_size. This enables arbitrarily long conversations without exhausting KV cache memory, at the cost of losing attention over the evicted middle context.
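The retention policy itself is easy to state in terms of token positions. A minimal sketch (illustrative only, not the ipex-llm code) of which positions survive an eviction pass:

```python
def retained_indices(seq_len: int, start_size: int = 4, recent_size: int = 512) -> list:
    """Sketch of the StartRecent policy: which token positions are kept."""
    cache_size = start_size + recent_size
    if seq_len <= cache_size:
        return list(range(seq_len))  # within budget: nothing evicted
    # pin the first start_size positions, keep the last recent_size,
    # drop everything in between
    return list(range(start_size)) + list(range(seq_len - recent_size, seq_len))

# Example: a 10-token sequence against an 8-token budget (2 start + 6 recent)
print(retained_indices(10, start_size=2, recent_size=6))
# -> [0, 1, 4, 5, 6, 7, 8, 9]
```

Note that the start tokens are pinned permanently: positions 0 and 1 survive no matter how long the sequence grows, which is what preserves the attention-sink behavior StreamingLLM relies on.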

Usage

Use this when running interactive chat sessions that may exceed the model's context window. The cache manager transparently evicts old entries while preserving the initial system prompt tokens and recent conversation context.

Code Reference

Source Location

Signature

class StartRecentKVCache:
    def __init__(
        self,
        start_size: int = 4,
        recent_size: int = 512,
        k_seq_dim: int = 2,
        v_seq_dim: int = 2,
    ):
        """Initialize KV cache with start and recent size limits."""

    def __call__(self, past_key_values):
        """Trim cache to retain only start and recent tokens."""

    def evict_for_space(self, past_key_values, num_coming):
        """Evict middle entries to make room for new tokens."""

    def evict_range(self, past_key_values, start, end):
        """Remove cache entries within a specified range."""

Import

from kv_cache import StartRecentKVCache

I/O Contract

Inputs

Name Type Required Description
start_size int No Number of initial tokens to retain (default: 4)
recent_size int No Number of recent tokens to retain (default: 512)
k_seq_dim int No Sequence dimension index of the key tensors (default: 2)
v_seq_dim int No Sequence dimension index of the value tensors (default: 2)
past_key_values tuple Yes (runtime) KV cache tensors from the model's forward pass

Outputs

Name Type Description
past_key_values tuple Trimmed KV cache tensors

Usage Examples

KV Cache Management in Chat

from kv_cache import StartRecentKVCache

# Initialize with 4 start tokens and 512 recent tokens
kv_cache = StartRecentKVCache(start_size=4, recent_size=512)

# During generation loop:
past_key_values = kv_cache(past_key_values)  # Trim cache
past_key_values = kv_cache.evict_for_space(past_key_values, num_coming=1)  # Make room
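To make the call order concrete, here is a self-contained toy decode loop. Plain Python lists stand in for the KV tensors, and an inline evict_for_space helper (illustrative, not the library function) replaces the cache object:

```python
def evict_for_space(seq, start_size, recent_size, num_coming):
    """Toy eviction: ensure room for num_coming new cache entries."""
    if len(seq) + num_coming <= start_size + recent_size:
        return seq  # still within budget
    # pin the first start_size positions, keep the newest
    # (recent_size - num_coming) positions, drop the middle
    return seq[:start_size] + seq[len(seq) - recent_size + num_coming:]

kv = []  # token positions currently cached
for tok in range(10):  # decode 10 tokens against a 6-entry budget
    kv = evict_for_space(kv, start_size=2, recent_size=4, num_coming=1)
    kv.append(tok)
print(kv)
# -> [0, 1, 6, 7, 8, 9]  (start tokens pinned, middle evicted)
```

The key point is that eviction runs before each forward step, so the cache never exceeds start_size + recent_size entries even though the conversation keeps growing.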
