Implementation: Intel IPEX-LLM StartRecentKVCache
| Knowledge Sources | |
|---|---|
| Domains | KV_Cache, Memory_Management, Inference_Optimization |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for managing KV cache memory in streaming LLM inference by retaining only initial and recent tokens.
Description
The StartRecentKVCache class implements a streaming KV cache eviction strategy (adapted from MIT HAN Lab's StreamingLLM) that maintains bounded memory during long-sequence inference. It keeps a configurable number of initial tokens (start_size) and recent tokens (recent_size), evicting the middle cache entries once the cached sequence grows past start_size + recent_size. This enables arbitrarily long conversations without exhausting KV cache memory, at the cost of losing access to middle-context attention.
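The retention policy itself is simple and can be illustrated with a minimal, self-contained sketch. This operates on plain Python lists of token positions rather than real KV tensors, and the function name is illustrative, not part of the library's API:

```python
def keep_start_and_recent(positions, start_size=4, recent_size=512):
    """Return the token positions a start+recent policy would retain.

    Illustrative stand-in for slicing KV tensors along the sequence
    dimension: keep the first `start_size` entries and the last
    `recent_size` entries, dropping everything in between.
    """
    if len(positions) <= start_size + recent_size:
        return positions  # Under budget: nothing to evict
    return positions[:start_size] + positions[-recent_size:]

# A 10-token sequence with a budget of 2 start + 4 recent tokens
print(keep_start_and_recent(list(range(10)), start_size=2, recent_size=4))
# → [0, 1, 6, 7, 8, 9]
```

Note that positions 2 through 5 (the "middle") are evicted, while the initial tokens — which StreamingLLM identifies as attention sinks — survive indefinitely.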
Usage
Use this when running interactive chat sessions that may exceed the model's context window. The cache manager transparently evicts old entries while preserving the initial system prompt tokens and recent conversation context.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/portable-zip/kv_cache.py
- Lines: 1-158
Signature
```python
class StartRecentKVCache:
    def __init__(
        self,
        start_size: int = 4,
        recent_size: int = 512,
        k_seq_dim: int = 2,
        v_seq_dim: int = 2,
    ):
        """Initialize KV cache with start and recent size limits."""

    def __call__(self, past_key_values):
        """Trim cache to retain only start and recent tokens."""

    def evict_for_space(self, past_key_values, num_coming):
        """Evict middle entries to make room for new tokens."""

    def evict_range(self, past_key_values, start, end):
        """Remove cache entries within a specified range."""
```
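A hedged sketch of what `__call__` and `evict_for_space` plausibly do internally, following the StreamingLLM strategy. For readability the per-layer K/V tensors are modeled here as plain Python lists with one entry per cached token; the real implementation slices torch tensors along `k_seq_dim` / `v_seq_dim` instead:

```python
def trim_kv(past_key_values, start_size, recent_size):
    """Sketch of __call__: trim each layer's K and V along the sequence axis."""
    if past_key_values is None:
        return None
    seq_len = len(past_key_values[0][0])
    if seq_len <= start_size + recent_size:
        return past_key_values  # Within budget: no eviction needed
    return tuple(
        (k[:start_size] + k[-recent_size:], v[:start_size] + v[-recent_size:])
        for k, v in past_key_values
    )

def evict_for_space(past_key_values, num_coming, start_size, recent_size):
    """Sketch of evict_for_space: trim so num_coming new tokens still fit."""
    if past_key_values is None:
        return None
    seq_len = len(past_key_values[0][0])
    if seq_len + num_coming <= start_size + recent_size:
        return past_key_values
    keep_recent = recent_size - num_coming  # Leave room for incoming tokens
    return tuple(
        (k[:start_size] + k[-keep_recent:], v[:start_size] + v[-keep_recent:])
        for k, v in past_key_values
    )
```

The key design point is that `evict_for_space` trims proactively, before the next forward pass, so the post-append cache never exceeds start_size + recent_size.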
Import
```python
from kv_cache import StartRecentKVCache
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| start_size | int | No | Number of initial tokens to retain (default: 4) |
| recent_size | int | No | Number of recent tokens to retain (default: 512) |
| past_key_values | tuple | Yes (runtime) | KV cache tensors from model forward pass |
Outputs
| Name | Type | Description |
|---|---|---|
| past_key_values | tuple | Trimmed KV cache tensors |
Usage Examples
KV Cache Management in Chat
```python
from kv_cache import StartRecentKVCache

# Initialize with 4 start tokens and 512 recent tokens
kv_cache = StartRecentKVCache(start_size=4, recent_size=512)

# During the generation loop:
past_key_values = kv_cache(past_key_values)  # Trim to start + recent tokens
past_key_values = kv_cache.evict_for_space(past_key_values, num_coming=1)  # Make room for the next token
```
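Putting this together, a simplified generation loop shows how the cache stays bounded no matter how many tokens are generated. The cache is modeled as a plain list (one entry per cached token), with a hypothetical append standing in for the model's forward pass; constants and helper are illustrative, not the library's API:

```python
START_SIZE, RECENT_SIZE = 4, 8  # Small budget for illustration

def trim(cache):
    """Stand-in for StartRecentKVCache.__call__ on a per-token entry list."""
    if len(cache) <= START_SIZE + RECENT_SIZE:
        return cache
    return cache[:START_SIZE] + cache[-RECENT_SIZE:]

cache = []  # Simulated KV entries, one per cached token
for step in range(100):  # Generate 100 tokens
    cache.append(step)   # Hypothetical model forward appends one KV entry
    cache = trim(cache)  # Eviction keeps memory bounded

print(len(cache))          # → 12 (never exceeds START_SIZE + RECENT_SIZE)
print(cache[:START_SIZE])  # → [0, 1, 2, 3] (initial tokens preserved)
```

This is the property the Usage section describes: memory stays constant after the budget is reached, the system-prompt tokens at the front are never evicted, and only middle-context entries are lost.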