Implementation: Intel IPEX-LLM StartRecentKVCache
| Knowledge Sources | |
|---|---|
| Domains | KV_Cache, Memory_Management, Inference_Optimization |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
Concrete tool for managing KV cache memory in streaming LLM inference by retaining only initial and recent tokens.
Description
The StartRecentKVCache class implements a streaming KV cache eviction strategy (adapted from MIT HAN Lab's StreamingLLM) that maintains bounded memory during long-sequence inference. It keeps a configurable number of initial tokens (start_size) and recent tokens (recent_size), evicting the middle cache entries once the cached sequence grows past start_size + recent_size. This enables arbitrarily long conversations without exhausting KV cache memory, at the cost of losing access to middle-context attention.
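The retention policy itself is simple and can be illustrated with a minimal, self-contained sketch. This operates on plain Python lists of token positions rather than real KV tensors, and the function name is illustrative, not part of the library's API:

```python
def keep_start_and_recent(positions, start_size=4, recent_size=512):
    """Return the token positions a start+recent policy would retain.

    Illustrative stand-in for slicing KV tensors along the sequence
    dimension: keep the first `start_size` entries and the last
    `recent_size` entries, dropping everything in between.
    """
    if len(positions) <= start_size + recent_size:
        return positions  # Under budget: nothing to evict
    return positions[:start_size] + positions[-recent_size:]

# A 10-token sequence with a budget of 2 start + 4 recent tokens
print(keep_start_and_recent(list(range(10)), start_size=2, recent_size=4))
# → [0, 1, 6, 7, 8, 9]
```

Note that positions 2 through 5 (the "middle") are evicted, while the initial tokens — which StreamingLLM identifies as attention sinks — survive indefinitely.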
Usage
Use this when running interactive chat sessions that may exceed the model's context window. The cache manager transparently evicts old entries while preserving the initial system prompt tokens and recent conversation context.
Code Reference
Source Location
- Repository: Intel IPEX-LLM
- File: python/llm/portable-zip/kv_cache.py
- Lines: 1-158
Signature
```python
class StartRecentKVCache:
    def __init__(
        self,
        start_size: int = 4,
        recent_size: int = 512,
        k_seq_dim: int = 2,
        v_seq_dim: int = 2,
    ):
        """Initialize KV cache with start and recent size limits."""

    def __call__(self, past_key_values):
        """Trim cache to retain only start and recent tokens."""

    def evict_for_space(self, past_key_values, num_coming):
        """Evict middle entries to make room for new tokens."""

    def evict_range(self, past_key_values, start, end):
        """Remove cache entries within a specified range."""
```
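A hedged sketch of what `__call__` and `evict_for_space` plausibly do internally, following the StreamingLLM strategy. For readability the per-layer K/V tensors are modeled here as plain Python lists with one entry per cached token; the real implementation slices torch tensors along `k_seq_dim` / `v_seq_dim` instead:

```python
def trim_kv(past_key_values, start_size, recent_size):
    """Sketch of __call__: trim each layer's K and V along the sequence axis."""
    if past_key_values is None:
        return None
    seq_len = len(past_key_values[0][0])
    if seq_len <= start_size + recent_size:
        return past_key_values  # Within budget: no eviction needed
    return tuple(
        (k[:start_size] + k[-recent_size:], v[:start_size] + v[-recent_size:])
        for k, v in past_key_values
    )

def evict_for_space(past_key_values, num_coming, start_size, recent_size):
    """Sketch of evict_for_space: trim so num_coming new tokens still fit."""
    if past_key_values is None:
        return None
    seq_len = len(past_key_values[0][0])
    if seq_len + num_coming <= start_size + recent_size:
        return past_key_values
    keep_recent = recent_size - num_coming  # Leave room for incoming tokens
    return tuple(
        (k[:start_size] + k[-keep_recent:], v[:start_size] + v[-keep_recent:])
        for k, v in past_key_values
    )
```

The key design point is that `evict_for_space` trims proactively, before the next forward pass, so the post-append cache never exceeds start_size + recent_size.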
Import
```python
from kv_cache import StartRecentKVCache
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| start_size | int | No | Number of initial tokens to retain (default: 4) |
| recent_size | int | No | Number of recent tokens to retain (default: 512) |
| past_key_values | tuple | Yes (runtime) | KV cache tensors from model forward pass |
Outputs
| Name | Type | Description |
|---|---|---|
| past_key_values | tuple | Trimmed KV cache tensors |
Usage Examples
KV Cache Management in Chat
```python
from kv_cache import StartRecentKVCache

# Initialize with 4 start tokens and 512 recent tokens
kv_cache = StartRecentKVCache(start_size=4, recent_size=512)

# During the generation loop:
past_key_values = kv_cache(past_key_values)  # Trim to start + recent tokens
past_key_values = kv_cache.evict_for_space(past_key_values, num_coming=1)  # Make room for the next token
```
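Putting this together, a simplified generation loop shows how the cache stays bounded no matter how many tokens are generated. The cache is modeled as a plain list (one entry per cached token), with a hypothetical append standing in for the model's forward pass; constants and helper are illustrative, not the library's API:

```python
START_SIZE, RECENT_SIZE = 4, 8  # Small budget for illustration

def trim(cache):
    """Stand-in for StartRecentKVCache.__call__ on a per-token entry list."""
    if len(cache) <= START_SIZE + RECENT_SIZE:
        return cache
    return cache[:START_SIZE] + cache[-RECENT_SIZE:]

cache = []  # Simulated KV entries, one per cached token
for step in range(100):  # Generate 100 tokens
    cache.append(step)   # Hypothetical model forward appends one KV entry
    cache = trim(cache)  # Eviction keeps memory bounded

print(len(cache))          # → 12 (never exceeds START_SIZE + RECENT_SIZE)
print(cache[:START_SIZE])  # → [0, 1, 2, 3] (initial tokens preserved)
```

This is the property the Usage section describes: memory stays constant after the budget is reached, the system-prompt tokens at the front are never evicted, and only middle-context entries are lost.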