Principle:Intel Ipex llm KV Cache Eviction

Knowledge Sources	Efficient Streaming Language Models with Attention Sinks
Domains	Memory_Management, Inference_Optimization, Attention
Last Updated	2026-02-09 04:00 GMT

Overview

KV cache eviction is a memory management technique that enables bounded-memory streaming LLM inference by evicting entries from the middle of the KV cache while retaining the initial and the most recent tokens.

Description

KV cache eviction addresses the linear memory growth problem in autoregressive LLM inference. As the conversation length grows, the key-value cache accumulates entries for every generated token. The "Start-Recent" eviction strategy (based on the StreamingLLM/Attention Sinks research) observes that attention patterns typically concentrate on initial tokens (attention sinks) and recent tokens. By retaining only these two groups and evicting the middle, the cache size stays bounded regardless of conversation length.
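To make the linear growth concrete, the following is a rough back-of-the-envelope sketch. The model shape (32 layers, 32 KV heads, head dimension 128, 2 bytes per element) and the start/recent sizes are hypothetical values chosen only for illustration, not figures from this page or from any particular model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes held by the KV cache for one sequence (keys plus values)."""
    # The factor 2 accounts for storing both a key and a value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128
# Hypothetical eviction settings: 4 start tokens, 2000 recent tokens.
START_SIZE, RECENT_SIZE = 4, 2000

for n in (1_000, 10_000, 100_000):
    unbounded = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM)
    # With Start-Recent eviction the cached length never exceeds start + recent.
    capped_len = min(n, START_SIZE + RECENT_SIZE)
    bounded = kv_cache_bytes(capped_len, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"n={n:>7}: unbounded {unbounded / 2**20:9.1f} MiB, "
          f"bounded {bounded / 2**20:9.1f} MiB")
```

Under these assumptions the unbounded cache grows in direct proportion to the number of tokens, while the bounded cache stops growing once the sequence exceeds the start plus recent budget.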

Usage

Use this principle when running interactive chat sessions that may exceed the model's context window. It is particularly useful for portable or edge deployments where memory is constrained and conversations are long-running.

Theoretical Basis

Given a KV cache of length $n$, start size $s$, and recent size $r$, with $n > s + r$:

The retained positions are $[0, s) \cup [n - r, n)$; positions $[s, n - r)$ are evicted.

The cache size after eviction is $s + r$, so it stays bounded regardless of conversation length. When $n \le s + r$, nothing is evicted.
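As a small worked example of the partition, the sketch below lists which positions survive eviction. The helper name `kept_positions` and the sizes used are illustrative assumptions, not part of any library.

```python
def kept_positions(n, s, r):
    """Return the cache positions that survive Start-Recent eviction."""
    if n <= s + r:
        return list(range(n))  # cache still fits, nothing is evicted
    # Keep [0, s) and [n - r, n); positions [s, n - r) are dropped.
    return list(range(s)) + list(range(n - r, n))

# Illustrative sizes: s = 2 start tokens, r = 3 recent tokens.
print(kept_positions(4, 2, 3))   # [0, 1, 2, 3]
print(kept_positions(10, 2, 3))  # [0, 1, 7, 8, 9]
```

With $n = 10$, $s = 2$, and $r = 3$, positions 2 through 6 are evicted and the cache holds $s + r = 5$ entries.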

Pseudo-code Logic:

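# KV cache eviction for a single key or value tensor (assumes a PyTorch tensor).
# seq_dim=2 assumes a [batch, heads, seq_len, head_dim] layout; adjust as needed.
import torch

def evict(past_key_values, start_size, recent_size, seq_dim=2):
    seq_len = past_key_values.shape[seq_dim]
    if seq_len <= start_size + recent_size:
        return past_key_values  # No eviction needed
    # Keep start tokens and recent tokens, drop the middle.
    # Slice along seq_dim rather than the leading axis.
    start = past_key_values.narrow(seq_dim, 0, start_size)
    recent = past_key_values.narrow(seq_dim, seq_len - recent_size, recent_size)
    return torch.cat([start, recent], dim=seq_dim)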
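The effect of applying eviction at every decoding step can be illustrated with a dependency-free simulation. This is a minimal sketch, not the library's implementation: the cache is modeled as a plain list with one entry per token, and the names `evict_list` and `stream` are hypothetical.

```python
def evict_list(cache, start_size, recent_size):
    """Start-Recent eviction over a list with one entry per cached token."""
    if len(cache) <= start_size + recent_size:
        return cache
    # Explicit bounds avoid the cache[-0:] pitfall when recent_size is 0.
    return cache[:start_size] + cache[len(cache) - recent_size:]

def stream(num_tokens, start_size, recent_size):
    """Simulate decoding: append one entry per step, then evict."""
    cache, peak = [], 0
    for position in range(num_tokens):
        cache.append(position)  # stand-in for this token's key/value pair
        cache = evict_list(cache, start_size, recent_size)
        peak = max(peak, len(cache))
    return cache, peak

cache, peak = stream(num_tokens=1000, start_size=4, recent_size=8)
print(cache)  # [0, 1, 2, 3, 992, 993, 994, 995, 996, 997, 998, 999]
print(peak)   # 12
```

After 1000 simulated tokens the cache still holds only the 4 initial entries and the 8 most recent ones. The sketch covers only which entries are retained; the StreamingLLM work also assigns positional information relative to positions inside the cache rather than in the original text, which is not modeled here.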

Related Pages

Implementation:Intel_Ipex_llm_StartRecentKVCache
