Principle:Intel Ipex llm KV Cache Eviction
| Knowledge Sources | |
|---|---|
| Domains | Memory_Management, Inference_Optimization, Attention |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
A memory management technique that keeps memory bounded during streaming LLM inference by evicting entries from the middle of the KV cache while retaining the entries for the initial and most recent tokens.
Description
KV cache eviction addresses the linear memory growth problem in autoregressive LLM inference. The key-value cache accumulates an entry for every token processed, prompt and generated alike, so its size grows in step with the conversation length. The "Start-Recent" eviction strategy (based on the StreamingLLM/Attention Sinks research) rests on the observation that attention typically concentrates on the initial tokens (attention sinks) and on the most recent tokens. Retaining only these two groups and evicting the middle keeps the cache size bounded regardless of conversation length.
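To make the linear growth concrete, the sketch below estimates the KV cache footprint as a function of sequence length, with and without Start-Recent eviction. The model dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte elements), the start and recent sizes (4 and 2,000), and the helper names `kv_cache_bytes` and `bounded_len` are illustrative assumptions, not values or APIs taken from IPEX-LLM.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_elem=2):
    """Estimate KV cache size in bytes for one sequence.

    The leading factor of 2 covers keys and values. All default
    dimensions are illustrative assumptions, not a specific model config.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


def bounded_len(seq_len, start_size, recent_size):
    """Cache length after Start-Recent eviction (hypothetical helper)."""
    return min(seq_len, start_size + recent_size)


MIB = 1024 ** 2
for seq_len in (1_000, 10_000, 100_000):
    full = kv_cache_bytes(seq_len) / MIB
    kept = kv_cache_bytes(bounded_len(seq_len, 4, 2_000)) / MIB
    print(f"{seq_len:>7} tokens: full={full:9.1f} MiB, start-recent={kept:7.1f} MiB")
```

Under these assumptions the unbounded cache grows in proportion to `seq_len`, while the evicted cache stops growing once the sequence passes `start_size + recent_size` tokens.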
Usage
Use this principle when running interactive chat sessions whose total length may exceed the model's context window. It is particularly relevant for portable/edge deployments, where memory is constrained and conversations are long-running.
Theoretical Basis
Given a KV cache of length $L$, start size $S$, and recent size $R$:
The cache is partitioned as: $[0, S) \cup [S, L - R) \cup [L - R, L)$, evicting positions $[S, L - R)$ whenever $L > S + R$.
Total cache size is bounded at $S + R$ regardless of conversation length.
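As a small worked instance of this partition, the sketch below lists which positions are kept and which are evicted for a cache of length 10 with start size 2 and recent size 3. The function name `partition` and the concrete sizes are chosen here purely for illustration.

```python
def partition(cache_len, start_size, recent_size):
    """Return (kept, evicted) position lists for a cache of length cache_len."""
    if cache_len <= start_size + recent_size:
        return list(range(cache_len)), []  # cache still fits, nothing to evict
    recent_begin = cache_len - recent_size
    kept = list(range(start_size)) + list(range(recent_begin, cache_len))
    evicted = list(range(start_size, recent_begin))
    return kept, evicted


kept, evicted = partition(cache_len=10, start_size=2, recent_size=3)
print(kept)     # [0, 1, 7, 8, 9]
print(evicted)  # [2, 3, 4, 5, 6]
```

The kept list always has at most `start_size + recent_size` entries, which is the bound stated above.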
Pseudo-code Logic:
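# Abstract KV cache eviction, applied to a single key or value tensor
import torch

def evict(past_key_values, start_size, recent_size, seq_dim=2):
    # seq_dim=2 assumes a [batch, heads, seq_len, head_dim] layout
    seq_len = past_key_values.size(seq_dim)
    if seq_len <= start_size + recent_size:
        return past_key_values  # No eviction needed
    # Keep start tokens and recent tokens, drop the middle
    start = past_key_values.narrow(seq_dim, 0, start_size)
    recent = past_key_values.narrow(seq_dim, seq_len - recent_size, recent_size)
    return torch.cat([start, recent], dim=seq_dim)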
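The eviction logic can be exercised in a token-by-token loop to show that the cache stops growing. The following is a minimal sketch in which a plain Python list of token indices stands in for the sequence axis of the cache; the `stream` helper and the sizes used are hypothetical and not part of any library API.

```python
def evict(cache, start_size, recent_size):
    """Start-Recent eviction on a list standing in for the sequence axis."""
    if len(cache) <= start_size + recent_size:
        return cache  # no eviction needed
    # Slice from an explicit index: cache[-0:] would return the whole list.
    return cache[:start_size] + cache[len(cache) - recent_size:]


def stream(num_tokens, start_size, recent_size):
    """Append one token per step, evict after each append, record cache sizes."""
    cache, sizes = [], []
    for token_id in range(num_tokens):
        cache.append(token_id)
        cache = evict(cache, start_size, recent_size)
        sizes.append(len(cache))
    return cache, sizes


cache, sizes = stream(num_tokens=10, start_size=2, recent_size=3)
print(cache)       # [0, 1, 7, 8, 9]
print(max(sizes))  # 5
```

Because eviction runs after every append, at most one entry is dropped per step once the cache is full, and the cache length never exceeds `start_size + recent_size`.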