Principle:Ollama Ollama KVCache Causal Attention
| Knowledge Sources | |
|---|---|
| Domains | KV Cache, Attention |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Causal KV Cache is the key-value caching strategy used in autoregressive (causal) transformer models, where each token can attend only to itself and to preceding tokens. The cache stores previously computed key and value projections so they do not need to be recomputed at each generation step, and it uses techniques such as ring-buffer management and defragmentation to keep memory use bounded and contiguous.
Core Concepts
Causal Attention Masking
In autoregressive language models, the attention mechanism is constrained by a causal mask: the token at position i can attend only to tokens at positions 0 through i. This lower-triangular masking ensures that the model cannot "look ahead" at future tokens during generation. The KV cache exploits this property: a token's key and value projections depend only on that token and its predecessors, so once they are computed and stored they remain valid for every subsequent step. Tokens that arrive later can read them but can never change them.
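The following is a minimal single-head sketch of that property, not code from any particular runtime. It compares attention recomputed from scratch under a lower-triangular mask with a step-by-step decode that appends each key and value to a cache exactly once. The function names and the toy vectors in the usage note are illustrative assumptions.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when the token at position i may attend to position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def full_causal_attention(qs, ks, vs):
    """Recompute every row from scratch; row i sees only positions 0..i."""
    mask = causal_mask(len(qs))
    out = []
    for i, q in enumerate(qs):
        visible = [j for j in range(len(ks)) if mask[i][j]]
        out.append(attend(q, [ks[j] for j in visible], [vs[j] for j in visible]))
    return out

def cached_decode(qs, ks, vs):
    """Process tokens one at a time, storing each key and value in the cache once."""
    k_cache, v_cache, out = [], [], []
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)  # earlier entries are never rewritten
        v_cache.append(v)
        out.append(attend(q, k_cache, v_cache))
    return out
```

Calling both functions on the same small inputs, for example three 2-dimensional query, key, and value vectors, yields matching outputs row by row, which is why the cached path can skip recomputing earlier projections.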
Incremental Cache Growth
During the prefill (prompt processing) phase, all prompt tokens are processed in parallel and their key-value projections are stored in the cache simultaneously. During the decode (generation) phase, each new token produces one additional key-value pair per layer per attention head, which is appended to the cache. The cache thus grows linearly with sequence length. For a model with L layers, H attention heads, and head dimension d, each token adds 2 * L * H * d values to the cache (one key vector and one value vector of dimension d per layer per head). In models that use grouped-query attention, H here is the number of key-value heads, which is smaller than the number of query heads.
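The arithmetic above can be checked with a short sketch. The helper names and the example configuration (32 layers, 32 heads, head dimension 128, 2 bytes per value) are illustrative assumptions, not figures taken from a specific model card.

```python
def kv_values_per_token(num_layers, num_kv_heads, head_dim):
    """Values added to the cache per token: one key and one value vector
    of size head_dim for every layer and every key-value head."""
    return 2 * num_layers * num_kv_heads * head_dim

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Total cache size for seq_len tokens; bytes_per_value=2 assumes 16-bit storage."""
    return kv_values_per_token(num_layers, num_kv_heads, head_dim) * seq_len * bytes_per_value

# Hypothetical configuration: L=32, H=32, d=128, 16-bit values.
per_token = kv_values_per_token(32, 32, 128)   # 262144 values per token
total = kv_cache_bytes(32, 32, 128, 4096)      # 2 GiB at 4096 tokens
```

Under these assumptions each token costs 512 KiB, so a 4096-token context needs 2 GiB. Dropping to 8 key-value heads with the same layer count and head dimension reduces that to 512 MiB.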
Ring-Buffer Management
When the cache reaches its maximum allocated capacity (determined by the model's maximum context length or available memory), a ring-buffer strategy can be employed: new key-value pairs overwrite the oldest entries in a circular fashion. This maintains a fixed memory footprint while preserving the most recent context. However, naive overwriting eventually evicts the earliest tokens of the sequence, and losing them can sharply degrade generation quality. Attention sink techniques address this by preserving the first few tokens (which accumulate disproportionate attention mass) while rotating through the remainder of the buffer.
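A minimal sketch of a ring buffer with preserved sink slots follows. The class name `SinkRingCache` and its parameters are hypothetical, and the payload is an opaque object standing in for a token's key-value tensors; a real cache would also need to handle positional encodings for the retained entries.

```python
class SinkRingCache:
    """Fixed-capacity cache. The first `num_sinks` slots are never overwritten;
    the remaining slots are reused in circular order, oldest first."""

    def __init__(self, capacity, num_sinks):
        if not 0 <= num_sinks < capacity:
            raise ValueError("num_sinks must be in [0, capacity)")
        self.capacity = capacity
        self.num_sinks = num_sinks
        self.slots = [None] * capacity  # each filled slot holds (position, kv)
        self.count = 0                  # total tokens appended so far

    def append(self, kv):
        """Store kv for the next position and return the slot index used."""
        pos = self.count
        if pos < self.num_sinks:
            idx = pos  # sink region: written once, then kept
        else:
            ring = self.capacity - self.num_sinks
            idx = self.num_sinks + (pos - self.num_sinks) % ring
        self.slots[idx] = (pos, kv)
        self.count += 1
        return idx

    def positions(self):
        """Sequence positions currently held, in ascending order."""
        return sorted(slot[0] for slot in self.slots if slot is not None)
```

With a capacity of 6 and 2 sink slots, appending ten tokens leaves positions 0 and 1 in place and keeps only the four most recent positions (6 through 9) in the rotating region.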
Cache Defragmentation
As sequences are created and completed in a multi-tenant serving environment, the cache memory becomes fragmented: gaps appear between active sequences' cache regions. Defragmentation compacts active cache entries into contiguous memory, reclaiming the gaps for new sequences. This is analogous to memory compaction in operating systems. Compaction can be performed by physically copying cache tensors to new positions. Alternatively, an indirection table that maps logical positions to physical memory locations (as in paged attention) removes the need for entries to be physically contiguous.
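The copy-based variant can be sketched on a flat list of cells, where `None` marks a free cell and any other value stands in for an occupied entry. This is an illustrative model only; the function names are assumptions and no real tensor copies are performed.

```python
def largest_free_run(cells):
    """Length of the longest contiguous run of free (None) cells."""
    best = run = 0
    for cell in cells:
        run = run + 1 if cell is None else 0
        best = max(best, run)
    return best

def defragment(cells):
    """Slide occupied cells toward the front, preserving their order.

    Returns (compacted_cells, moves), where moves lists the (src, dst)
    copies a real cache would have to apply to its tensors."""
    compacted = list(cells)
    moves = []
    dst = 0
    for src, cell in enumerate(cells):
        if cell is None:
            continue
        if src != dst:
            compacted[dst] = cell
            compacted[src] = None
            moves.append((src, dst))
        dst += 1
    return compacted, moves
```

For the layout `["A", None, "B", None, None, "C"]`, the largest free run is 2 cells before compaction and 3 afterwards, and only two entries need to be moved.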
Prefix Sharing
Multiple sequences may share a common prefix (e.g., the same system prompt). Rather than storing duplicate key-value entries for the shared prefix, the cache can reference a single copy. When a new sequence starts with the same prefix as an existing sequence, the cache entries for those positions can be shared, using copy-on-write or explicit reference counting to track ownership. Sharing is safe because, under causal masking, the entries for a prefix depend only on the prefix tokens themselves. This optimization significantly reduces memory usage when many concurrent conversations share the same system prompt or few-shot examples.
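One way to picture reference-counted sharing is the sketch below. The class `PrefixSharingCache` and its methods are hypothetical; each cell stores a token id in place of real key-value tensors, and a prefix match is found by comparing token ids from position 0.

```python
class PrefixSharingCache:
    """Toy cache in which sequences with a common token prefix share cells."""

    def __init__(self):
        self.cells = {}      # cell_id -> [token, refcount]
        self.seq_cells = {}  # seq_id -> cell ids in position order
        self._next_cell = 0

    def _common_prefix(self, tokens):
        """Cell ids of the longest prefix shared with any stored sequence."""
        best = []
        for cell_ids in self.seq_cells.values():
            n = 0
            while (n < len(cell_ids) and n < len(tokens)
                   and self.cells[cell_ids[n]][0] == tokens[n]):
                n += 1
            if n > len(best):
                best = cell_ids[:n]
        return best

    def add_sequence(self, seq_id, tokens):
        """Register a sequence and return how many positions were reused."""
        shared = self._common_prefix(tokens)
        for cid in shared:
            self.cells[cid][1] += 1  # one more owner for each shared cell
        own = []
        for tok in tokens[len(shared):]:
            cid = self._next_cell
            self._next_cell += 1
            self.cells[cid] = [tok, 1]
            own.append(cid)
        self.seq_cells[seq_id] = shared + own
        return len(shared)

    def remove_sequence(self, seq_id):
        """Drop a sequence; free each cell once no sequence references it."""
        for cid in self.seq_cells.pop(seq_id):
            self.cells[cid][1] -= 1
            if self.cells[cid][1] == 0:
                del self.cells[cid]
```

Adding the token sequences `[1, 2, 3, 4]` and then `[1, 2, 3, 9, 9]` stores six cells instead of nine, and removing the first sequence frees only the one cell that the second sequence does not reference.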
Implementation Notes
In the Ollama codebase, causal KV caching is the default cache implementation for standard autoregressive models (Llama, Mistral, GPT-family, etc.). The cache is allocated as contiguous GPU memory tensors sized to the model's maximum context length. Sequence management tracks which positions in the cache belong to which active sequences, using a slot-based system. When sequences complete, their slots are marked as available. Defragmentation is triggered when fragmentation exceeds a configurable threshold, compacting active entries and freeing contiguous blocks. The implementation supports KV cache quantization (storing cached values in lower precision) to reduce memory footprint and increase effective context length.
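To illustrate the idea behind storing cached values in lower precision, here is a generic symmetric 8-bit block quantizer. It is an assumption-laden sketch of the general technique, not the storage format Ollama actually uses, and the function names are hypothetical.

```python
def quantize_block(values):
    """Symmetric 8-bit quantization of one block of floats.

    Returns (scale, codes): a single float scale for the block and one
    integer code in [-127, 127] per value."""
    peak = max(abs(v) for v in values)
    scale = peak / 127.0 if peak > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Approximate reconstruction of the original floats."""
    return [scale * c for c in codes]
```

Each value then occupies one byte plus a share of the per-block scale, instead of two or four bytes, at the cost of a rounding error of at most half a quantization step per value.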