Principle:Ollama Ollama KVCache Causal Attention

From Leeroopedia
Knowledge Sources
Domains: KV Cache, Attention
Last Updated: 2025-02-15 00:00 GMT

Overview

Causal KV Cache is the key-value caching strategy used in autoregressive (causal) transformer models, where each token can only attend to itself and all preceding tokens. The cache stores previously computed key and value projections so they do not need to be recomputed at each generation step, and employs techniques such as ring-buffer management and defragmentation to maintain efficient memory utilization.

Core Concepts

Causal Attention Masking

In autoregressive language models, the attention mechanism is constrained by a causal mask: the token at position i can attend only to tokens at positions 0 through i. This lower-triangular masking ensures that the model cannot "look ahead" at future tokens during generation. The KV cache exploits this property: a token's key and value projections depend only on that token and the tokens before it, so once they are computed and stored they remain valid for all subsequent generation steps. Later tokens read these entries but never change them.
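The lower-triangular mask can be illustrated with a minimal sketch. The function names causal_mask and render are hypothetical helpers chosen for this example; they are not taken from the Ollama codebase.

```python
def causal_mask(n):
    """Build an n x n causal mask as nested lists of booleans.

    mask[i][j] is True when the query at position i may attend to the
    key at position j, i.e. when j <= i (lower-triangular).
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def render(mask):
    """Render the mask as text: '1' for allowed, '.' for masked out."""
    return ["".join("1" if allowed else "." for allowed in row) for row in mask]


if __name__ == "__main__":
    for line in render(causal_mask(4)):
        print(line)
```

For four tokens this prints the rows 1..., 11.., 111. and 1111. Row i never changes as more tokens are added, which is the property that makes cached keys and values reusable.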

Incremental Cache Growth

During the prefill (prompt processing) phase, all prompt tokens are processed in parallel and their key-value projections are written to the cache together. During the decode (generation) phase, each new token produces one additional key-value pair per layer per attention head, which is appended to the cache. The cache thus grows linearly with sequence length. For a model with L layers, H attention heads, and head dimension d, each token adds 2 * L * H * d values to the cache (one key vector and one value vector of d elements per layer per head). In models that use grouped-query or multi-query attention, H is the number of key-value heads, which is smaller than the number of query heads.
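As a rough worked example of the 2 * L * H * d formula, the sketch below computes the cache footprint and the cache length across prefill and decode. The model dimensions (32 layers, 32 heads, head dimension 128) and the 2-byte value size are assumptions picked for illustration, and the function names are hypothetical.

```python
def values_per_token(num_layers, num_heads, head_dim):
    """Cached values added per token: one key and one value vector of
    head_dim elements for every layer and head (2 * L * H * d)."""
    return 2 * num_layers * num_heads * head_dim


def cache_bytes(num_tokens, num_layers, num_heads, head_dim, bytes_per_value=2):
    """Total cache size in bytes; bytes_per_value=2 assumes 16-bit storage."""
    per_token = values_per_token(num_layers, num_heads, head_dim)
    return num_tokens * per_token * bytes_per_value


def cache_lengths(prompt_len, new_tokens):
    """Cache length after prefill and after each decode step."""
    lengths = [prompt_len]               # prefill stores the whole prompt at once
    for _ in range(new_tokens):
        lengths.append(lengths[-1] + 1)  # each decode step appends one token
    return lengths


if __name__ == "__main__":
    L, H, d = 32, 32, 128  # hypothetical model dimensions
    print(values_per_token(L, H, d))          # values added per token
    print(cache_bytes(4096, L, H, d) / 2**30)  # GiB for a 4096-token context
    print(cache_lengths(5, 3))
```

Under these assumptions each token adds 262144 values (0.5 MiB at 16 bits), so a 4096-token context occupies 2 GiB.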

Ring-Buffer Management

When the cache reaches its maximum allocated capacity (determined by the model's maximum context length or available memory), a ring-buffer strategy can be employed: new key-value pairs overwrite the oldest entries in a circular fashion. This maintains a fixed memory footprint while preserving the most recent context. However, naive ring-buffer overwriting eventually evicts the earliest tokens of the sequence, and generation quality can degrade once they are gone. Attention sink techniques address this by preserving the first few tokens (which accumulate disproportionate attention mass) while rotating through the remainder of the buffer.
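The slot arithmetic behind a ring buffer with preserved sink tokens can be sketched as follows. SinkRingCache is a hypothetical class written for this illustration; it tracks only which token position occupies which slot and ignores tensor storage and any positional-encoding adjustments a real implementation would need.

```python
class SinkRingCache:
    """Fixed-capacity cache: the first num_sinks slots are never overwritten,
    the remaining slots are reused circularly."""

    def __init__(self, capacity, num_sinks):
        if not 0 <= num_sinks < capacity:
            raise ValueError("need 0 <= num_sinks < capacity")
        self.capacity = capacity
        self.num_sinks = num_sinks
        self.slots = [None] * capacity  # each slot holds (position, kv) or None
        self.next_pos = 0               # position of the next token to store

    def append(self, kv):
        """Store kv for the next token and return the slot index used."""
        pos = self.next_pos
        if pos < self.num_sinks:
            slot = pos  # sink tokens keep their slots permanently
        else:
            ring_size = self.capacity - self.num_sinks
            slot = self.num_sinks + (pos - self.num_sinks) % ring_size
        self.slots[slot] = (pos, kv)
        self.next_pos += 1
        return slot

    def positions(self):
        """Token positions currently held, in ascending order."""
        return sorted(entry[0] for entry in self.slots if entry is not None)


if __name__ == "__main__":
    cache = SinkRingCache(capacity=6, num_sinks=2)
    for i in range(10):
        cache.append(f"kv{i}")
    print(cache.positions())
```

With a capacity of 6 and 2 sink slots, storing 10 tokens leaves positions 0, 1, 6, 7, 8 and 9 in the cache: the two sinks plus the four most recent tokens. Setting num_sinks to 0 gives the naive ring buffer.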

Cache Defragmentation

As sequences are created and completed in a multi-tenant serving environment, the cache memory becomes fragmented: gaps appear between active sequences' cache regions. Defragmentation compacts active cache entries into contiguous memory, reclaiming gaps for new sequences. This is analogous to memory compaction in operating systems. Compaction can be performed by physically copying cache tensors to new positions. Alternatively, an indirection table that maps logical positions to physical memory locations (as in paged attention) lets a sequence occupy non-contiguous blocks, which removes most of the need for copying.
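The copying variant of defragmentation can be sketched as a single compaction pass. The cell layout used here (a list in which None marks a free cell and a (sequence id, position) tuple marks an occupied one) and the function names are assumptions made for this example, not the Ollama data structures.

```python
def largest_free_run(cells):
    """Length of the longest contiguous run of free (None) cells."""
    best = run = 0
    for cell in cells:
        run = run + 1 if cell is None else 0
        best = max(best, run)
    return best


def defragment(cells):
    """Compact occupied cells to the front, preserving their order.

    Returns (compacted_cells, moves), where moves lists the (src, dst)
    copies a tensor-level implementation would have to perform.
    """
    compacted = [None] * len(cells)
    moves = []
    dst = 0
    for src, cell in enumerate(cells):
        if cell is None:
            continue
        if src != dst:
            moves.append((src, dst))
        compacted[dst] = cell
        dst += 1
    return compacted, moves


if __name__ == "__main__":
    cells = [("a", 0), None, ("b", 0), None, None, ("b", 1)]
    compacted, moves = defragment(cells)
    print(largest_free_run(cells), largest_free_run(compacted))
    print(moves)
```

In this example the largest free run grows from 2 cells to 3, at the cost of two copies: (2, 1) and (5, 2). The length of the move list is a simple proxy for the copying cost that a fragmentation threshold has to be weighed against.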

Shared Prefix Optimization

Multiple sequences may share a common prefix (e.g., the same system prompt). Rather than storing duplicate key-value entries for the shared prefix, the cache can reference a single copy. Sharing is safe because of the causal constraint: the prefix's keys and values do not depend on the tokens that follow. When a new sequence starts with the same prefix as an existing sequence, the cache entries for those positions can be shared, using copy-on-write or explicit reference counting to track ownership. This optimization significantly reduces memory usage in scenarios where many concurrent conversations share the same system prompt or few-shot examples.
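Explicit reference counting for a shared prefix might look like the following sketch. SharedPrefixCache and its methods are hypothetical names introduced for illustration; each entry stores a token placeholder where a real cache would hold key and value tensors.

```python
class SharedPrefixCache:
    """Per-token cache entries shared between sequences via reference counts."""

    def __init__(self):
        self.entries = {}    # entry id -> {"token": ..., "refs": int}
        self.sequences = {}  # sequence id -> entry ids ordered by position
        self._next_id = 0

    def append(self, seq_id, token):
        """Add a private entry for the next position of seq_id."""
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = {"token": token, "refs": 1}
        self.sequences.setdefault(seq_id, []).append(entry_id)

    def fork(self, src_id, dst_id, prefix_len):
        """Start dst_id by referencing the first prefix_len entries of src_id."""
        shared = self.sequences[src_id][:prefix_len]
        for entry_id in shared:
            self.entries[entry_id]["refs"] += 1
        self.sequences[dst_id] = list(shared)

    def release(self, seq_id):
        """Drop a sequence; free entries no other sequence references."""
        for entry_id in self.sequences.pop(seq_id):
            entry = self.entries[entry_id]
            entry["refs"] -= 1
            if entry["refs"] == 0:
                del self.entries[entry_id]

    def stored_entries(self):
        return len(self.entries)


if __name__ == "__main__":
    cache = SharedPrefixCache()
    for tok in ["sys0", "sys1", "sys2", "sys3"]:
        cache.append("a", tok)
    cache.fork("a", "b", prefix_len=4)
    for tok in ["a4", "a5"]:
        cache.append("a", tok)
    for tok in ["b4", "b5", "b6"]:
        cache.append("b", tok)
    print(cache.stored_entries())
```

Here sequence a holds 6 tokens and sequence b holds 7, yet only 9 entries are stored instead of 13. Releasing a frees its two private entries while the four shared prefix entries stay alive for b.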

Implementation Notes

In the Ollama codebase, causal KV caching is the default cache implementation for standard autoregressive models (Llama, Mistral, GPT-family, etc.). The cache is allocated as contiguous GPU memory tensors sized to the model's maximum context length. Sequence management tracks which positions in the cache belong to which active sequences, using a slot-based system. When sequences complete, their slots are marked as available. Defragmentation is triggered when fragmentation exceeds a configurable threshold, compacting active entries and freeing contiguous blocks. The implementation supports KV cache quantization (storing cached values in lower precision) to reduce memory footprint and increase effective context length.
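To make the idea of lower-precision storage concrete, the sketch below applies symmetric 8-bit quantization with one shared scale per block of values. This is a generic scheme chosen for illustration, not a description of the quantization formats Ollama actually uses, and the function names are hypothetical.

```python
def quantize_block(values):
    """Quantize floats to signed 8-bit integers with one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Clamp to guard against floating-point results just outside +/-127.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Recover approximate floats from the stored integers and scale."""
    return [scale * q for q in quantized]


if __name__ == "__main__":
    block = [0.5, -1.0, 0.25, 0.0, 0.9]
    scale, q = quantize_block(block)
    restored = dequantize_block(scale, q)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(q)
    print(worst <= scale / 2 + 1e-12)
```

Each restored value differs from the original by at most about half the scale. Storing 8-bit integers plus one scale per block roughly halves the footprint relative to 16-bit values, which is how quantization trades a small amount of precision for a longer effective context.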
