Principle:Ollama Ollama KVCache Abstraction
| Knowledge Sources | |
|---|---|
| Domains | KV Cache, Memory Management |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
The KV Cache Abstraction is the principle of defining a unified interface for key-value caching that supports diverse attention mechanisms (causal, encoder-decoder, sliding window, hybrid) through a common API. This enables the inference pipeline to work with different model architectures without coupling to specific caching strategies.
Core Concepts
Interface-Based Cache Design
An abstract KV cache interface defines the operations that any cache implementation must support: allocating cache entries for new sequences, looking up cached keys and values for a given layer and attention head, storing new key-value pairs, removing entries when sequences complete, and reporting memory usage. By programming to this interface rather than a concrete implementation, the inference engine can transparently support different caching strategies (dense, sparse, sliding window, paged) without modification.
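The operations listed above can be sketched as a small Go interface with one dense implementation behind it. This is a minimal illustration, not Ollama's actual API: the names `Cache`, `DenseCache`, and `ErrUnknownSeq` are hypothetical, plain `[]float32` slices stand in for tensors, and per-head lookup is folded into the per-layer vectors for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// Cache is a hypothetical interface written for this example; it is not
// Ollama's actual API. Plain float32 slices stand in for tensors.
type Cache interface {
	Allocate(seq int) error                                   // reserve entries for a new sequence
	Put(seq, layer int, key, value []float32) error           // store one token's key-value pair
	Get(seq, layer int) (keys, values [][]float32, err error) // look up cached pairs for a layer
	Remove(seq int)                                           // drop a completed sequence
	BytesUsed() int                                           // report memory usage
}

// ErrUnknownSeq is returned when a sequence was never allocated or was removed.
var ErrUnknownSeq = errors.New("unknown sequence")

// layerEntry holds the keys and values cached for one layer of one sequence.
type layerEntry struct{ keys, values [][]float32 }

// DenseCache is one possible implementation: it keeps every token.
type DenseCache struct {
	layers int
	seqs   map[int][]layerEntry
}

func NewDenseCache(layers int) *DenseCache {
	return &DenseCache{layers: layers, seqs: make(map[int][]layerEntry)}
}

func (c *DenseCache) Allocate(seq int) error {
	if _, ok := c.seqs[seq]; ok {
		return fmt.Errorf("sequence %d already allocated", seq)
	}
	c.seqs[seq] = make([]layerEntry, c.layers)
	return nil
}

// lookup validates the (sequence, layer) pair and returns its entry.
func (c *DenseCache) lookup(seq, layer int) (*layerEntry, error) {
	entries, ok := c.seqs[seq]
	if !ok {
		return nil, ErrUnknownSeq
	}
	if layer < 0 || layer >= c.layers {
		return nil, fmt.Errorf("layer %d out of range", layer)
	}
	return &entries[layer], nil
}

func (c *DenseCache) Put(seq, layer int, key, value []float32) error {
	e, err := c.lookup(seq, layer)
	if err != nil {
		return err
	}
	e.keys = append(e.keys, key)
	e.values = append(e.values, value)
	return nil
}

func (c *DenseCache) Get(seq, layer int) ([][]float32, [][]float32, error) {
	e, err := c.lookup(seq, layer)
	if err != nil {
		return nil, nil, err
	}
	return e.keys, e.values, nil
}

func (c *DenseCache) Remove(seq int) { delete(c.seqs, seq) }

func (c *DenseCache) BytesUsed() int {
	total := 0
	for _, entries := range c.seqs {
		for _, e := range entries {
			for i := range e.keys {
				total += 4 * (len(e.keys[i]) + len(e.values[i])) // 4 bytes per float32
			}
		}
	}
	return total
}

func main() {
	var c Cache = NewDenseCache(2) // callers see only the interface
	_ = c.Allocate(7)
	_ = c.Put(7, 0, []float32{1, 2}, []float32{3, 4})
	keys, _, _ := c.Get(7, 0)
	fmt.Println(len(keys), c.BytesUsed())
}
```

Because `main` holds the cache only as a `Cache` value, a sliding window or paged implementation could be substituted without changing the calling code, which is the point of the abstraction.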
Sequence-Level Cache Management
KV caches operate at the granularity of sequences (individual requests or conversations). Each sequence maintains its own set of cached key-value tensors, indexed by layer and position. The cache abstraction must support creating a new sequence's cache, extending it as new tokens are generated, truncating it when the context window overflows, and freeing it when the sequence completes. Multi-tenant environments require isolation between sequences while still letting them draw on a shared memory pool for efficient utilization.
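The create, extend, truncate, and free lifecycle can be illustrated with a fixed pool of cells shared by all sequences. The sketch below is an assumption-laden toy rather than the Ollama implementation: `Pool`, `cell`, and `ErrFull` are hypothetical names, cells track only ownership and position (no tensor data), and sequence IDs are assumed to be non-negative.

```go
package main

import (
	"errors"
	"fmt"
)

const free = -1 // marks a cell that no sequence owns

// ErrFull is returned when every cell in the shared pool is in use.
var ErrFull = errors.New("cache pool is full")

// cell records which sequence owns a slot and the token position it holds.
type cell struct{ seq, pos int }

// Pool is a hypothetical shared pool: sequences are isolated by ownership
// tags while drawing cells from the same backing slice.
type Pool struct {
	cells []cell
	next  map[int]int // next token position for each live sequence
}

func NewPool(capacity int) *Pool {
	p := &Pool{cells: make([]cell, capacity), next: make(map[int]int)}
	for i := range p.cells {
		p.cells[i].seq = free
	}
	return p
}

// Create registers a new sequence (IDs are assumed to be >= 0).
func (p *Pool) Create(seq int) {
	if _, ok := p.next[seq]; !ok {
		p.next[seq] = 0
	}
}

// Extend claims one free cell for the sequence's next token position.
func (p *Pool) Extend(seq int) (int, error) {
	pos, ok := p.next[seq]
	if !ok {
		return 0, fmt.Errorf("unknown sequence %d", seq)
	}
	for i := range p.cells {
		if p.cells[i].seq == free {
			p.cells[i] = cell{seq: seq, pos: pos}
			p.next[seq] = pos + 1
			return i, nil
		}
	}
	return 0, ErrFull
}

// Truncate releases the sequence's cells holding positions below minPos.
func (p *Pool) Truncate(seq, minPos int) {
	for i := range p.cells {
		if p.cells[i].seq == seq && p.cells[i].pos < minPos {
			p.cells[i].seq = free
		}
	}
}

// Free releases every cell owned by the sequence and forgets it.
func (p *Pool) Free(seq int) {
	for i := range p.cells {
		if p.cells[i].seq == seq {
			p.cells[i].seq = free
		}
	}
	delete(p.next, seq)
}

// Len reports how many cells the sequence currently owns.
func (p *Pool) Len(seq int) int {
	n := 0
	for _, c := range p.cells {
		if c.seq == seq {
			n++
		}
	}
	return n
}

func main() {
	p := NewPool(4)
	p.Create(0)
	p.Create(1)
	for i := 0; i < 3; i++ {
		_, _ = p.Extend(0)
	}
	_, _ = p.Extend(1)
	_, err := p.Extend(1) // pool exhausted
	fmt.Println(err)
	p.Truncate(0, 2) // drop the two oldest tokens of sequence 0
	fmt.Println(p.Len(0), p.Len(1))
}
```

Truncating one sequence frees cells that another sequence can claim immediately, which is how a shared pool keeps utilization high without letting sequences read each other's entries.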
Layer-Indexed Storage
Transformer models have multiple layers, each with its own set of attention key-value pairs. The cache abstraction indexes stored tensors by (sequence, layer, position) tuples. This three-dimensional indexing allows layer-specific optimizations: individual layers can store tensors at different precisions (for example, FP16 keys with FP8 values), hybrid models can mix causal and sliding window attention across layers, and specific layers can be designated for cross-attention in encoder-decoder architectures.
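A minimal sketch of (sequence, layer, position) indexing with per-layer behavior might look like the following. All names here (`Store`, `LayerSpec`, `Kind`, `Visible`) are hypothetical, and the example covers only the causal versus sliding window distinction; per-layer precision and cross-attention are left out.

```go
package main

import "fmt"

// Kind selects the attention pattern a layer uses.
type Kind int

const (
	Causal Kind = iota
	SlidingWindow
)

// LayerSpec is a hypothetical per-layer configuration.
type LayerSpec struct {
	Kind   Kind
	Window int // used only when Kind == SlidingWindow
}

// index is the (sequence, layer, position) tuple used as the storage key.
type index struct{ seq, layer, pos int }

// Store keeps one vector per tuple; float32 slices stand in for tensors.
type Store struct {
	specs []LayerSpec
	data  map[index][]float32
}

func NewStore(specs []LayerSpec) *Store {
	return &Store{specs: specs, data: make(map[index][]float32)}
}

// Put stores the cached entry for one token in one layer of one sequence.
func (s *Store) Put(seq, layer, pos int, kv []float32) {
	s.data[index{seq, layer, pos}] = kv
}

// Visible lists the cached positions a layer may attend to when the
// current token sits at position cur. Causal layers see everything up to
// cur; sliding window layers see only the last Window positions.
func (s *Store) Visible(seq, layer, cur int) []int {
	start := 0
	spec := s.specs[layer]
	if spec.Kind == SlidingWindow && cur-spec.Window+1 > 0 {
		start = cur - spec.Window + 1
	}
	var out []int
	for pos := start; pos <= cur; pos++ {
		if _, ok := s.data[index{seq, layer, pos}]; ok {
			out = append(out, pos)
		}
	}
	return out
}

func main() {
	s := NewStore([]LayerSpec{
		{Kind: Causal},
		{Kind: SlidingWindow, Window: 2},
	})
	for layer := 0; layer < 2; layer++ {
		for pos := 0; pos < 4; pos++ {
			s.Put(0, layer, pos, []float32{float32(pos)})
		}
	}
	fmt.Println(s.Visible(0, 0, 3)) // causal layer
	fmt.Println(s.Visible(0, 1, 3)) // sliding window layer
}
```

Keeping the layer in the key lets the same store answer differently per layer, so a hybrid model needs no separate cache object for each attention pattern in this sketch.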
Memory Budget Management
GPU memory is the primary constraint for KV cache capacity. The abstraction must expose methods for querying current memory usage, maximum capacity, and available space. When memory is exhausted, the cache must support eviction strategies: removing the oldest sequences (FIFO), removing the least recently used sequences (LRU), or compacting existing caches through techniques like token dropping or attention sink preservation. The memory budget is typically configured at startup based on hardware discovery results.
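The memory queries and the LRU strategy described above can be sketched with a byte budget that evicts the least recently used sequence when a reservation does not fit. This is an illustrative assumption, not Ollama's eviction logic: `Budget`, `Reserve`, `Touch`, and `ErrTooLarge` are hypothetical names, and a logical counter replaces wall-clock timestamps so the behavior is deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTooLarge is returned when a sequence could not fit even in an empty cache.
var ErrTooLarge = errors.New("request exceeds total cache capacity")

// Budget is a hypothetical byte budget with LRU eviction across sequences.
type Budget struct {
	capacity int
	used     int
	size     map[int]int // bytes held by each sequence
	lastUsed map[int]int // logical time of each sequence's last use
	clock    int
}

func NewBudget(capacity int) *Budget {
	return &Budget{
		capacity: capacity,
		size:     make(map[int]int),
		lastUsed: make(map[int]int),
	}
}

func (b *Budget) Used() int      { return b.used }
func (b *Budget) Capacity() int  { return b.capacity }
func (b *Budget) Available() int { return b.capacity - b.used }

// Touch marks a known sequence as recently used.
func (b *Budget) Touch(seq int) {
	if _, ok := b.size[seq]; ok {
		b.clock++
		b.lastUsed[seq] = b.clock
	}
}

// Reserve grows seq by bytes, evicting least recently used sequences
// (never seq itself) until the request fits. It returns the evicted IDs.
func (b *Budget) Reserve(seq, bytes int) ([]int, error) {
	if bytes < 0 {
		return nil, errors.New("negative reservation")
	}
	if b.size[seq]+bytes > b.capacity {
		return nil, ErrTooLarge
	}
	var evicted []int
	for b.used+bytes > b.capacity {
		victim, found := 0, false
		for s := range b.size {
			if s != seq && (!found || b.lastUsed[s] < b.lastUsed[victim]) {
				victim, found = s, true
			}
		}
		if !found {
			// Defensive: the capacity check above should make this unreachable.
			return evicted, ErrTooLarge
		}
		b.used -= b.size[victim]
		delete(b.size, victim)
		delete(b.lastUsed, victim)
		evicted = append(evicted, victim)
	}
	b.size[seq] += bytes
	b.used += bytes
	b.clock++
	b.lastUsed[seq] = b.clock
	return evicted, nil
}

func main() {
	b := NewBudget(100)
	_, _ = b.Reserve(1, 40)
	_, _ = b.Reserve(2, 40)
	b.Touch(1) // sequence 2 is now the least recently used
	evicted, _ := b.Reserve(3, 40)
	fmt.Println(evicted, b.Available())
}
```

Swapping the victim selection for "smallest creation time" would turn this into the FIFO strategy; compaction techniques such as token dropping would instead shrink `size[seq]` without removing the sequence.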
Implementation Notes
In the Ollama codebase, the KV cache abstraction is defined as a Go interface and a corresponding C/C++ abstract base in the llama.cpp layer. The Go-level interface wraps the native cache through the CGo bridge, providing methods for cache creation, sequence management, memory queries, and defragmentation. Concrete implementations include causal caches (for standard autoregressive models), encoder caches (for encoder-decoder models), sliding window caches (for models with limited attention spans), and hybrid caches (for models combining attention with recurrent state). The abstraction allows the scheduler and inference handler to manage cache lifecycles without knowing which specific cache type is in use.