Implementation:Ollama Ollama MLXRunner KV Cache
| Knowledge Sources | |
|---|---|
| Domains | MLX Runtime, KV Cache |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Defines the Cache interface and two implementations of it: a standard KVCache whose buffer grows with the sequence, and a RotatingKVCache that bounds memory for sliding window attention.
Description
KVCache stores key and value tensors in pre-allocated buffers that grow in steps of 256 positions. Update appends the new key/value pairs and grows the buffers when they no longer fit. Clone, Trim, and State support cache management.
RotatingKVCache embeds KVCache and adds a maximum size with circular-buffer behavior: once the buffer is full, new entries overwrite the oldest positions. This lets sliding window attention models keep memory usage constant regardless of sequence length.
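The step-wise growth described above can be illustrated with a minimal sketch. It stores one int per position instead of key/value tensors; the names `toyCache` and `update` are hypothetical and not part of the package, and the rounding rule is an assumption about how "steps of 256" is applied.

```go
package main

import "fmt"

// step mirrors the 256-position growth increment described above.
const step = 256

// toyCache is a hypothetical stand-in that stores one int per position
// instead of key/value tensors.
type toyCache struct {
	buf    []int // pre-allocated storage; len(buf) is the capacity in positions
	offset int   // number of valid positions
}

// update appends vals and returns the valid history. When the new
// entries do not fit, capacity is rounded up to the next multiple of
// step (an assumed policy for this sketch).
func (c *toyCache) update(vals []int) []int {
	need := c.offset + len(vals)
	if need > len(c.buf) {
		newCap := (need + step - 1) / step * step
		grown := make([]int, newCap)
		copy(grown, c.buf[:c.offset])
		c.buf = grown
	}
	copy(c.buf[c.offset:], vals)
	c.offset = need
	return c.buf[:c.offset]
}

func main() {
	c := &toyCache{}
	c.update(make([]int, 10))
	fmt.Println(len(c.buf), c.offset) // 256 10
	c.update(make([]int, 300))
	fmt.Println(len(c.buf), c.offset) // 512 310
}
```

Growing in fixed steps trades a little unused capacity for fewer reallocations than resizing on every appended token.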
Usage
One cache is instantiated per layer during inference. Standard KVCache is used for full attention models; RotatingKVCache is used for sliding window attention architectures.
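As a rough sketch of per-layer instantiation, the example below picks a cache kind for each layer. Everything in it is hypothetical: `layerCache`, `growing`, and `rotating` are local stand-ins for the real types, and the `windows` slice (0 meaning full attention) is an assumed way of describing a model's layers.

```go
package main

import "fmt"

// layerCache is a hypothetical stand-in for the Cache interface.
type layerCache interface{ kind() string }

type growing struct{}                // stands in for KVCache
type rotating struct{ maxSize int }  // stands in for RotatingKVCache

func (growing) kind() string    { return "growing" }
func (r rotating) kind() string { return fmt.Sprintf("rotating(%d)", r.maxSize) }

// newLayerCaches builds one cache per layer. windows[i] is the assumed
// sliding-window size of layer i; 0 means the layer uses full attention.
func newLayerCaches(windows []int) []layerCache {
	caches := make([]layerCache, len(windows))
	for i, w := range windows {
		if w > 0 {
			caches[i] = rotating{maxSize: w}
		} else {
			caches[i] = growing{}
		}
	}
	return caches
}

func main() {
	for i, c := range newLayerCaches([]int{0, 4096, 4096, 0}) {
		fmt.Println(i, c.kind())
	}
}
```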
Code Reference
Source Location
- Repository: Ollama
- File: x/mlxrunner/cache/cache.go
- Lines: 1-198
Signature
type Cache interface {
	Update(keys, values *mlx.Array) (newKeys, newValues *mlx.Array)
	State() (keys, values *mlx.Array)
	Trim(int) int
	Clone() Cache
	Offset() int
	Len() int
}

type KVCache struct {
	keys, values *mlx.Array
	offset       int
	step         int
}

func NewKVCache() *KVCache
func (c *KVCache) Update(keys, values *mlx.Array) (*mlx.Array, *mlx.Array)
func (c *KVCache) Clone() Cache
func (c *KVCache) Trim(n int) int

type RotatingKVCache struct {
	maxSize int
	idx     int
	*KVCache
}

func NewRotatingKVCache(maxSize int) *RotatingKVCache
func (c *RotatingKVCache) Update(keys, values *mlx.Array) (*mlx.Array, *mlx.Array)
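The circular overwrite behind a rotating cache can be sketched on plain integers. This is an illustration under stated assumptions, not the package's implementation: `ring`, `newRing`, `update`, and `window` are hypothetical names, the real cache works on tensors along the sequence axis, and whether it returns entries in temporal order is not stated in this document.

```go
package main

import "fmt"

// ring is a hypothetical fixed-capacity buffer holding one int per
// position in place of key/value tensors. maxSize must be positive.
type ring struct {
	buf     []int
	maxSize int
	idx     int // next slot to write
	offset  int // total positions ever written
}

func newRing(maxSize int) *ring {
	return &ring{buf: make([]int, 0, maxSize), maxSize: maxSize}
}

// update writes each value at idx and wraps idx back to slot 0 after
// the last slot, so a full buffer overwrites its oldest entry.
func (r *ring) update(vals ...int) {
	for _, v := range vals {
		if len(r.buf) < r.maxSize {
			r.buf = append(r.buf, v) // still filling the buffer
		} else {
			r.buf[r.idx] = v // buffer full: overwrite the oldest slot
		}
		r.idx = (r.idx + 1) % r.maxSize
		r.offset++
	}
}

// window returns the retained positions, oldest first.
func (r *ring) window() []int {
	if len(r.buf) < r.maxSize {
		return append([]int(nil), r.buf...)
	}
	out := make([]int, 0, r.maxSize)
	out = append(out, r.buf[r.idx:]...)
	return append(out, r.buf[:r.idx]...)
}

func main() {
	r := newRing(4)
	r.update(1, 2, 3, 4, 5, 6)
	fmt.Println(r.buf, r.window(), r.offset) // [5 6 3 4] [3 4 5 6] 6
}
```

Storage never exceeds `maxSize` slots, while `offset` keeps counting so that positions in the full sequence remain known.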
Import
import "github.com/ollama/ollama/x/mlxrunner/cache"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| keys | *mlx.Array | Yes | Key tensor [batch, heads, seq_len, dim] |
| values | *mlx.Array | Yes | Value tensor [batch, heads, seq_len, dim] |
Outputs
| Name | Type | Description |
|---|---|---|
| newKeys | *mlx.Array | Concatenated key history up to current offset |
| newValues | *mlx.Array | Concatenated value history up to current offset |
Usage Examples
// Standard cache for full attention
kv := cache.NewKVCache()
keys, values := kv.Update(newKeys, newValues)

// Sliding window cache for bounded memory
rkv := cache.NewRotatingKVCache(4096)
keys, values = rkv.Update(newKeys, newValues) // plain assignment: keys and values are already declared

// Clone for prefix caching
cloned := kv.Clone()
cloned.Trim(10) // Remove last 10 positions
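The clone-then-trim pattern for prefix reuse can be sketched with a toy history of token positions. The sketch assumes that Clone yields an independent copy and that Trim drops the most recent positions and returns how many were dropped; neither detail is confirmed here, and `history`, `clone`, and `trim` are hypothetical names.

```go
package main

import "fmt"

// history is a hypothetical stand-in for a cache, holding one int per
// cached position.
type history struct {
	tokens []int
}

// clone returns a copy that shares no storage with the original
// (assumed semantics for this sketch).
func (h *history) clone() *history {
	return &history{tokens: append([]int(nil), h.tokens...)}
}

// trim drops up to n of the most recent positions and reports how many
// were dropped (assumed semantics for this sketch).
func (h *history) trim(n int) int {
	if n < 0 {
		n = 0
	}
	if n > len(h.tokens) {
		n = len(h.tokens)
	}
	h.tokens = h.tokens[:len(h.tokens)-n]
	return n
}

func main() {
	base := &history{tokens: []int{1, 2, 3, 4, 5}}
	branch := base.clone()
	dropped := branch.trim(2) // keep only the shared prefix
	fmt.Println(dropped, len(branch.tokens), len(base.tokens)) // 2 3 5
}
```

Trimming the clone leaves the original untouched, so one cached prefix can serve several continuations.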