Implementation:Ollama Ollama MLXRunner KV Cache

From Leeroopedia
Knowledge Sources
Domains: MLX Runtime, KV Cache
Last Updated: 2025-02-15 00:00 GMT

Overview

Implements the Cache interface with two variants: a standard growing KVCache and a RotatingKVCache for sliding window attention with bounded memory.

Description

KVCache stores key and value tensors in pre-allocated buffers that grow in steps of 256 positions. Update appends the new key/value pairs at the current offset and grows the buffer when the appended positions would exceed its capacity. State returns the stored keys and values, Trim removes positions from the end, and Clone copies the cache (for example, for prefix caching).

RotatingKVCache embeds KVCache and adds a maximum size with circular-buffer behavior: once the buffer is full, new entries overwrite the oldest positions. Memory use therefore stays bounded by the maximum size regardless of sequence length, which suits sliding window attention models.
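The step-wise growth can be illustrated with a minimal sketch. It is not the code in cache.go: `growCache` and its `update` method are hypothetical stand-ins that store one int per position instead of slices of an `*mlx.Array`, and rounding the capacity up to the next multiple of 256 is an assumption about how the growth step is applied.

```go
package main

// step is the growth increment named in the description.
const step = 256

// growCache is a hypothetical stand-in for KVCache. It keeps one int per
// cached position so the growth logic is visible without MLX.
type growCache struct {
	buf    []int // pre-allocated storage; len(buf) is the current capacity
	offset int   // number of valid positions written so far
}

// update appends entries at the current offset. When the buffer is too
// small it is reallocated at the next multiple of step (an assumption),
// and the valid prefix is copied over. It returns the valid history.
func (c *growCache) update(entries []int) []int {
	need := c.offset + len(entries)
	if need > len(c.buf) {
		newCap := (need + step - 1) / step * step
		grown := make([]int, newCap)
		copy(grown, c.buf[:c.offset])
		c.buf = grown
	}
	copy(c.buf[c.offset:], entries)
	c.offset = need
	return c.buf[:c.offset]
}
```

Under these assumptions, appending 10 positions to an empty cache allocates 256 slots, and appending 250 more (260 in total) grows the buffer to 512. Growing in steps avoids a reallocation on every decoded token.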

Usage

One cache is instantiated per layer during inference. Standard KVCache serves full attention models; RotatingKVCache serves sliding window attention architectures.
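The circular overwrite that keeps a sliding window cache bounded can be sketched with plain slices. `ringCache`, `newRingCache`, `update`, and `window` are hypothetical names for illustration only; the real RotatingKVCache operates on `*mlx.Array` tensors and may order or return its contents differently. The sketch assumes a maximum size greater than zero and one position per update.

```go
package main

// ringCache is a hypothetical stand-in for a rotating cache. It stores one
// int per position instead of key/value tensors.
type ringCache struct {
	buf     []int
	maxSize int // upper bound on retained positions (must be > 0)
	idx     int // slot that the next write will use
	offset  int // total number of positions ever written
}

func newRingCache(maxSize int) *ringCache {
	return &ringCache{buf: make([]int, 0, maxSize), maxSize: maxSize}
}

// update stores one entry. While the buffer is filling it appends; once
// full, it overwrites the oldest slot and advances the write index.
func (c *ringCache) update(entry int) {
	if len(c.buf) < c.maxSize {
		c.buf = append(c.buf, entry)
	} else {
		c.buf[c.idx] = entry
	}
	c.idx = (c.idx + 1) % c.maxSize
	c.offset++
}

// window returns the retained entries in temporal order, oldest first.
func (c *ringCache) window() []int {
	if len(c.buf) < c.maxSize {
		return append([]int(nil), c.buf...)
	}
	out := make([]int, 0, c.maxSize)
	out = append(out, c.buf[c.idx:]...) // older half, starting at the next overwrite slot
	out = append(out, c.buf[:c.idx]...) // newer half, written after the wrap
	return out
}
```

With a maximum size of 4, writing the values 0 through 5 leaves the window holding 2, 3, 4, 5 while the offset counts all 6 writes. Storage never exceeds the maximum size, which is the property the sliding window variant relies on.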

Code Reference

Source Location

  • Repository: Ollama
  • File: x/mlxrunner/cache/cache.go
  • Lines: 1-198

Signature

type Cache interface {
    Update(keys, values *mlx.Array) (newKeys, newValues *mlx.Array)
    State() (keys, values *mlx.Array)
    Trim(int) int
    Clone() Cache
    Offset() int
    Len() int
}

type KVCache struct {
    keys, values *mlx.Array
    offset       int
    step         int
}

func NewKVCache() *KVCache
func (c *KVCache) Update(keys, values *mlx.Array) (*mlx.Array, *mlx.Array)
func (c *KVCache) Clone() Cache
func (c *KVCache) Trim(n int) int

type RotatingKVCache struct {
    maxSize int
    idx     int
    *KVCache
}

func NewRotatingKVCache(maxSize int) *RotatingKVCache
func (c *RotatingKVCache) Update(keys, values *mlx.Array) (*mlx.Array, *mlx.Array)

Import

import "github.com/ollama/ollama/x/mlxrunner/cache"

I/O Contract

Inputs

Name | Type | Required | Description
keys | *mlx.Array | Yes | Key tensor [batch, heads, seq_len, dim]
values | *mlx.Array | Yes | Value tensor [batch, heads, seq_len, dim]

Outputs

Name | Type | Description
newKeys | *mlx.Array | Concatenated key history up to current offset
newValues | *mlx.Array | Concatenated value history up to current offset
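For the standard KVCache, the contract above implies a simple shape relationship between input and output, sketched below. `outputShape` is a hypothetical helper, not part of the cache package, and it assumes the returned history covers every position written so far (the previous offset plus the incoming seq_len). It does not model RotatingKVCache, whose returned length is bounded by its window.

```go
package main

// outputShape computes the expected shape of the tensors returned by a
// standard (growing) cache update, given the input shape
// [batch, heads, seq_len, dim] and the cache offset before the update.
// Only the sequence axis changes.
func outputShape(in [4]int, offsetBefore int) [4]int {
	out := in
	out[2] = offsetBefore + in[2]
	return out
}
```

For instance, a single-token decode step with input shape [1, 8, 1, 64] against a cache that already holds 12 positions would yield [1, 8, 13, 64] under this assumption.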

Usage Examples

// newKeys and newValues are the *mlx.Array key/value tensors for the current step.

// Standard cache for full attention
kv := cache.NewKVCache()
keys, values := kv.Update(newKeys, newValues)

// Sliding window cache for bounded memory
rkv := cache.NewRotatingKVCache(4096)
keys, values = rkv.Update(newKeys, newValues) // "=" rather than ":=": both names are already declared

// Clone for prefix caching
cloned := kv.Clone()
cloned.Trim(10) // Remove last 10 positions
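The clone-then-trim pattern used for prefix caching can be sketched without MLX. `prefixCache`, `clone`, and `trim` are hypothetical stand-ins. The sketch assumes that a clone is independent of its source and that the int returned by Trim is the number of positions actually removed, clamped to what the cache holds; neither is confirmed by the signatures alone.

```go
package main

// prefixCache is a hypothetical stand-in that stores one int per position.
type prefixCache struct {
	entries []int
}

// clone returns an independent copy, so trimming the copy leaves the
// original untouched (assumed behavior).
func (c *prefixCache) clone() *prefixCache {
	return &prefixCache{entries: append([]int(nil), c.entries...)}
}

// trim removes up to n positions from the end and reports how many were
// actually removed (assumed meaning of the return value).
func (c *prefixCache) trim(n int) int {
	if n < 0 {
		n = 0
	}
	if n > len(c.entries) {
		n = len(c.entries)
	}
	c.entries = c.entries[:len(c.entries)-n]
	return n
}
```

Under these assumptions, a cached prompt prefix can be cloned once per request and trimmed back to the shared portion, while the original stays available for later requests.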
