Implementation:Ollama Ollama MLXRunner KV Cache

Knowledge Sources	Ollama
Domains	MLX Runtime, KV Cache
Last Updated	2025-02-15 00:00 GMT

Overview

Provides two implementations of the Cache interface: a standard KVCache that grows with the sequence, and a RotatingKVCache that keeps memory bounded for sliding window attention.

Description

KVCache stores key and value tensors in pre-allocated buffers that grow in steps of 256 positions. Update appends the new key/value pairs and grows the buffer when it runs out of room. Clone, Trim, and State support cache management.

RotatingKVCache embeds KVCache and adds a maximum size with circular-buffer behavior: once the buffer is full, new entries overwrite the oldest positions. This lets sliding window attention models keep memory usage constant regardless of sequence length.
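The two sizing rules described above can be illustrated with plain integer arithmetic. This is a minimal sketch, not the code in cache.go: the helper names grownCapacity and rotatingSlot are hypothetical, and it assumes that capacity is rounded up to a multiple of the step and that the rotating write position simply wraps modulo the maximum size. The actual implementation may handle these details differently.

```go
package main

import "fmt"

// step mirrors the 256-position growth increment described in the text.
const step = 256

// grownCapacity returns the smallest multiple of step that can hold
// offset+incoming positions. Hypothetical helper, not part of cache.go.
func grownCapacity(offset, incoming int) int {
	need := offset + incoming
	return (need + step - 1) / step * step
}

// rotatingSlot returns the buffer slot that the token at absolute position
// pos would occupy in a circular buffer with maxSize slots. Hypothetical
// helper that assumes a plain modulo wrap.
func rotatingSlot(pos, maxSize int) int {
	return pos % maxSize
}

func main() {
	fmt.Println(grownCapacity(0, 1))      // 256
	fmt.Println(grownCapacity(250, 10))   // 512
	fmt.Println(rotatingSlot(4095, 4096)) // 4095
	fmt.Println(rotatingSlot(4096, 4096)) // 0
}
```

Under these assumptions, a growing cache reallocates only once every 256 positions, while a rotating cache never exceeds maxSize slots because position 4096 lands back on slot 0.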

Usage

One cache is instantiated per layer during inference. Standard KVCache is used for full attention models; RotatingKVCache is used for sliding window attention architectures.
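Per-layer instantiation might look like the sketch below. Everything in it is a stand-in so the example compiles without MLX: layerCache, fullCache, windowCache, and newLayerCaches are hypothetical names, and the convention that a window size of 0 means full attention is an assumption made for illustration. In real code the two branches would call cache.NewKVCache() and cache.NewRotatingKVCache(maxSize).

```go
package main

import "fmt"

// layerCache is a stand-in for the cache.Cache interface, reduced to a
// single method so the example has no MLX dependency.
type layerCache interface {
	Offset() int
}

// fullCache stands in for KVCache (hypothetical).
type fullCache struct{ offset int }

func (c *fullCache) Offset() int { return c.offset }

// windowCache stands in for RotatingKVCache (hypothetical). It embeds
// fullCache, much as RotatingKVCache embeds *KVCache.
type windowCache struct {
	fullCache
	maxSize int
}

// newLayerCaches builds one cache per layer. windows[i] is the sliding
// window size of layer i; 0 is assumed here to mean full attention.
func newLayerCaches(windows []int) []layerCache {
	caches := make([]layerCache, len(windows))
	for i, w := range windows {
		if w > 0 {
			caches[i] = &windowCache{maxSize: w}
		} else {
			caches[i] = &fullCache{}
		}
	}
	return caches
}

func main() {
	// Three layers: full attention, a 4096-token window, full attention.
	for i, c := range newLayerCaches([]int{0, 4096, 0}) {
		_, bounded := c.(*windowCache)
		fmt.Printf("layer %d: bounded=%v offset=%d\n", i, bounded, c.Offset())
	}
}
```

Keeping the caches behind one interface lets the attention code call Update on each layer without knowing which variant it holds.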

Code Reference

Source Location

Repository: Ollama
File: x/mlxrunner/cache/cache.go
Lines: 1-198

Signature

type Cache interface {
    Update(keys, values *mlx.Array) (newKeys, newValues *mlx.Array)
    State() (keys, values *mlx.Array)
    Trim(int) int
    Clone() Cache
    Offset() int
    Len() int
}

type KVCache struct {
    keys, values *mlx.Array
    offset       int
    step         int
}

func NewKVCache() *KVCache
func (c *KVCache) Update(keys, values *mlx.Array) (*mlx.Array, *mlx.Array)
func (c *KVCache) Clone() Cache
func (c *KVCache) Trim(n int) int

type RotatingKVCache struct {
    maxSize int
    idx     int
    *KVCache
}

func NewRotatingKVCache(maxSize int) *RotatingKVCache
func (c *RotatingKVCache) Update(keys, values *mlx.Array) (*mlx.Array, *mlx.Array)

Import

import "github.com/ollama/ollama/x/mlxrunner/cache"

I/O Contract

Inputs

Name	Type	Required	Description
keys	*mlx.Array	Yes	Key tensor [batch, heads, seq_len, dim]
values	*mlx.Array	Yes	Value tensor [batch, heads, seq_len, dim]

Outputs

Name	Type	Description
newKeys	*mlx.Array	Concatenated key history up to current offset
newValues	*mlx.Array	Concatenated value history up to current offset

Usage Examples

// Standard cache for full attention
kv := cache.NewKVCache()
keys, values := kv.Update(newKeys, newValues)

// Sliding window cache for bounded memory
rkv := cache.NewRotatingKVCache(4096)
keys, values = rkv.Update(newKeys, newValues) // plain assignment: keys and values are already declared above

// Clone for prefix caching
cloned := kv.Clone()
cloned.Trim(10) // Remove last 10 positions
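The clone-then-trim pattern can be checked with a toy model that tracks positions in a slice instead of MLX arrays. This is a sketch under stated assumptions: toyCache is hypothetical, and it assumes that Trim clamps its argument to the number of cached positions and returns how many were actually removed, which is one plausible reading of the `Trim(int) int` signature. It also assumes Clone yields an independent copy.

```go
package main

import "fmt"

// toyCache is a hypothetical stand-in that stores one int per cached
// position instead of key/value tensors.
type toyCache struct {
	positions []int
}

// Clone returns a copy that shares no backing storage with the original.
func (c *toyCache) Clone() *toyCache {
	return &toyCache{positions: append([]int(nil), c.positions...)}
}

// Trim drops up to n trailing positions and reports how many were removed.
// Clamping to the current length is an assumption made for this sketch.
func (c *toyCache) Trim(n int) int {
	if n < 0 {
		n = 0
	}
	if n > len(c.positions) {
		n = len(c.positions)
	}
	c.positions = c.positions[:len(c.positions)-n]
	return n
}

// Offset reports the number of cached positions.
func (c *toyCache) Offset() int { return len(c.positions) }

func main() {
	kv := &toyCache{positions: make([]int, 12)}
	cloned := kv.Clone()

	fmt.Println(cloned.Trim(10)) // 10
	fmt.Println(cloned.Offset()) // 2
	fmt.Println(kv.Offset())     // 12
	fmt.Println(cloned.Trim(5))  // 2
}
```

In this model, trimming the clone leaves the original at 12 positions, which is the property prefix caching relies on.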

Related Pages

Principle:Ollama_Ollama_MLXRunner_Architecture
