Implementation:Ollama Ollama MLXRunner Cache

Knowledge Sources	Ollama
Domains	MLX Runtime, KV Cache
Last Updated	2025-02-15 00:00 GMT

Overview

Implements prefix-tree based KV cache lookup and insertion for the MLX runner, enabling efficient prompt reuse across inference requests.

Description

CacheEntry stores per-layer KV caches in a trie structure keyed by token IDs. FindNearestCache walks the trie to find the longest matching prefix: exact matches reuse the cached state directly, partial matches return the cached state with remaining tokens to process, and prefix matches (where the cache extends beyond the current tokens) clone and trim the nearest ancestor cache. InsertCache stores new cache states at the end of a completed generation sequence.

Usage

Called at the beginning of each generation request to avoid recomputing KV states for repeated prompt prefixes, significantly reducing time-to-first-token for multi-turn conversations.

Code Reference

Source Location

Repository: Ollama
File: x/mlxrunner/cache.go
Lines: 1-96

Signature

type CacheEntry struct {
    Caches  []cache.Cache
    Count   int
    Entries map[int32]*CacheEntry
}

func (s Runner) FindNearestCache(tokens []int32) ([]cache.Cache, []int32)
func (s *Runner) InsertCache(tokens []int32, caches []cache.Cache)

Import

import "github.com/ollama/ollama/x/mlxrunner"

I/O Contract

Inputs

Name	Type	Required	Description
tokens	[]int32	Yes	Token ID sequence to look up in the cache trie

Outputs

Name	Type	Description
caches	[]cache.Cache	Matching KV caches (nil on miss)
remaining	[]int32	Tokens still needing processing

Usage Examples

// Look up cache for a prompt
caches, tokens := runner.FindNearestCache(inputTokens)
if caches == nil {
    // Full cache miss - process all tokens
    caches = make([]cache.Cache, model.NumLayers())
}

// After generation, store the cache
runner.InsertCache(append(inputTokens, outputTokens...), caches)

Related Pages

Principle:Ollama_Ollama_MLXRunner_Architecture

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment