Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Ollama Ollama MLXRunner Cache

From Leeroopedia
Knowledge Sources
Domains MLX Runtime, KV Cache
Last Updated 2025-02-15 00:00 GMT

Overview

Implements prefix-tree based KV cache lookup and insertion for the MLX runner, enabling efficient prompt reuse across inference requests.

Description

CacheEntry stores per-layer KV caches in a trie structure keyed by token IDs. FindNearestCache walks the trie to find the longest matching prefix: exact matches reuse the cached state directly, partial matches return the cached state with remaining tokens to process, and prefix matches (where the cache extends beyond the current tokens) clone and trim the nearest ancestor cache. InsertCache stores new cache states at the end of a completed generation sequence.

Usage

Called at the beginning of each generation request to avoid recomputing KV states for repeated prompt prefixes, significantly reducing time-to-first-token for multi-turn conversations.

Code Reference

Source Location

  • Repository: Ollama
  • File: x/mlxrunner/cache.go
  • Lines: 1-96

Signature

type CacheEntry struct {
    Caches  []cache.Cache
    Count   int
    Entries map[int32]*CacheEntry
}

func (s Runner) FindNearestCache(tokens []int32) ([]cache.Cache, []int32)
func (s *Runner) InsertCache(tokens []int32, caches []cache.Cache)

Import

import "github.com/ollama/ollama/x/mlxrunner"

I/O Contract

Inputs

Name Type Required Description
tokens []int32 Yes Token ID sequence to look up in the cache trie

Outputs

Name Type Description
caches []cache.Cache Matching KV caches (nil on miss)
remaining []int32 Tokens still needing processing

Usage Examples

// Look up cache for a prompt
caches, tokens := runner.FindNearestCache(inputTokens)
if caches == nil {
    // Full cache miss - process all tokens
    caches = make([]cache.Cache, model.NumLayers())
}

// After generation, store the cache
runner.InsertCache(append(inputTokens, outputTokens...), caches)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment