Implementation:Ollama Ollama MLXRunner Cache
| Knowledge Sources | |
|---|---|
| Domains | MLX Runtime, KV Cache |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements prefix-tree based KV cache lookup and insertion for the MLX runner, enabling efficient prompt reuse across inference requests.
Description
CacheEntry stores per-layer KV caches in a trie structure keyed by token IDs. FindNearestCache walks the trie to find the longest matching prefix: exact matches reuse the cached state directly, partial matches return the cached state with remaining tokens to process, and prefix matches (where the cache extends beyond the current tokens) clone and trim the nearest ancestor cache. InsertCache stores new cache states at the end of a completed generation sequence.
Usage
Called at the beginning of each generation request to avoid recomputing KV states for repeated prompt prefixes, significantly reducing time-to-first-token for multi-turn conversations.
Code Reference
Source Location
- Repository: Ollama
- File: x/mlxrunner/cache.go
- Lines: 1-96
Signature
type CacheEntry struct {
Caches []cache.Cache
Count int
Entries map[int32]*CacheEntry
}
func (s Runner) FindNearestCache(tokens []int32) ([]cache.Cache, []int32)
func (s *Runner) InsertCache(tokens []int32, caches []cache.Cache)
Import
import "github.com/ollama/ollama/x/mlxrunner"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokens | []int32 | Yes | Token ID sequence to look up in the cache trie |
Outputs
| Name | Type | Description |
|---|---|---|
| caches | []cache.Cache | Matching KV caches (nil on miss) |
| remaining | []int32 | Tokens still needing processing |
Usage Examples
// Look up cache for a prompt
caches, tokens := runner.FindNearestCache(inputTokens)
if caches == nil {
// Full cache miss - process all tokens
caches = make([]cache.Cache, model.NumLayers())
}
// After generation, store the cache
runner.InsertCache(append(inputTokens, outputTokens...), caches)