Implementation:Ollama Ollama KVCache Interface

Knowledge Sources	Ollama
Domains	Inference Runtime, KV Cache Management
Last Updated	2025-02-15 00:00 GMT

Overview

Defines the Cache interface that all KV cache implementations must satisfy, providing a unified contract for storing and retrieving key-value tensors during model inference.

Description

The cache.go file declares the Cache interface with methods grouped into two categories: model-facing operations (SetLayer, Get, Put, SetConfig) used during forward passes, and cache management operations (Init, StartForward, Remove, Close) used by the runtime scheduler. It also defines sentinel error variables ErrKvCacheFull and ErrNotSupported used across all cache implementations.

Usage

Model and runtime code depend on the Cache interface rather than on a concrete implementation, so any cache type that satisfies it can be swapped in. Model implementations call SetLayer, Put, and Get during inference. The runtime scheduler calls Init, StartForward, Remove, and Close for lifecycle management.

Code Reference

Source Location

Repository: Ollama
File: kvcache/cache.go
Lines: 1-84

Signature

// Sentinel errors shared by all cache implementations.
var (
    ErrKvCacheFull  = errors.New("could not find a kv cache slot")
    ErrNotSupported = errors.New("model does not support operation")
)

type Cache interface {
    // Model-facing operations, used during forward passes.
    SetLayer(layer int)
    Get(ctx ml.Context) (ml.Tensor, ml.Tensor, ml.Tensor) // key, value, mask
    Put(ctx ml.Context, key, value ml.Tensor)
    SetConfig(ml.CacheConfig)

    // Cache management operations, used by the runtime scheduler.
    Init(backend ml.Backend, dtype ml.DType, maxSequences, capacity, maxBatch int)
    StartForward(ctx ml.Context, batch input.Batch, reserve bool) error
    Remove(seqId int, beginIndex, endIndex int32) error
    Close()
}

Import

import "github.com/ollama/ollama/kvcache"

I/O Contract

Inputs

Name	Type	Required	Description
layer	int	Yes	Layer index to set as active for Get/Put operations
ctx	ml.Context	Yes	ML context for tensor operations
key	ml.Tensor	Yes	Key tensor to store (Put)
value	ml.Tensor	Yes	Value tensor to store (Put)
batch	input.Batch	Yes	Current input batch with positions and sequence IDs

Outputs

Name	Type	Description
key	ml.Tensor	Cached key tensor history
value	ml.Tensor	Cached value tensor history
mask	ml.Tensor	Attention mask for the cached history
error	error	Returned by StartForward and Remove. ErrKvCacheFull when no cache slot can be found; ErrNotSupported when the model does not support the requested operation

Usage Examples

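// Use the Cache interface polymorphically: runLayer works with any implementation.
// Assumes the runtime has already called StartForward for the current batch.
func runLayer(cache kvcache.Cache, ctx ml.Context, layer int, k, v ml.Tensor) (ml.Tensor, ml.Tensor, ml.Tensor) {
    cache.SetLayer(layer) // select the layer that Put and Get operate on
    cache.Put(ctx, k, v)  // store this batch's key and value tensors
    return cache.Get(ctx) // cached key history, value history, and attention mask
}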

Related Pages

Principle:Ollama_Ollama_KVCache_Abstraction
