Implementation:Ollama Ollama Llama KV Cache

Knowledge Sources	Ollama
Domains	LLM Inference, Memory Management
Last Updated	2025-02-15 00:00 GMT

Overview

Implements the primary KV (key-value) cache for transformer attention, handling cache allocation, slot management, sequence operations, defragmentation, state serialization, and the graph-level K-shift operation.

Description

The constructor allocates per-layer key and value tensors with appropriate backend buffer types and supports both streaming (per-sequence) and unified modes. The class implements slot finding and assignment for incoming batches, sequence operations (rm, cp, keep, add, div) that manipulate the underlying llama_kv_cells metadata, and defragmentation, which compacts fragmented cache slots. The companion llama_kv_cache_context class manages batch-level cache state, including slot allocation, stream copy operations, and generation of the attention mask and position inputs for the compute graph.
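The slot-finding and sequence-metadata ideas described above can be illustrated with a small standalone sketch. It is not the llama.cpp implementation: `toy_cell`, `toy_kv_cells`, and all of their members are hypothetical names invented for this example, and the search strategy (first contiguous run of empty cells) is an assumption chosen for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for the metadata of one cache cell.
struct toy_cell {
    int32_t pos = -1;           // token position; -1 while the cell is empty
    std::set<int32_t> seq_ids;  // sequences that currently reference this cell

    bool is_empty() const { return seq_ids.empty(); }
};

// Hypothetical metadata table; the real cache also owns the K/V tensors.
struct toy_kv_cells {
    std::vector<toy_cell> cells;

    explicit toy_kv_cells(uint32_t kv_size) : cells(kv_size) {}

    // Index of the first run of n_tokens empty cells, or -1 if there is none.
    int32_t find_slot(uint32_t n_tokens) const {
        if (n_tokens == 0 || n_tokens > cells.size()) {
            return -1;
        }
        uint32_t run = 0;
        for (uint32_t i = 0; i < cells.size(); ++i) {
            run = cells[i].is_empty() ? run + 1 : 0;
            if (run == n_tokens) {
                return (int32_t) (i + 1 - n_tokens);
            }
        }
        return -1;
    }

    // Assign n_tokens cells starting at head to seq_id with consecutive positions.
    void apply(int32_t head, int32_t seq_id, int32_t pos0, uint32_t n_tokens) {
        for (uint32_t i = 0; i < n_tokens; ++i) {
            const size_t idx = (size_t) head + i;
            assert(cells[idx].is_empty());
            cells[idx].pos = pos0 + (int32_t) i;
            cells[idx].seq_ids.insert(seq_id);
        }
    }

    // Let seq_dst share every cell of seq_src; only metadata changes.
    void seq_cp(int32_t seq_src, int32_t seq_dst) {
        for (auto & c : cells) {
            if (c.seq_ids.count(seq_src) > 0) {
                c.seq_ids.insert(seq_dst);
            }
        }
    }

    // Drop seq_id from cells with p0 <= pos < p1; a negative bound is open-ended.
    void seq_rm(int32_t seq_id, int32_t p0, int32_t p1) {
        for (auto & c : cells) {
            const bool in_range = (p0 < 0 || c.pos >= p0) && (p1 < 0 || c.pos < p1);
            if (in_range && c.seq_ids.erase(seq_id) > 0 && c.seq_ids.empty()) {
                c.pos = -1;  // no sequence references the cell any more
            }
        }
    }
};
```

In this sketch a cell becomes reusable only after the last sequence referencing it has been removed, which is why copying a sequence needs no data movement.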

Usage

This class is the central memory component for transformer inference. Attention-based models route their key/value storage through it: the key/value pairs computed for previous tokens are kept in the cache, so each autoregressive generation step only has to compute keys and values for the new tokens instead of reprocessing the whole prefix.
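To make the benefit concrete, the following sketch shows single-head attention over a growing store of keys and values. It is a teaching example only: `toy_kv_store`, `append`, and `attend` are hypothetical names, vectors stand in for ggml tensors, and nothing here reflects how the real class lays out memory.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head store: one key and one value vector per past token.
struct toy_kv_store {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;

    // Called once per new token; earlier entries are never recomputed.
    void append(const std::vector<float> & k, const std::vector<float> & v) {
        keys.push_back(k);
        values.push_back(v);
    }

    // Scaled dot-product attention of one query over everything stored so far.
    std::vector<float> attend(const std::vector<float> & q) const {
        if (keys.empty()) {
            return {};
        }
        const float scale = 1.0f / std::sqrt((float) q.size());

        // Raw scores, tracking the maximum for a numerically stable softmax.
        std::vector<float> scores(keys.size());
        float max_score = -INFINITY;
        for (size_t i = 0; i < keys.size(); ++i) {
            float dot = 0.0f;
            for (size_t d = 0; d < q.size(); ++d) {
                dot += q[d] * keys[i][d];
            }
            scores[i] = dot * scale;
            max_score = std::max(max_score, scores[i]);
        }

        float sum = 0.0f;
        for (float & s : scores) {
            s = std::exp(s - max_score);
            sum += s;
        }

        // Weighted sum of the stored values.
        std::vector<float> out(values[0].size(), 0.0f);
        for (size_t i = 0; i < values.size(); ++i) {
            for (size_t d = 0; d < out.size(); ++d) {
                out[d] += (scores[i] / sum) * values[i][d];
            }
        }
        return out;
    }
};
```

Each generation step performs one `append` for the new token and one `attend` for its query, so the per-step projection work stays constant while only the attention read grows with the context length.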

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/src/llama-kv-cache.cpp
Lines: 1-2100

Signature

llama_kv_cache::llama_kv_cache(
        const llama_model & model,
                ggml_type   type_k,
                ggml_type   type_v,
                     bool   v_trans,
                     bool   offload,
                     bool   unified,
                 uint32_t   kv_size,
                 uint32_t   n_seq_max,
                 uint32_t   n_pad,
                 uint32_t   n_swa,
           llama_swa_type   swa_type,
    const layer_filter_cb & filter,
    const  layer_reuse_cb & reuse);

slot_info_vec_t prepare(const std::vector<llama_ubatch> & ubatches);
slot_info find_slot(const llama_ubatch & ubatch, bool cont) const;
void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
bool update(llama_context * lctx, bool do_shift, const stream_copy_info & sc_info);

Import

#include "llama-kv-cache.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	const llama_model &	Yes	The model providing layer configuration and device info
type_k	ggml_type	Yes	Data type for key cache tensors
type_v	ggml_type	Yes	Data type for value cache tensors
kv_size	uint32_t	Yes	Total number of KV cache cells
n_swa	uint32_t	Yes	Sliding window size (0 for no SWA)

Outputs

Name	Type	Description
slot_info_vec_t	std::vector<slot_info>	Cache slot assignments for ubatches
get_k/get_v	ggml_tensor*	Key/value tensor views for graph building

Usage Examples

// The KV cache is created internally by the memory system.
// Prepare slots for a set of ubatches:
auto sinfos = kv_cache->prepare(ubatches);

// Get key/value tensor views for graph construction
// (ctx is the llama_kv_cache_context for the current batch):
ggml_tensor * k = ctx->get_k(ggml_ctx, layer_id);
ggml_tensor * v = ctx->get_v(ggml_ctx, layer_id);

// Sequence operations on the position range [p0, p1); a negative bound is open-ended:
kv_cache->seq_rm(seq_id, 0, -1);          // remove the whole sequence
kv_cache->seq_add(seq_id, 0, -1, shift);  // shift all positions of the sequence by `shift`
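A common way to combine the two sequence operations is a context shift: keep the first tokens of a sequence, drop a block after them, and slide the remaining positions down. The sketch below models this on plain metadata for a single sequence. All names (`toy_cell`, `toy_seq_rm`, `toy_seq_add`, `toy_context_shift`) are hypothetical, and the `shift` field is an assumption standing in for whatever bookkeeping later drives the graph-level K-shift.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-cell record for one sequence.
struct toy_cell {
    int32_t pos   = -1;  // token position; -1 marks an empty cell
    int32_t shift = 0;   // position delta not yet applied to the stored keys
};

// True when pos lies in [p0, p1); a negative bound is treated as open-ended.
static bool toy_in_range(int32_t pos, int32_t p0, int32_t p1) {
    return (p0 < 0 || pos >= p0) && (p1 < 0 || pos < p1);
}

// Free every occupied cell whose position falls in [p0, p1).
void toy_seq_rm(std::vector<toy_cell> & cells, int32_t p0, int32_t p1) {
    for (auto & c : cells) {
        if (c.pos >= 0 && toy_in_range(c.pos, p0, p1)) {
            c = toy_cell{};
        }
    }
}

// Add delta to every position in [p0, p1) and remember it as a pending shift.
// This toy assumes the shifted positions stay non-negative.
void toy_seq_add(std::vector<toy_cell> & cells, int32_t p0, int32_t p1, int32_t delta) {
    for (auto & c : cells) {
        if (c.pos >= 0 && toy_in_range(c.pos, p0, p1)) {
            c.pos   += delta;
            c.shift += delta;
        }
    }
}

// Keep the first n_keep tokens, discard the next n_discard, slide the rest down.
void toy_context_shift(std::vector<toy_cell> & cells, int32_t n_keep, int32_t n_discard) {
    toy_seq_rm (cells, n_keep, n_keep + n_discard);
    toy_seq_add(cells, n_keep + n_discard, -1, -n_discard);
}
```

In this sketch only metadata changes at shift time. The pending `shift` value models why a separate K-shift pass is needed: keys were stored with positional encoding for their old positions, so they would have to be adjusted before being attended to again.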

Related Pages

Principle:Ollama_Ollama_LLM_Memory_Architecture
