Implementation:Ollama Ollama Llama KV Cache

From Leeroopedia
Knowledge Sources
Domains LLM Inference, Memory Management
Last Updated 2025-02-15 00:00 GMT

Overview

Implements the primary KV (key-value) cache for transformer attention, handling cache allocation, slot management, sequence operations, defragmentation, state serialization, and the graph-level K-shift operation.

Description

The constructor allocates per-layer key and value tensors with appropriate backend buffer types, supporting both streaming (per-sequence) and unified modes. The class implements slot finding and assignment for incoming batches, sequence operations (rm, cp, keep, add, div) that manipulate the underlying llama_kv_cells metadata, and defragmentation to compact fragmented cache slots. The llama_kv_cache_context class manages batch-level cache state, including slot allocation, stream copy operations, and attention mask and position generation for the compute graph.
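
The slot-finding and defragmentation ideas described above can be illustrated with a small sketch. It is not the code in llama-kv-cache.cpp: toy_cell, toy_find_slot, and toy_defrag are hypothetical names, each cell belongs to at most one sequence, and only metadata is moved (a real defragmentation pass would also have to move the corresponding key/value tensor data).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for per-cell metadata (not llama_kv_cells).
struct toy_cell {
    int32_t pos;     // token position stored in the cell, -1 if the cell is empty
    int32_t seq_id;  // owning sequence, -1 if the cell is empty
    toy_cell() : pos(-1), seq_id(-1) {}
};

// Return the index of the first run of n_tokens consecutive empty cells,
// or -1 if no such run exists.
int32_t toy_find_slot(const std::vector<toy_cell> & cells, uint32_t n_tokens) {
    if (n_tokens == 0 || n_tokens > cells.size()) {
        return -1;
    }
    uint32_t run = 0;
    for (uint32_t i = 0; i < cells.size(); ++i) {
        run = (cells[i].pos < 0) ? run + 1 : 0;
        if (run == n_tokens) {
            return (int32_t) (i + 1 - n_tokens);
        }
    }
    return -1;
}

// Compact used cells toward the front, preserving their order.
// Returns the number of used cells.
uint32_t toy_defrag(std::vector<toy_cell> & cells) {
    uint32_t dst = 0;
    for (uint32_t src = 0; src < cells.size(); ++src) {
        if (cells[src].pos >= 0) {
            if (dst != src) {
                assert(cells[dst].pos < 0); // destination must be empty
                cells[dst] = cells[src];
                cells[src] = toy_cell();
            }
            ++dst;
        }
    }
    return dst;
}
```

In this toy model, a batch that does not fit into any contiguous free run may fit after toy_defrag has gathered the free cells at the end of the array.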

Usage

The KV cache stores the key/value pairs computed for previous tokens so that autoregressive generation does not recompute them at every step. Every attention-based model routes its key/value storage through this class, which makes it the central memory component of transformer inference.
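
To give a feel for why this cache dominates memory planning, the following is a rough size estimate. It is a sketch under stated assumptions, not the allocation logic of the class: toy_kv_cache_bytes and its parameters are hypothetical, every layer is assumed to be cached, elements are assumed to have a fixed byte size (block-quantized types are sized per row in ggml and are not modelled here), and padding and multiple streams are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Rough estimate of KV cache size in bytes (hypothetical helper).
//   n_layer      number of layers that keep a cache
//   kv_size      number of cache cells
//   n_embd_k/v   key/value elements stored per cell and layer
//   bytes_k/v    bytes per element (e.g. 2 for a 16-bit float type)
uint64_t toy_kv_cache_bytes(uint64_t n_layer, uint64_t kv_size,
                            uint64_t n_embd_k, uint64_t n_embd_v,
                            uint64_t bytes_k, uint64_t bytes_v) {
    assert(bytes_k > 0 && bytes_v > 0);
    const uint64_t k_bytes = n_layer * kv_size * n_embd_k * bytes_k;
    const uint64_t v_bytes = n_layer * kv_size * n_embd_v * bytes_v;
    return k_bytes + v_bytes;
}
```

With the illustrative values n_layer = 32, kv_size = 4096, 1024 key and 1024 value elements per cell, and 2-byte elements, the estimate is 536870912 bytes (512 MiB).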

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/src/llama-kv-cache.cpp
  • Lines: 1-2100

Signature

llama_kv_cache::llama_kv_cache(
        const llama_model & model,
                ggml_type   type_k,
                ggml_type   type_v,
                     bool   v_trans,
                     bool   offload,
                     bool   unified,
                 uint32_t   kv_size,
                 uint32_t   n_seq_max,
                 uint32_t   n_pad,
                 uint32_t   n_swa,
           llama_swa_type   swa_type,
    const layer_filter_cb & filter,
    const  layer_reuse_cb & reuse);

slot_info_vec_t prepare(const std::vector<llama_ubatch> & ubatches);
slot_info find_slot(const llama_ubatch & ubatch, bool cont) const;
void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
bool update(llama_context * lctx, bool do_shift, const stream_copy_info & sc_info);

Import

#include "llama-kv-cache.h"

I/O Contract

Inputs

Name | Type | Required | Description
model | const llama_model & | Yes | The model providing layer configuration and device info
type_k | ggml_type | Yes | Data type for key cache tensors
type_v | ggml_type | Yes | Data type for value cache tensors
kv_size | uint32_t | Yes | Total number of KV cache cells
n_swa | uint32_t | Yes | Sliding window size (0 for no SWA)
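
The role of n_swa can be sketched as a visibility test between a cached key and a query token. This is an illustration only: toy_is_visible is a hypothetical function, it assumes a plain causal mask plus a window in which a key is hidden once the query is n_swa or more positions ahead of it, and it does not model the other variants selectable through swa_type.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when the cached key (k_seq, k_pos) may be attended to by the
// query (q_seq, q_pos). n_swa == 0 disables the sliding window.
bool toy_is_visible(int32_t k_seq, int32_t k_pos,
                    int32_t q_seq, int32_t q_pos,
                    uint32_t n_swa) {
    assert(q_pos >= 0);
    if (k_pos < 0 || k_seq != q_seq) {
        return false; // empty cell or a different sequence
    }
    if (k_pos > q_pos) {
        return false; // causal mask: no attention to future positions
    }
    if (n_swa > 0 && q_pos - k_pos >= (int32_t) n_swa) {
        return false; // key has fallen out of the sliding window
    }
    return true;
}
```

Under these assumptions, a query at position 5 with n_swa = 4 sees keys at positions 2 through 5 of its own sequence, while n_swa = 0 leaves all positions 0 through 5 visible.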

Outputs

Name | Type | Description
slot_info_vec_t | std::vector<slot_info> | Cache slot assignments for ubatches
get_k/get_v | ggml_tensor* | Key/value tensor views for graph building

Usage Examples

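// The KV cache is created internally by the memory system; kv_cache below
// is assumed to point at that llama_kv_cache instance.
// Prepare slots for a set of ubatches:
auto sinfos = kv_cache->prepare(ubatches);

// Get key/value tensor views for graph construction
// (ctx is assumed to be the llama_kv_cache_context for the current batch):
ggml_tensor * k = ctx->get_k(ggml_ctx, layer_id);
ggml_tensor * v = ctx->get_v(ggml_ctx, layer_id);

// Sequence operations on the position range [p0, p1):
kv_cache->seq_rm(seq_id, 0, -1);         // remove the whole sequence (negative p1 = to the end)
kv_cache->seq_add(seq_id, 0, -1, shift); // shift all positions of the sequence by `shift`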
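
The effect of the two sequence operations on cell metadata can be modelled with a small self-contained sketch. It is not the real implementation: toy_kv_cell, toy_seq_rm, and toy_seq_add are hypothetical names, each cell belongs to a single sequence, and the only conventions carried over are the half-open range [p0, p1) and the treatment of negative bounds as "from the start" and "to the end".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified cell metadata: one sequence per cell.
struct toy_kv_cell {
    int32_t pos;     // token position, -1 if the cell is empty
    int32_t seq_id;  // owning sequence, -1 if the cell is empty
};

// Negative bounds mean "from the start" (p0) and "to the end" (p1).
static void toy_normalize_range(int32_t & p0, int32_t & p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
}

static bool toy_in_range(const toy_kv_cell & c, int32_t seq_id, int32_t p0, int32_t p1) {
    return c.pos >= 0 && c.seq_id == seq_id && c.pos >= p0 && c.pos < p1;
}

// Free every cell of seq_id whose position lies in [p0, p1).
// Returns the number of cells freed.
size_t toy_seq_rm(std::vector<toy_kv_cell> & cells, int32_t seq_id, int32_t p0, int32_t p1) {
    toy_normalize_range(p0, p1);
    size_t n_freed = 0;
    for (auto & c : cells) {
        if (toy_in_range(c, seq_id, p0, p1)) {
            c.pos = -1;
            c.seq_id = -1;
            ++n_freed;
        }
    }
    return n_freed;
}

// Add shift to the position of every cell of seq_id in [p0, p1).
// A cell whose position would become negative is freed instead.
void toy_seq_add(std::vector<toy_kv_cell> & cells, int32_t seq_id, int32_t p0, int32_t p1, int32_t shift) {
    toy_normalize_range(p0, p1);
    for (auto & c : cells) {
        if (toy_in_range(c, seq_id, p0, p1)) {
            c.pos += shift;
            if (c.pos < 0) {
                c.pos = -1;
                c.seq_id = -1;
            }
        }
    }
}
```

Removing the oldest positions with toy_seq_rm and then calling toy_seq_add with a negative shift on the remainder mirrors the usual context-shifting pattern: the surviving tokens slide down to fill the freed position range. In this sketch only the metadata changes; in the real cache, position changes are what the graph-level K-shift operation mentioned in the Overview has to account for.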
