

Implementation:Ggml org Llama cpp KV Cache

From Leeroopedia
Knowledge Sources
Domains KV_Cache, Memory
Last Updated 2026-02-15 00:00 GMT

Overview

Implements the `llama_kv_cache` class, which manages key-value attention cache allocation, slot finding, defragmentation, and state serialization.

Description

This file constructs per-layer K and V tensors with appropriate buffer types and sizes, supporting either multi-stream (per-sequence) or unified caching. It implements slot finding (`find_slot`) with contiguous and non-contiguous strategies, ubatch application (`apply_ubatch`) to insert tokens into cache cells, and sequence-level maintenance operations: removal (`seq_rm`), copy (`seq_cp`), keep (`seq_keep`), position shift (`seq_add`), and position division (`seq_div`). It also supports KV cache shifting for context extension, defragmentation to compact used cells, and state save/load for session persistence. The `llama_kv_cache_context` class holds the slot information for each ubatch and coordinates batch processing.
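The contiguous strategy mentioned above can be illustrated with a toy model. The following is a minimal sketch, not the code in `llama-kv-cache.cpp`: the names `toy_cell` and `toy_find_contiguous_slot` are hypothetical, and it assumes a flat array of cells in which a negative position marks an empty cell. It scans from a starting index (`head`), wraps around to index 0 once, and returns the first index of a run of `n_tokens` empty cells.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy cell: pos < 0 means the cell is empty.
struct toy_cell {
    int32_t pos = -1;
    bool is_empty() const { return pos < 0; }
};

// Search for n_tokens consecutive empty cells, starting at `head` and
// wrapping to index 0 once. Returns the first index of the run, or -1 if
// no such run exists.
int32_t toy_find_contiguous_slot(const std::vector<toy_cell> & cells,
                                 uint32_t head, uint32_t n_tokens) {
    const uint32_t n = static_cast<uint32_t>(cells.size());
    if (n_tokens == 0 || n_tokens > n) {
        return -1;
    }
    if (head >= n) {
        head = 0;
    }

    uint32_t tested = 0; // number of cells already ruled out
    while (tested < n) {
        // Not enough room before the end of the buffer: wrap around.
        if (head + n_tokens > n) {
            tested += n - head;
            head = 0;
            continue;
        }

        bool found = true;
        for (uint32_t i = 0; i < n_tokens; ++i) {
            if (!cells[head + i].is_empty()) {
                // Skip past the occupied cell and try again from there.
                found   = false;
                head   += i + 1;
                tested += i + 1;
                break;
            }
        }
        if (found) {
            return static_cast<int32_t>(head);
        }
    }
    return -1;
}
```

A non-contiguous strategy would instead collect any `n_tokens` empty cell indices, which is why the slot information records one cell index per token rather than a single start offset.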

Usage

Use this module as the primary KV cache implementation for transformer-based models. It stores the key and value tensors computed for earlier tokens so that each autoregressive generation step only has to process the new tokens instead of recomputing the whole prefix.
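To make the saving concrete, the sketch below counts how many token positions need their K/V projections computed (per layer) when generating `n_gen` tokens after a prompt of `n_prompt` tokens. This is a simplified counting model under stated assumptions, not a measurement of llama.cpp: both function names are hypothetical, and batching and prompt reuse are ignored.

```cpp
#include <cstdint>

// Without a cache, every generation step reprocesses the whole prefix:
// step i (0-based) handles n_prompt + i token positions.
uint64_t toy_kv_projections_without_cache(uint64_t n_prompt, uint64_t n_gen) {
    uint64_t total = 0;
    for (uint64_t i = 0; i < n_gen; ++i) {
        total += n_prompt + i;
    }
    return total;
}

// With a cache, the prompt is processed once and every later step only
// handles the single token produced by the previous step.
uint64_t toy_kv_projections_with_cache(uint64_t n_prompt, uint64_t n_gen) {
    if (n_gen == 0) {
        return 0;
    }
    return n_prompt + (n_gen - 1);
}
```

In this model the uncached cost grows quadratically with the number of generated tokens, while the cached cost grows linearly.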

Code Reference

Source Location

Signature

llama_kv_cache::llama_kv_cache(
    const llama_model & model,
    ggml_type type_k, ggml_type type_v,
    bool v_trans, bool offload, bool unified,
    uint32_t kv_size, uint32_t n_seq_max, uint32_t n_pad,
    uint32_t n_swa, llama_swa_type swa_type,
    const layer_filter_cb & filter, const layer_reuse_cb & reuse);

// Sequence operations
bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
void llama_kv_cache::seq_keep(llama_seq_id seq_id);
void llama_kv_cache::seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos delta);
void llama_kv_cache::seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d);

// Slot management
bool llama_kv_cache::find_slot(/* ... */);
void llama_kv_cache::apply_ubatch(/* ... */);

// State persistence
void llama_kv_cache::state_write(/* ... */);
void llama_kv_cache::state_read(/* ... */);

Import

#include "llama-kv-cache.h"
#include "llama-impl.h"
#include "llama-io.h"
#include "llama-model.h"
#include "llama-context.h"

I/O Contract

Inputs

Name | Type | Required | Description
model | const llama_model & | Yes | Model defining layer structure and buffer types for K/V tensors
type_k | ggml_type | Yes | Data type for key cache tensors (e.g., F16, Q8_0)
type_v | ggml_type | Yes | Data type for value cache tensors
kv_size | uint32_t | Yes | Total number of cache cells to allocate
n_seq_max | uint32_t | Yes | Maximum number of concurrent sequences
n_swa | uint32_t | No | Sliding window size (0 for no SWA)
swa_type | llama_swa_type | No | Type of sliding window attention
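The inputs `kv_size`, `type_k` and `type_v` largely determine how much memory the cache needs. The sketch below is a rough estimate under stated assumptions, not the allocation logic of the class: `toy_type_traits`, `toy_row_size`, `toy_kv_cache_bytes`, and the parameters `n_layer`, `n_embd_k` and `n_embd_v` (the per-token K and V widths of one layer) are hypothetical names. It assumes every layer has the same K/V width, and it hard-codes the block layouts of F16 (2 bytes per element) and Q8_0 (34 bytes per block of 32 elements).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the size information of a ggml type.
struct toy_type_traits {
    size_t type_size; // bytes per block
    size_t blck_size; // elements per block
};

constexpr toy_type_traits TOY_F16  = { 2, 1 };   // one half-float per element
constexpr toy_type_traits TOY_Q8_0 = { 34, 32 }; // fp16 scale + 32 int8 values

// Bytes needed to store n_elements of the given type
// (n_elements is assumed to be a multiple of blck_size).
constexpr size_t toy_row_size(toy_type_traits t, size_t n_elements) {
    return t.type_size * n_elements / t.blck_size;
}

// Estimated total bytes for the K and V tensors of all layers.
size_t toy_kv_cache_bytes(uint32_t n_layer, uint32_t kv_size,
                          uint32_t n_embd_k, uint32_t n_embd_v,
                          toy_type_traits type_k, toy_type_traits type_v) {
    const size_t per_cell = toy_row_size(type_k, n_embd_k)
                          + toy_row_size(type_v, n_embd_v);
    return static_cast<size_t>(n_layer) * kv_size * per_cell;
}
```

For example, with 32 layers, 4096 cells and a K/V width of 1024 per layer, this estimate gives 512 MiB for F16 and roughly 272 MiB for Q8_0.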

Outputs

Name | Type | Description
k_l | std::vector<ggml_tensor *> | Per-layer key cache tensors
v_l | std::vector<ggml_tensor *> | Per-layer value cache tensors
slot_info | struct slot_info | Cell indices mapping for each ubatch token placement
state | binary data | Serialized cache state for session persistence

Usage Examples

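// Create KV cache (std::make_unique requires <memory>).
// `model`, `kv_size`, `n_seq_max` and `n_pad` are assumed to be defined by the caller.
auto kv_cache = std::make_unique<llama_kv_cache>(
    model, GGML_TYPE_F16, GGML_TYPE_F16,
    /*v_trans=*/true, /*offload=*/true, /*unified=*/false,
    kv_size, n_seq_max, n_pad,
    /*n_swa=*/0, LLAMA_SWA_TYPE_NONE,
    /*filter=*/nullptr, /*reuse=*/nullptr);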

// Manage sequences
kv_cache->seq_rm(seq_id, 0, -1);  // remove all positions for a sequence
kv_cache->seq_cp(0, 1, 0, -1);    // copy sequence 0 to sequence 1
kv_cache->seq_keep(seq_id);        // keep only this sequence
kv_cache->clear(true);             // clear all cached data
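The effect of the sequence calls above can be modelled on a toy cache in which every cell stores a position and the set of sequence ids that share it. This is a minimal sketch for illustration only: `toy_seq_cell`, `toy_seq_rm`, `toy_seq_cp` and `toy_seq_keep` are hypothetical names, and the handling of negative bounds (a negative `p0` means "from the start", a negative `p1` means "to the end") is an assumption chosen to match the `0, -1` arguments used in the example.

```cpp
#include <cstdint>
#include <limits>
#include <set>
#include <vector>

// Hypothetical toy cell: a position plus the sequences that share it.
struct toy_seq_cell {
    int32_t pos = -1;          // -1 marks an empty cell
    std::set<int32_t> seq_ids; // sequence ids that reference this cell
};

// Treat negative bounds of the half-open range [p0, p1) as unbounded.
static void toy_normalize(int32_t & p0, int32_t & p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
}

// Detach seq_id from cells in [p0, p1); free cells no sequence uses anymore.
void toy_seq_rm(std::vector<toy_seq_cell> & cells, int32_t seq_id,
                int32_t p0, int32_t p1) {
    toy_normalize(p0, p1);
    for (auto & c : cells) {
        if (c.pos >= p0 && c.pos < p1 && c.seq_ids.erase(seq_id) > 0) {
            if (c.seq_ids.empty()) {
                c.pos = -1;
            }
        }
    }
}

// Share the cells of seq_src in [p0, p1) with seq_dst (no data is copied).
void toy_seq_cp(std::vector<toy_seq_cell> & cells, int32_t seq_src,
                int32_t seq_dst, int32_t p0, int32_t p1) {
    toy_normalize(p0, p1);
    for (auto & c : cells) {
        if (c.pos >= p0 && c.pos < p1 && c.seq_ids.count(seq_src) > 0) {
            c.seq_ids.insert(seq_dst);
        }
    }
}

// Keep only seq_id: other ids are dropped, cells without seq_id are freed.
void toy_seq_keep(std::vector<toy_seq_cell> & cells, int32_t seq_id) {
    for (auto & c : cells) {
        if (c.seq_ids.count(seq_id) > 0) {
            c.seq_ids = { seq_id };
        } else {
            c.pos = -1;
            c.seq_ids.clear();
        }
    }
}
```

In this model a copy only adds a sequence id to existing cells, so two sequences with a common prefix can share the same cached entries.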

Related Pages
