Implementation:Ggml org Llama cpp KV Cache

Knowledge Sources	Ggml_org_Llama_cpp
Domains	KV_Cache, Memory
Last Updated	2026-02-15 00:00 GMT

Overview

Implements the `llama_kv_cache` class, which manages key-value attention cache allocation, slot finding, defragmentation, and state serialization.

Description

This file constructs the per-layer K and V tensors, choosing buffer types and sizes for each layer, and supports either a unified cache shared by all sequences or a multi-stream mode with one stream per sequence. Slot finding (`find_slot`) locates cells for an incoming ubatch using either a contiguous or a non-contiguous strategy, and `apply_ubatch` then inserts the ubatch's tokens into those cells. Cache maintenance covers the sequence operations `seq_rm` (remove), `seq_cp` (copy), `seq_keep` (keep), `seq_add` (shift positions), and `seq_div` (divide positions). The file also implements KV cache shifting for context extension, defragmentation to compact used cells, and state save/load for session persistence. The `llama_kv_cache_context` class holds the slot information for each ubatch and coordinates batch processing.
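To make the contiguous strategy concrete, here is a minimal sketch of a search for a run of free cells. It is a toy model, not the code in `src/llama-kv-cache.cpp`: the names `toy_cells` and `toy_find_contiguous` are hypothetical, and cell state is reduced to a single used/free flag.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy cache: true means the cell is used, false means it is free.
// (Hypothetical type for illustration only.)
using toy_cells = std::vector<bool>;

// Look for n_tokens consecutive free cells, starting the scan at `head`
// and wrapping to the start of the cache when the window no longer fits.
// Returns the index of the first cell of the run, or -1 if none exists.
int toy_find_contiguous(const toy_cells & used, uint32_t head, uint32_t n_tokens) {
    const uint32_t size = (uint32_t) used.size();
    assert(n_tokens > 0);
    if (n_tokens > size) {
        return -1;
    }

    uint32_t n_tested = 0; // number of cells ruled out so far
    while (n_tested < size) {
        if (head + n_tokens > size) {
            // the window would run past the end: skip the tail and wrap
            n_tested += size - head;
            head = 0;
            continue;
        }
        bool found = true;
        for (uint32_t i = 0; i < n_tokens; ++i) {
            if (used[head + i]) {
                // restart the window just after the used cell
                found = false;
                head     += i + 1;
                n_tested += i + 1;
                break;
            }
        }
        if (found) {
            return (int) head;
        }
    }
    return -1;
}
```

With cells `{used, free, free, used, free}`, a request for 2 tokens starting at head 0 lands at index 1, while a request for 3 tokens fails because no run of three free cells exists. A non-contiguous strategy would instead collect any free cells, which is why fragmentation matters less in that mode.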

Usage

Use this module as the primary KV cache implementation for transformer-based models. It stores the key and value tensors computed for earlier tokens, so each autoregressive decoding step attends over cached entries instead of recomputing them for the whole context.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-kv-cache.cpp
Lines: 1-2268

Signature

llama_kv_cache::llama_kv_cache(
    const llama_model & model,
    ggml_type type_k, ggml_type type_v,
    bool v_trans, bool offload, bool unified,
    uint32_t kv_size, uint32_t n_seq_max, uint32_t n_pad,
    uint32_t n_swa, llama_swa_type swa_type,
    const layer_filter_cb & filter, const layer_reuse_cb & reuse);

// Sequence operations
bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
void llama_kv_cache::seq_keep(llama_seq_id seq_id);
void llama_kv_cache::seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos delta);
void llama_kv_cache::seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d);

// Slot management
bool llama_kv_cache::find_slot(/* ... */);
void llama_kv_cache::apply_ubatch(/* ... */);

// State persistence
void llama_kv_cache::state_write(/* ... */);
void llama_kv_cache::state_read(/* ... */);
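
The position arguments of `seq_add` and `seq_div` can be illustrated with a small sketch. It assumes the usual llama.cpp convention that the range is half-open `[p0, p1)`, that a negative `p0` means 0, and that a negative `p1` means "no upper bound". The functions `toy_seq_add` and `toy_seq_div` are hypothetical and operate on a plain vector of positions for a single sequence, with -1 marking an empty cell. Freeing a cell whose position becomes negative is an assumption of this toy model.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Shift every position in [p0, p1) by delta.
// A cell whose position would become negative is marked empty (-1).
void toy_seq_add(std::vector<int32_t> & pos, int32_t p0, int32_t p1, int32_t delta) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
    for (auto & p : pos) {
        if (p >= p0 && p < p1) {
            p += delta;
            if (p < 0) p = -1;
        }
    }
}

// Integer-divide every position in [p0, p1) by d.
void toy_seq_div(std::vector<int32_t> & pos, int32_t p0, int32_t p1, int d) {
    assert(d > 0);
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
    for (auto & p : pos) {
        if (p >= p0 && p < p1) {
            p /= d;
        }
    }
}
```

For example, after the two oldest tokens are removed from positions `{0, 1, 2, 3, 4, 5}`, calling `toy_seq_add(pos, 2, -1, -2)` moves the survivors down to `0..3`, which is the pattern a context shift relies on. Calling `toy_seq_div(pos, 0, -1, 2)` on `{0, 1, 2, 3}` yields `{0, 0, 1, 1}`.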

Import

#include "llama-kv-cache.h"
#include "llama-impl.h"
#include "llama-io.h"
#include "llama-model.h"
#include "llama-context.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	const llama_model &	Yes	Model defining layer structure and buffer types for K/V tensors
type_k	ggml_type	Yes	Data type for key cache tensors (e.g., F16, Q8_0)
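type_v	ggml_type	Yes	Data type for value cache tensors
v_trans	bool	Yes	Whether the value cache is stored transposed
offload	bool	Yes	Whether cache buffers are offloaded to the device that runs each layer
unified	bool	Yes	Use a single cache shared by all sequences instead of one stream per sequence
kv_size	uint32_t	Yes	Total number of cache cells to allocate
n_seq_max	uint32_t	Yes	Maximum number of concurrent sequences
n_pad	uint32_t	Yes	Padding granularity for the cache size
n_swa	uint32_t	No	Sliding window size (0 for no SWA)
swa_type	llama_swa_type	No	Type of sliding window attention
filter	const layer_filter_cb &	No	Callback that selects which layers receive cache tensors (may be null)
reuse	const layer_reuse_cb &	No	Callback that lets a layer reuse another layer's cache (may be null)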
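
The memory cost implied by `type_k`, `type_v`, and `kv_size` can be estimated with a short sketch. The helper names and the per-layer K/V row widths (`n_embd_k`, `n_embd_v`) are assumptions introduced for illustration, since the real values come from the model. The type sizes are those of ggml: F16 uses 2 bytes per element, and Q8_0 packs 32 elements into a 34-byte block.

```cpp
#include <cassert>
#include <cstdint>

// Storage layout of a tensor type: bytes per block and elements per block.
// (Hypothetical struct; ggml exposes the same information per ggml_type.)
struct toy_type {
    uint64_t block_bytes;
    uint64_t block_elems;
};

constexpr toy_type TOY_F16  = {2, 1};   // 2 bytes per element
constexpr toy_type TOY_Q8_0 = {34, 32}; // 34 bytes per 32 elements

// Bytes needed to store one row of n_elems values.
uint64_t toy_row_bytes(toy_type t, uint64_t n_elems) {
    assert(n_elems % t.block_elems == 0); // rows must be whole blocks
    return n_elems / t.block_elems * t.block_bytes;
}

// Rough total: every layer stores one K row and one V row per cache cell.
uint64_t toy_kv_bytes(uint64_t n_layer, uint64_t kv_size,
                      uint64_t n_embd_k, uint64_t n_embd_v,
                      toy_type type_k, toy_type type_v) {
    const uint64_t per_cell = toy_row_bytes(type_k, n_embd_k)
                            + toy_row_bytes(type_v, n_embd_v);
    return n_layer * kv_size * per_cell;
}
```

For a hypothetical model with 32 layers and K/V rows of 1024 elements each, a 4096-cell cache takes 536,870,912 bytes (512 MiB) in F16 and 285,212,672 bytes (272 MiB) in Q8_0. The estimate ignores alignment and backend overhead.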

Outputs

Name	Type	Description
k_l	std::vector<ggml_tensor *>	Per-layer key cache tensors
v_l	std::vector<ggml_tensor *>	Per-layer value cache tensors
slot_info	struct slot_info	Cell indices mapping for each ubatch token placement
state	binary data	Serialized cache state for session persistence

Usage Examples

// Create KV cache (kv_size, n_seq_max and n_pad are supplied by the caller)
auto kv_cache = std::make_unique<llama_kv_cache>(
    model, GGML_TYPE_F16, GGML_TYPE_F16,
    /*v_trans=*/true, /*offload=*/true, /*unified=*/false,
    kv_size, n_seq_max, n_pad,
    /*n_swa=*/0, LLAMA_SWA_TYPE_NONE,
    /*filter=*/nullptr, /*reuse=*/nullptr);

// Manage sequences (a negative p1 means "up to the end of the sequence")
kv_cache->seq_rm(seq_id, 0, -1);  // remove all positions of seq_id
kv_cache->seq_cp(0, 1, 0, -1);    // copy all of sequence 0 into sequence 1
kv_cache->seq_keep(seq_id);       // drop every sequence except seq_id
kv_cache->clear(true);            // clear all cells; true also clears the tensor data
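
The effect of these calls on cache cells can be sketched with a toy model in which every cell records a position and the set of sequences that reference it. The `toy_cell` struct and the `toy_seq_*` functions are hypothetical and only mirror the behaviour described above. In particular, the rule that a cell is freed once no sequence references it is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <set>
#include <vector>

struct toy_cell {
    int32_t pos = -1;          // -1 marks an empty cell
    std::set<int32_t> seq_ids; // sequences that reference this cell
};

// Half-open range test; negative bounds mean "from 0" and "to the end".
static bool toy_in_range(int32_t pos, int32_t p0, int32_t p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
    return pos >= p0 && pos < p1;
}

// Detach seq_id from cells in [p0, p1); free cells left without a sequence.
void toy_seq_rm(std::vector<toy_cell> & cells, int32_t seq_id, int32_t p0, int32_t p1) {
    for (auto & c : cells) {
        if (toy_in_range(c.pos, p0, p1) && c.seq_ids.erase(seq_id) > 0) {
            if (c.seq_ids.empty()) c.pos = -1;
        }
    }
}

// Share cells of src in [p0, p1) with dst; no tensor data is duplicated.
void toy_seq_cp(std::vector<toy_cell> & cells, int32_t src, int32_t dst, int32_t p0, int32_t p1) {
    for (auto & c : cells) {
        if (toy_in_range(c.pos, p0, p1) && c.seq_ids.count(src) > 0) {
            c.seq_ids.insert(dst);
        }
    }
}

// Keep only seq_id: free every cell that does not belong to it.
void toy_seq_keep(std::vector<toy_cell> & cells, int32_t seq_id) {
    for (auto & c : cells) {
        if (c.seq_ids.count(seq_id) > 0) {
            c.seq_ids = {seq_id};
        } else {
            c.seq_ids.clear();
            c.pos = -1;
        }
    }
}
```

In this model a copy only adds the destination id to existing cells, which is why forking a sequence that shares a prompt prefix is cheap. Removing one of the two sequences afterwards leaves the shared cells in place until the last reference is gone.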

Related Pages

Principle:Ggml_org_Llama_cpp_KVCacheManagement
