Implementation:Ggml org Llama cpp KV Cache Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	KV_Cache, Memory
Last Updated	2026-02-15 00:00 GMT

Overview

Declares the `llama_kv_cache` class and the `llama_kv_cache_context` class, defining the interface for the primary KV attention cache.

Description

The `llama_kv_cache` class implements `llama_memory_i`. It declares two helper structs: `slot_info`, which records which cache cells the tokens of a ubatch map to (and can check whether those cells are contiguous), and `stream_copy_info`, which lists the source and destination streams for copy operations. Its methods cover the cache lifecycle (init_batch, init_full, init_update), sequence manipulation (seq_rm, seq_cp, seq_keep, seq_add, seq_div), graph building (get_k/v, cpy_k/v, build_input_*), slot management (prepare, find_slot, apply_ubatch), and state serialization. The cache supports sliding window attention (SWA) via the n_swa/swa_type parameters, multi-stream mode, and MLA (Multi-head Latent Attention) configurations.
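The token-to-cell mapping and contiguity check that `slot_info` provides can be illustrated with a small stand-alone sketch. `slot_info_sketch` is a hypothetical stand-in written for this page, not the struct from the header; the handling of the empty and multi-stream cases in particular is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for llama_kv_cache::slot_info (illustration only).
struct slot_info_sketch {
    using idx_vec_t = std::vector<uint32_t>;

    // idxs[s][i] is the cache cell that token i of stream s is written to.
    std::vector<idx_vec_t> idxs;

    // First cell index; only meaningful for a single stream with at least one token.
    uint32_t head() const {
        assert(idxs.size() == 1 && !idxs[0].empty());
        return idxs[0][0];
    }

    // Number of tokens per stream.
    size_t size() const {
        assert(!idxs.empty());
        return idxs[0].size();
    }

    // True when the cells form the run head(), head() + 1, ..., head() + size() - 1.
    bool is_contiguous() const {
        if (idxs.empty() || idxs[0].empty()) {
            return true;  // assumption: nothing to place counts as contiguous
        }
        if (idxs.size() > 1) {
            return false; // assumption: multi-stream slots are treated as non-contiguous
        }
        const uint32_t h = idxs[0][0];
        for (size_t i = 0; i < idxs[0].size(); ++i) {
            if (idxs[0][i] != h + i) {
                return false;
            }
        }
        return true;
    }
};
```

With cells {5, 6, 7} the sketch reports a head of 5, a size of 3, and a contiguous slot; with cells {5, 7, 8} it reports a gap. Whether a slot is contiguous matters because a contiguous run can be addressed as one range starting at the head, while scattered cells need per-token indices.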

Usage

Include this header when working with the KV cache subsystem. It defines the KV cache abstraction that transformer-based models depend on for efficient inference with past context reuse.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-kv-cache.h
Lines: 1-388

Signature

class llama_kv_cache : public llama_memory_i {
public:
    struct stream_copy_info {
        bool empty() const;
        std::vector<uint32_t> ssrc;
        std::vector<uint32_t> sdst;
    };

    struct slot_info {
        using idx_vec_t = std::vector<uint32_t>;
        uint32_t s0, s1;
        std::vector<llama_seq_id> strm;
        std::vector<idx_vec_t> idxs;
        uint32_t head() const;
        size_t size() const;
        size_t n_stream() const;
        bool is_contiguous() const;
    };

    // Constructor, sequence ops, slot management, state persistence...
};

class llama_kv_cache_context {
    // Manages per-ubatch slot info and batch processing coordination
};

Import

#pragma once
#include "llama-batch.h"
#include "llama-graph.h"
#include "llama-kv-cells.h"
#include "llama-memory.h"
#include <unordered_map>
#include <vector>

I/O Contract

Inputs

Name	Type	Required	Description
model	const llama_model &	Yes	Model definition for layer structure and buffer types
type_k	ggml_type	Yes	Key tensor data type
type_v	ggml_type	Yes	Value tensor data type
kv_size	uint32_t	Yes	Number of cache cells
n_swa	uint32_t	No	Sliding window size for SWA-enabled layers
swa_type	llama_swa_type	No	Sliding window attention type
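To make the role of `n_swa` concrete, the following is a minimal sketch of a standard sliding-window test. The helper name `outside_window`, the convention that `n_swa == 0` disables the window, and the exact boundary condition are assumptions made for illustration; the behaviour selected by each `llama_swa_type` value is defined in the library and may differ.

```cpp
#include <cstdint>

// Hypothetical helper (not part of llama.cpp): returns true when the key at
// position p_key lies outside the sliding window of the query at p_query.
bool outside_window(uint32_t n_swa, int32_t p_key, int32_t p_query) {
    if (n_swa == 0) {
        return false; // assumption: a window size of 0 means "no window"
    }
    // Assumption: the window covers the n_swa most recent positions,
    // counting the query position itself.
    return p_query - p_key >= static_cast<int32_t>(n_swa);
}
```

Under these assumptions, a window of 4 with the query at position 4 keeps positions 1 through 4 and drops position 0. Cells that fall outside the window of every future query no longer need to stay in an SWA layer's cache, which is why the window size is passed when the cache is built.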

Outputs

Name	Type	Description
slot_info	struct slot_info	Maps tokens to cache cell indices with contiguity metadata
stream_copy_info	struct stream_copy_info	Source and destination streams for copy operations
k_l/v_l	ggml_tensor *	Per-layer key and value cache tensors
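The pairing of `ssrc` and `sdst` in `stream_copy_info` can be sketched as two parallel vectors read index by index. `stream_copy_sketch` and `pending_copies` are hypothetical names used only for this illustration, and the assumption is that entry i of `ssrc` is copied to entry i of `sdst`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for llama_kv_cache::stream_copy_info (illustration only).
struct stream_copy_sketch {
    std::vector<uint32_t> ssrc; // source stream ids
    std::vector<uint32_t> sdst; // destination stream ids, paired by index with ssrc

    // True when no copies are queued; the two vectors must stay the same length.
    bool empty() const {
        assert(ssrc.size() == sdst.size());
        return ssrc.empty();
    }
};

// Collect the (source, destination) pairs that a copy pass would visit, in order.
std::vector<std::pair<uint32_t, uint32_t>> pending_copies(const stream_copy_sketch & sc) {
    std::vector<std::pair<uint32_t, uint32_t>> out;
    if (sc.empty()) {
        return out;
    }
    for (size_t i = 0; i < sc.ssrc.size(); ++i) {
        out.emplace_back(sc.ssrc[i], sc.sdst[i]);
    }
    return out;
}
```

For example, `ssrc = {0, 0}` with `sdst = {1, 2}` would describe copying stream 0 into streams 1 and 2. Keeping the pairs in a small struct lets the caller test `empty()` once and skip the copy pass entirely when nothing is queued.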

Usage Examples

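// Slot info usage
// sinfo is shown default-constructed for brevity. Populate it through the
// cache's slot management (e.g. find_slot) before querying it, since an
// empty slot_info has no cell indices to report.
llama_kv_cache::slot_info sinfo;
uint32_t head_pos = sinfo.head();        // first cell index
bool contiguous = sinfo.is_contiguous(); // whether the cell indices form one contiguous run
size_t n_tokens = sinfo.size();          // number of tokens in the slot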

// Stream copy info
llama_kv_cache::stream_copy_info copy_info;
if (!copy_info.empty()) {
    // perform stream copy operations
}

Related Pages

Principle:Ggml_org_Llama_cpp_KVCacheManagement
