Implementation:Ggml org Llama cpp KV Cache Header

From Leeroopedia
Revision as of 12:40, 16 February 2026 by Admin (auto-imported from implementations/Ggml_org_Llama_cpp_KV_Cache_Header.md)
Knowledge Sources
Domains: KV_Cache, Memory
Last Updated: 2026-02-15 00:00 GMT

Overview

Declares the `llama_kv_cache` class and the `llama_kv_cache_context` class, defining the interface for the primary KV attention cache.

Description

`llama_kv_cache` implements `llama_memory_i`. It defines two helper structs: `slot_info`, which records the cache cells that the tokens of a ubatch map to and can report whether those cells are contiguous, and `stream_copy_info`, which lists source and destination streams for stream copy operations. Its methods cover cache lifecycle (init_batch, init_full, init_update), sequence manipulation (seq_rm, seq_cp, seq_keep, seq_add, seq_div), graph building (get_k/v, cpy_k/v, build_input_*), slot management (prepare, find_slot, apply_ubatch), and state serialization. The cache supports sliding window attention (SWA) through the n_swa and swa_type parameters, multi-stream mode, and MLA (multi-head latent attention) configurations.
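
To make the contiguity check concrete, here is a minimal sketch built on a simplified stand-in for `slot_info`. The type `toy_slot_info` is hypothetical and only mirrors the `idxs`, `head`, `size`, and `is_contiguous` names shown on this page; the real implementation may differ in its details.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for llama_kv_cache::slot_info.
// Each inner vector holds the cell indices chosen for one stream's tokens.
struct toy_slot_info {
    std::vector<std::vector<uint32_t>> idxs;

    // First cell index; only meaningful for a single, non-empty stream.
    uint32_t head() const {
        assert(idxs.size() == 1 && !idxs[0].empty());
        return idxs[0][0];
    }

    // Number of tokens placed in the first stream.
    std::size_t size() const {
        assert(!idxs.empty());
        return idxs[0].size();
    }

    // True when every stream's indices form a run: first, first+1, first+2, ...
    bool is_contiguous() const {
        for (const auto & v : idxs) {
            for (std::size_t i = 1; i < v.size(); ++i) {
                if (v[i] != v[0] + i) {
                    return false;
                }
            }
        }
        return true;
    }
};
```

In this toy model, indices 4, 5, 6 count as contiguous while 4, 6, 7 do not. A contiguous slot can be addressed by its head and length alone, whereas a scattered one needs the full index list.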

Usage

Include this header when working with the KV cache subsystem. It defines the KV cache abstraction that transformer-based models rely on to reuse the keys and values computed for earlier tokens instead of recomputing them at every decoding step.
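
As an illustration of how sequence operations support context reuse, the sketch below models cache cells as a position plus a set of sequence ids. The types `toy_cell` and `toy_cache` are hypothetical, and the two methods only borrow the names `seq_rm` and `seq_cp` from this page. Treat it as a sketch of the idea, not the library's behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical toy cell: a token position plus the sequences that reference it.
struct toy_cell {
    int32_t pos = -1;           // -1 marks an empty cell
    std::set<int32_t> seq_ids;  // sequences sharing this cell
};

struct toy_cache {
    std::vector<toy_cell> cells;

    // Drop positions [p0, p1) from one sequence; free cells nobody references.
    void seq_rm(int32_t seq, int32_t p0, int32_t p1) {
        assert(p0 <= p1);
        for (auto & c : cells) {
            if (c.pos >= p0 && c.pos < p1) {
                c.seq_ids.erase(seq);
                if (c.seq_ids.empty()) {
                    c.pos = -1;
                }
            }
        }
    }

    // Let seq_dst share every cell that seq_src references (no data is copied).
    void seq_cp(int32_t seq_src, int32_t seq_dst) {
        for (auto & c : cells) {
            if (c.seq_ids.count(seq_src) > 0) {
                c.seq_ids.insert(seq_dst);
            }
        }
    }

    // Number of cells currently referenced by a sequence.
    int count(int32_t seq) const {
        int n = 0;
        for (const auto & c : cells) {
            if (c.seq_ids.count(seq) > 0) {
                ++n;
            }
        }
        return n;
    }
};
```

In this model, copying a sequence lets a second sequence share an already computed prefix, and removing a position range only frees a cell once no sequence refers to it.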

Code Reference

Source Location

Signature

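class llama_kv_cache : public llama_memory_i {
public:
    // Source and destination streams for copy operations
    struct stream_copy_info {
        bool empty() const;
        std::vector<uint32_t> ssrc; // source streams
        std::vector<uint32_t> sdst; // destination streams
    };

    // Maps the tokens of a ubatch to cache cell indices
    struct slot_info {
        using idx_vec_t = std::vector<uint32_t>;
        uint32_t s0, s1;
        std::vector<llama_seq_id> strm;
        std::vector<idx_vec_t> idxs;
        uint32_t head() const;       // first cell index
        size_t size() const;         // number of tokens in the slot
        size_t n_stream() const;
        bool is_contiguous() const;  // whether the cells are contiguous
    };

    // Constructor, sequence ops, slot management, state persistence...
};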

class llama_kv_cache_context {
    // Manages per-ubatch slot info and batch processing coordination
};

Import

#pragma once
#include "llama-batch.h"
#include "llama-graph.h"
#include "llama-kv-cells.h"
#include "llama-memory.h"
#include <unordered_map>
#include <vector>

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| `model` | `const llama_model &` | Yes | Model definition for layer structure and buffer types |
| `type_k` | `ggml_type` | Yes | Key tensor data type |
| `type_v` | `ggml_type` | Yes | Value tensor data type |
| `kv_size` | `uint32_t` | Yes | Number of cache cells |
| `n_swa` | `uint32_t` | No | Sliding window size for SWA-enabled layers |
| `swa_type` | `llama_swa_type` | No | Sliding window attention type |
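
To show what a sliding window size means for attention, here is a minimal sketch of one common window rule. The helper `toy_is_masked_swa` is hypothetical and not a llama.cpp function. It assumes a plain causal window in which a key is hidden once it lies `n_swa` or more positions behind the query, and it treats `n_swa == 0` as "no window". The variants selected by `swa_type` may use different rules.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns true when the key at p_key is outside the
// sliding window of the query at p_query (assumed plain causal window rule).
inline bool toy_is_masked_swa(uint32_t n_swa, int32_t p_key, int32_t p_query) {
    // Assumption for this sketch: causal attention, keys never lie ahead of the query.
    assert(p_key <= p_query);
    if (n_swa == 0) {
        return false; // treated here as "window disabled"
    }
    return p_query - p_key >= static_cast<int32_t>(n_swa);
}
```

With `n_swa = 4`, a query at position 10 sees keys at positions 7 through 10 under this rule, and cells holding older positions become candidates for reuse.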

Outputs

| Name | Type | Description |
|------|------|-------------|
| `slot_info` | `struct slot_info` | Maps tokens to cache cell indices with contiguity metadata |
| `stream_copy_info` | `struct stream_copy_info` | Source and destination streams for copy operations |
| `k_l`/`v_l` | `ggml_tensor *` | Per-layer key and value cache tensors |
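
For a sense of scale, the per-layer key and value tensors can be sized with simple arithmetic. The sketch below is a rough estimate under stated assumptions: `toy_kv_bytes` and its parameters `n_layer`, `n_embd_k`, `n_embd_v`, `bytes_k`, and `bytes_v` are hypothetical names. It assumes every layer has a cache of the same shape and that the element types are unquantized with a fixed byte size, which block-quantized types do not have.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical estimate of total K+V storage in bytes.
// n_embd_k / n_embd_v: assumed per-token key/value widths in elements.
// bytes_k / bytes_v:   assumed bytes per element (e.g. 2 for a 16-bit float).
inline std::size_t toy_kv_bytes(uint32_t n_layer, uint32_t kv_size,
                                uint32_t n_embd_k, uint32_t n_embd_v,
                                std::size_t bytes_k, std::size_t bytes_v) {
    assert(bytes_k > 0 && bytes_v > 0);
    // Storage needed by one cache cell in one layer.
    const std::size_t per_cell = n_embd_k * bytes_k + n_embd_v * bytes_v;
    return static_cast<std::size_t>(n_layer) * kv_size * per_cell;
}
```

For example, 32 layers, 4096 cells, and 1024-element keys and values at 2 bytes each give 32 × 4096 × 4096 = 536,870,912 bytes, or 512 MiB.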

Usage Examples

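// Slot info usage (inspect_slot is a hypothetical helper).
// sinfo is expected to have been filled in by the cache, for example by
// find_slot; a default-constructed slot_info holds no cell indices to query.
static void inspect_slot(const llama_kv_cache::slot_info & sinfo) {
    uint32_t head_pos   = sinfo.head();          // first cell index
    bool     contiguous = sinfo.is_contiguous(); // check if cells are contiguous
    size_t   n_tokens   = sinfo.size();          // number of tokens in slot
}

// Stream copy info (maybe_copy_streams is a hypothetical helper).
static void maybe_copy_streams(const llama_kv_cache::stream_copy_info & copy_info) {
    if (!copy_info.empty()) {
        // perform stream copy operations
    }
}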
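
The stream copy step left as a comment above could look roughly like the following sketch. It uses a hypothetical stand-in `toy_stream_copy_info` with the same `ssrc` and `sdst` fields and models each stream as a plain vector of floats, which is an assumption made only for illustration; the real cache operates on backend tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in mirroring the two fields of stream_copy_info.
struct toy_stream_copy_info {
    std::vector<uint32_t> ssrc; // source stream indices
    std::vector<uint32_t> sdst; // destination stream indices
    bool empty() const { return ssrc.empty(); }
};

// Copy whole stream buffers pairwise: streams[sdst[i]] = streams[ssrc[i]].
inline void toy_apply_stream_copies(const toy_stream_copy_info & sc,
                                    std::vector<std::vector<float>> & streams) {
    assert(sc.ssrc.size() == sc.sdst.size());
    for (std::size_t i = 0; i < sc.ssrc.size(); ++i) {
        assert(sc.ssrc[i] < streams.size() && sc.sdst[i] < streams.size());
        streams[sc.sdst[i]] = streams[sc.ssrc[i]];
    }
}
```

Keeping sources and destinations in two parallel vectors lets several copies be queued and applied in one pass, and `empty()` makes the common "nothing to copy" case cheap to detect.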
