Implementation:Ggml org Llama cpp KV Cache Header
| Knowledge Sources | |
|---|---|
| Domains | KV_Cache, Memory |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares the `llama_kv_cache` class and the `llama_kv_cache_context` class, defining the interface for the primary KV attention cache.
Description
`llama_kv_cache` implements the `llama_memory_i` interface. It defines two helper structs: `slot_info`, which records the cache cells that the tokens of a ubatch map to and can check whether those cells are contiguous, and `stream_copy_info`, which lists the source and destination streams for stream copy operations. Its methods cover cache lifecycle (`init_batch`, `init_full`, `init_update`), sequence manipulation (`seq_rm`, `seq_cp`, `seq_keep`, `seq_add`, `seq_div`), graph building (`get_k`/`get_v`, `cpy_k`/`cpy_v`, `build_input_*`), slot management (`prepare`, `find_slot`, `apply_ubatch`), and state serialization. The cache supports sliding window attention (SWA) through the `n_swa` and `swa_type` parameters, multi-stream mode, and MLA (multi-head latent attention) configurations.
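To make the sequence operations more concrete, here is a minimal sketch of what `seq_rm` and `seq_add` do conceptually. It is not the header's implementation: `toy_cell`, `toy_seq_rm`, and `toy_seq_add` are hypothetical names, each toy cell belongs to at most one sequence (an assumed simplification), and ranges are treated as half-open `[p0, p1)`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy cell: one token position and one owning sequence.
// A value of -1 marks the cell as free.
struct toy_cell {
    int32_t pos = -1;
    int32_t seq = -1;
};

// Free every cell of `seq` whose position lies in [p0, p1).
inline void toy_seq_rm(std::vector<toy_cell> & cells, int32_t seq, int32_t p0, int32_t p1) {
    assert(p0 <= p1);
    for (toy_cell & c : cells) {
        if (c.seq == seq && c.pos >= p0 && c.pos < p1) {
            c = toy_cell{}; // reset to the free state
        }
    }
}

// Shift the positions of `seq` that lie in [p0, p1) by `shift`.
inline void toy_seq_add(std::vector<toy_cell> & cells, int32_t seq, int32_t p0, int32_t p1, int32_t shift) {
    assert(p0 <= p1);
    for (toy_cell & c : cells) {
        if (c.seq == seq && c.pos >= p0 && c.pos < p1) {
            c.pos += shift;
        }
    }
}
```

Removing a range and then shifting the remaining tail back with a negative `shift` is the usual way to drop a span of context while keeping positions consecutive.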
Usage
Include this header when working with the KV cache subsystem. It defines the KV cache abstraction that transformer-based models depend on for efficient inference with past context reuse.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-kv-cache.h
- Lines: 1-388
Signature
class llama_kv_cache : public llama_memory_i {
public:
    // Source and destination streams for stream copy operations
    struct stream_copy_info {
        bool empty() const;
        std::vector<uint32_t> ssrc;
        std::vector<uint32_t> sdst;
    };
    // Maps the tokens of a ubatch to cache cell indices
    struct slot_info {
        using idx_vec_t = std::vector<uint32_t>;
        uint32_t s0, s1;                  // first and last stream
        std::vector<llama_seq_id> strm;   // one entry per stream
        std::vector<idx_vec_t> idxs;      // cell indices, one vector per stream
        uint32_t head() const;
        size_t size() const;
        size_t n_stream() const;
        bool is_contiguous() const;
    };
    // Constructor, sequence ops, slot management, state persistence...
};
class llama_kv_cache_context : public llama_memory_context_i {
    // Manages per-ubatch slot info and batch processing coordination
};
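The contiguity check declared on `slot_info` can be illustrated with a standalone sketch. `slot_info_sketch` is a hypothetical single-stream stand-in, not the real struct, and the logic below is an assumption about what "contiguous" means here: token `i` lands in cell `head() + i`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-stream stand-in for llama_kv_cache::slot_info.
struct slot_info_sketch {
    std::vector<uint32_t> idxs; // cache cell index for each ubatch token

    // First cell of the slot; requires at least one token.
    uint32_t head() const {
        assert(!idxs.empty());
        return idxs[0];
    }

    // True when token i is stored in cell head() + i for every i.
    bool is_contiguous() const {
        if (idxs.empty()) {
            return true;
        }
        const uint32_t h = idxs[0];
        for (size_t i = 0; i < idxs.size(); ++i) {
            if (idxs[i] != h + i) {
                return false;
            }
        }
        return true;
    }
};
```

A contiguous slot can be described by its head and length alone, whereas a non-contiguous slot needs the full index vector to say where each token goes.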
Import
#pragma once
#include "llama-batch.h"
#include "llama-graph.h"
#include "llama-kv-cells.h"
#include "llama-memory.h"
#include <unordered_map>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model definition for layer structure and buffer types |
| type_k | ggml_type | Yes | Key tensor data type |
| type_v | ggml_type | Yes | Value tensor data type |
| kv_size | uint32_t | Yes | Number of cache cells |
| n_swa | uint32_t | No | Sliding window size for SWA-enabled layers |
| swa_type | llama_swa_type | No | Sliding window attention type |
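As a rough illustration of how `n_swa` and `swa_type` could interact, the sketch below decides whether a cached position falls outside the sliding window of a query position. The enum `swa_kind_sketch` and the function `is_masked_swa_sketch` are hypothetical, and the window rule (a query attends to the most recent `n_swa` positions, itself included) is an assumption rather than something this page states.

```cpp
#include <cstdint>

// Hypothetical stand-in for llama_swa_type; the real enum has its own values.
enum class swa_kind_sketch { none, standard };

// Returns true when cached position p0 lies outside the window of query position p1.
inline bool is_masked_swa_sketch(swa_kind_sketch kind, uint32_t n_swa, int32_t p0, int32_t p1) {
    if (kind == swa_kind_sketch::none) {
        return false; // no sliding window: nothing is masked by this rule
    }
    // Assumed rule: positions at distance n_swa or more behind the query are masked.
    return p1 - p0 >= static_cast<int32_t>(n_swa);
}
```

Cells that are masked for every future query are candidates for reuse, which is why a cache for SWA layers may get by with fewer cells than a full-context cache.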
Outputs
| Name | Type | Description |
|---|---|---|
| slot_info | struct slot_info | Maps tokens to cache cell indices with contiguity metadata |
| stream_copy_info | struct stream_copy_info | Source and destination streams for copy operations |
| k_l/v_l | ggml_tensor * | Per-layer key and value cache tensors |
Usage Examples
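// Slot info usage. A slot_info is normally produced by the cache's slot
// management (e.g. find_slot); it is filled in by hand here only so that the
// accessors have data to read. head() and size() expect a populated slot.
llama_kv_cache::slot_info sinfo;
sinfo.s0 = 0;
sinfo.s1 = 0;                              // a single stream
sinfo.strm = { 0 };
sinfo.idxs = { { 4, 5, 6 } };              // three tokens placed in cells 4, 5, 6
uint32_t head_pos = sinfo.head();          // first cell index
bool contiguous = sinfo.is_contiguous();   // whether the cells form one consecutive run
size_t n_tokens = sinfo.size();            // number of cells (tokens) per stream
// Stream copy info
llama_kv_cache::stream_copy_info copy_info;
if (!copy_info.empty()) {
    // perform stream copy operations for each (ssrc[i], sdst[i]) pair
}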
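The way `ssrc` and `sdst` pair up can be shown with a small standalone sketch. `stream_copy_sketch`, `stream_data`, and `apply_stream_copies` are hypothetical names, and a stream is modelled as a plain vector of integers instead of cache tensors; the only assumption taken from the header is that entry `i` of `ssrc` is copied into entry `i` of `sdst`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for llama_kv_cache::stream_copy_info.
struct stream_copy_sketch {
    std::vector<uint32_t> ssrc; // source stream ids
    std::vector<uint32_t> sdst; // destination stream ids, paired with ssrc by index

    bool empty() const {
        assert(ssrc.size() == sdst.size());
        return ssrc.empty();
    }
};

// Toy model of a stream's contents.
using stream_data = std::vector<int>;

// Copy each source stream over its paired destination stream.
inline void apply_stream_copies(const stream_copy_sketch & sc, std::vector<stream_data> & streams) {
    if (sc.empty()) {
        return; // nothing scheduled
    }
    for (size_t i = 0; i < sc.ssrc.size(); ++i) {
        // at() throws on an out-of-range stream id instead of corrupting memory
        streams.at(sc.sdst[i]) = streams.at(sc.ssrc[i]);
    }
}
```

Keeping the two vectors the same length is what makes the index-wise pairing well defined, which is why `empty()` asserts it before reporting.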