Implementation:Ggml org Llama cpp KV Cache
| Knowledge Sources | |
|---|---|
| Domains | KV_Cache, Memory |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements the `llama_kv_cache` class, which manages key-value attention cache allocation, slot finding, defragmentation, and state serialization.
Description
This file constructs the per-layer K and V tensors, choosing a buffer type and size for each, and supports either a unified cache shared by all sequences or a multi-stream layout with one stream per sequence. `find_slot` locates free cells for a ubatch using either a contiguous or a non-contiguous strategy, and `apply_ubatch` inserts the ubatch's tokens into the chosen cells. Cache maintenance covers the sequence operations (remove, copy, keep, position shift, position divide), KV cache shifting for context extension, defragmentation to compact used cells, and state save/load for session persistence. The `llama_kv_cache_context` class holds the slot information for each ubatch and coordinates batch processing.
Usage
Use this module as the KV cache for transformer-based models. During autoregressive generation it stores the key and value vectors already computed for earlier tokens, so each decoding step avoids recomputing them.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-kv-cache.cpp
- Lines: 1-2268
Signature
llama_kv_cache::llama_kv_cache(
    const llama_model & model,
    ggml_type type_k, ggml_type type_v,
    bool v_trans, bool offload, bool unified,
    uint32_t kv_size, uint32_t n_seq_max, uint32_t n_pad,
    uint32_t n_swa, llama_swa_type swa_type,
    const layer_filter_cb & filter, const layer_reuse_cb & reuse);
// Sequence operations
bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
void llama_kv_cache::seq_keep(llama_seq_id seq_id);
void llama_kv_cache::seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos delta);
void llama_kv_cache::seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d);
// Slot management
llama_kv_cache::slot_info llama_kv_cache::find_slot(/* ... */);
void llama_kv_cache::apply_ubatch(/* ... */);
// State persistence
void llama_kv_cache::state_write(/* ... */);
void llama_kv_cache::state_read(/* ... */);
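The semantics of the sequence operations listed above can be sketched on a toy cell array. This is an illustration under stated assumptions, not the library's implementation: `seq_cell` and the `toy_*` functions are hypothetical, each cell holds a position and a set of sequence ids, the range is half-open `[p0, p1)`, and a negative bound is treated as open-ended.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <set>
#include <vector>

// Hypothetical toy cell: a position plus the set of sequences that reference it.
struct seq_cell {
    int32_t pos = -1;          // -1 means the cell is free
    std::set<int32_t> seq;
};

// Assumption of this sketch: negative bounds mean "from the start" / "to the end".
static void toy_normalize(int32_t & p0, int32_t & p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
}

// Remove seq_id from every cell with pos in [p0, p1); free cells left without sequences.
void toy_seq_rm(std::vector<seq_cell> & cells, int32_t seq_id, int32_t p0, int32_t p1) {
    assert(seq_id >= 0); // this toy only handles a concrete, non-negative sequence id
    toy_normalize(p0, p1);
    for (auto & c : cells) {
        if (c.pos >= p0 && c.pos < p1 && c.seq.erase(seq_id) > 0 && c.seq.empty()) {
            c.pos = -1;
        }
    }
}

// Make dst share every cell of src with pos in [p0, p1); no data is duplicated.
void toy_seq_cp(std::vector<seq_cell> & cells, int32_t src, int32_t dst, int32_t p0, int32_t p1) {
    if (src == dst) return;
    toy_normalize(p0, p1);
    for (auto & c : cells) {
        if (c.pos >= p0 && c.pos < p1 && c.seq.count(src) > 0) {
            c.seq.insert(dst);
        }
    }
}

// Drop every sequence except seq_id; cells not used by seq_id become free.
void toy_seq_keep(std::vector<seq_cell> & cells, int32_t seq_id) {
    for (auto & c : cells) {
        if (c.seq.count(seq_id) > 0) {
            c.seq = {seq_id};
        } else {
            c.seq.clear();
            c.pos = -1;
        }
    }
}

// Count the cells that are still in use.
size_t toy_used(const std::vector<seq_cell> & cells) {
    size_t n = 0;
    for (const auto & c : cells) {
        n += c.pos >= 0 ? 1 : 0;
    }
    return n;
}
```

Modelling the copy as shared membership rather than duplicated cells is a design choice of the sketch; it keeps a shared prompt prefix in one place while several sequences continue from it.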
Import
#include "llama-kv-cache.h"
#include "llama-impl.h"
#include "llama-io.h"
#include "llama-model.h"
#include "llama-context.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model defining layer structure and buffer types for K/V tensors |
| type_k | ggml_type | Yes | Data type for key cache tensors (e.g., F16, Q8_0) |
| type_v | ggml_type | Yes | Data type for value cache tensors |
| kv_size | uint32_t | Yes | Total number of cache cells to allocate |
| n_seq_max | uint32_t | Yes | Maximum number of concurrent sequences |
| n_swa | uint32_t | No | Sliding window size (0 for no SWA) |
| swa_type | llama_swa_type | No | Type of sliding window attention |
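To see how `type_k`, `type_v`, and `kv_size` drive memory use, a rough estimate can be computed by hand. The sketch below assumes each layer stores `kv_size` rows of `n_embd_k_gqa` key elements and `n_embd_v_gqa` value elements; those two parameter names, `toy_type`, and the helper functions are assumptions of this example, and padding, multiple streams, and per-layer differences are ignored. The block sizes used are the usual ones: F16 takes 2 bytes per element and Q8_0 packs 32 elements into 34 bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a tensor type: elements per block and bytes per block.
struct toy_type {
    uint64_t block_elems;
    uint64_t block_bytes;
};

const toy_type TOY_F16  = {1, 2};    // 2 bytes per element
const toy_type TOY_Q8_0 = {32, 34};  // 32 int8 values plus one 16-bit scale

// Bytes needed for one row of n_elems elements of type t.
inline uint64_t toy_row_bytes(toy_type t, uint64_t n_elems) {
    assert(n_elems % t.block_elems == 0); // rows must be a whole number of blocks
    return n_elems / t.block_elems * t.block_bytes;
}

// Rough total size of the K and V tensors across all layers.
inline uint64_t toy_kv_bytes(uint64_t n_layer, uint64_t kv_size,
                             uint64_t n_embd_k_gqa, uint64_t n_embd_v_gqa,
                             toy_type type_k, toy_type type_v) {
    return n_layer * kv_size *
           (toy_row_bytes(type_k, n_embd_k_gqa) + toy_row_bytes(type_v, n_embd_v_gqa));
}
```

For a hypothetical model with 32 layers, 1024 K and V elements per token per layer, and `kv_size = 4096`, this gives 536870912 bytes (512 MiB) with F16 and 285212672 bytes (272 MiB) with Q8_0 for both K and V.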
Outputs
| Name | Type | Description |
|---|---|---|
| k_l | std::vector<ggml_tensor *> | Per-layer key cache tensors |
| v_l | std::vector<ggml_tensor *> | Per-layer value cache tensors |
| slot_info | struct slot_info | Cell indices mapping for each ubatch token placement |
| state | binary data | Serialized cache state for session persistence |
Usage Examples
// Assumes model, kv_size, n_seq_max, n_pad, and seq_id are defined by the caller.
// Create the KV cache (std::make_unique requires <memory>)
auto kv_cache = std::make_unique<llama_kv_cache>(
    model, GGML_TYPE_F16, GGML_TYPE_F16,
    /*v_trans=*/true, /*offload=*/true, /*unified=*/false,
    kv_size, n_seq_max, n_pad,
    /*n_swa=*/0, LLAMA_SWA_TYPE_NONE,
    /*filter=*/nullptr, /*reuse=*/nullptr);
// Manage sequences (a negative p1 means "up to the end of the sequence")
kv_cache->seq_rm(seq_id, 0, -1);  // remove all positions of seq_id
kv_cache->seq_cp(0, 1, 0, -1);    // copy all of sequence 0 into sequence 1
kv_cache->seq_keep(seq_id);       // drop every sequence except seq_id
kv_cache->clear(true);            // clear all cells and the cached data
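The position operations `seq_add` and `seq_div` are not shown in the example above. A minimal sketch of their effect on a single sequence's positions follows; `toy_seq_add`, `toy_seq_div`, and `toy_context_shift` are hypothetical helpers operating on a plain vector of positions (with -1 marking a free cell), and the discard-then-shift pattern is one common way to extend context, not necessarily what this file does internally.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Add delta to every position in [p0, p1); a negative p1 means "to the end".
// In this toy, a cell whose position would become negative is freed.
void toy_seq_add(std::vector<int32_t> & pos, int32_t p0, int32_t p1, int32_t delta) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = INT32_MAX;
    for (auto & p : pos) {
        if (p >= p0 && p < p1) {
            p += delta;
            if (p < 0) p = -1;
        }
    }
}

// Integer-divide every position in [p0, p1) by d, compressing the position range.
void toy_seq_div(std::vector<int32_t> & pos, int32_t p0, int32_t p1, int d) {
    assert(d > 0);
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = INT32_MAX;
    for (auto & p : pos) {
        if (p >= p0 && p < p1) {
            p /= d;
        }
    }
}

// Context shift: keep the first n_keep positions, discard the next n_discard,
// then slide the remaining positions down to close the gap.
void toy_context_shift(std::vector<int32_t> & pos, int32_t n_keep, int32_t n_discard) {
    for (auto & p : pos) {
        if (p >= n_keep && p < n_keep + n_discard) {
            p = -1; // freed, as a range removal would do
        }
    }
    toy_seq_add(pos, n_keep + n_discard, -1, -n_discard);
}
```

Starting from positions 0 through 7, `toy_context_shift(pos, 2, 3)` leaves positions 0, 1, 2, 3, 4 in use and three cells free, so new tokens can continue from position 5.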