Implementation:Ollama Ollama Llama KV Cache
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Memory Management |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the primary KV (key-value) cache for transformer attention, handling cache allocation, slot management, sequence operations, defragmentation, state serialization, and the graph-level K-shift operation.
Description
The constructor allocates per-layer key and value tensors with appropriate backend buffer types, supporting both streaming (per-sequence) and unified modes. The class implements slot finding and assignment for incoming batches, sequence operations (rm, cp, keep, add, div) that manipulate the underlying llama_kv_cells metadata, and defragmentation to compact fragmented cache slots. The llama_kv_cache_context class manages batch-level cache state, including slot allocation, stream copy operations, and attention mask/position generation for the compute graph.
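The slot-finding and compaction ideas described above can be illustrated with a toy model. The sketch below is not the code in llama-kv-cache.cpp: `ToyCell`, `toy_find_slot`, and `toy_defrag` are hypothetical names, the search is a plain first-fit scan for a contiguous run of empty cells, and each cell carries only a position, whereas the real llama_kv_cells metadata also tracks sequence membership.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for per-cell metadata (illustration only).
struct ToyCell {
    int32_t pos = -1;  // -1 marks an empty cell

    bool is_empty() const { return pos < 0; }
};

// First-fit search for n_tokens contiguous empty cells.
// Returns the index of the first cell of the run, or std::nullopt if no run fits.
std::optional<uint32_t> toy_find_slot(const std::vector<ToyCell> & cells, uint32_t n_tokens) {
    if (n_tokens == 0 || n_tokens > cells.size()) {
        return std::nullopt;
    }
    uint32_t run = 0;  // length of the current run of empty cells
    for (uint32_t i = 0; i < cells.size(); ++i) {
        run = cells[i].is_empty() ? run + 1 : 0;
        if (run == n_tokens) {
            return i + 1 - n_tokens;
        }
    }
    return std::nullopt;
}

// Toy defragmentation: move occupied cells toward the front, preserving order.
// Returns the number of cells that were moved.
uint32_t toy_defrag(std::vector<ToyCell> & cells) {
    uint32_t dst   = 0;
    uint32_t moved = 0;
    for (uint32_t src = 0; src < cells.size(); ++src) {
        if (cells[src].is_empty()) {
            continue;
        }
        if (src != dst) {
            cells[dst] = cells[src];
            cells[src] = ToyCell{};
            ++moved;
        }
        ++dst;
    }
    return moved;
}
```

In this toy model a batch that does not fit into any free run can succeed after compaction, which is the motivation for defragmenting a fragmented cache.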
Usage
This class is the central memory component for transformer inference: every attention-based model routes key/value storage through it. It keeps the key/value pairs computed for previous tokens so that autoregressive generation can reuse them instead of recomputing them at every step.
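To make the reuse concrete, here is a minimal single-head sketch of what caching buys during decoding. It is an assumption-laden illustration, not the ggml-based implementation: `ToyHeadCache` is a hypothetical type that stores plain float vectors, and attention is ordinary scaled dot-product softmax over whatever has been appended so far.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-head cache: one key and one value vector per past token.
struct ToyHeadCache {
    std::vector<std::vector<float>> keys;
    std::vector<std::vector<float>> values;

    // Store the K/V computed for the newest token so later steps can reuse them.
    void append(const std::vector<float> & k, const std::vector<float> & v) {
        keys.push_back(k);
        values.push_back(v);
    }

    // Scaled dot-product attention of query q over all cached tokens.
    // Assumes the cache is non-empty and all keys have the same size as q.
    std::vector<float> attend(const std::vector<float> & q) const {
        std::vector<float> scores(keys.size());
        float max_score = -INFINITY;
        for (std::size_t t = 0; t < keys.size(); ++t) {
            float dot = 0.0f;
            for (std::size_t d = 0; d < q.size(); ++d) {
                dot += q[d] * keys[t][d];
            }
            scores[t] = dot / std::sqrt((float) q.size());
            if (scores[t] > max_score) {
                max_score = scores[t];
            }
        }
        // Numerically stable softmax over the scores.
        float sum = 0.0f;
        for (float & s : scores) {
            s = std::exp(s - max_score);
            sum += s;
        }
        // Weighted sum of the cached values.
        std::vector<float> out(values[0].size(), 0.0f);
        for (std::size_t t = 0; t < values.size(); ++t) {
            for (std::size_t d = 0; d < out.size(); ++d) {
                out[d] += (scores[t] / sum) * values[t][d];
            }
        }
        return out;
    }
};
```

Each decode step in this sketch appends one key/value pair and attends over the stored ones, so the projections for earlier tokens are computed once.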
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-kv-cache.cpp
- Lines: 1-2100
Signature
llama_kv_cache::llama_kv_cache(
        const llama_model & model,
        ggml_type type_k,
        ggml_type type_v,
        bool v_trans,
        bool offload,
        bool unified,
        uint32_t kv_size,
        uint32_t n_seq_max,
        uint32_t n_pad,
        uint32_t n_swa,
        llama_swa_type swa_type,
        const layer_filter_cb & filter,
        const layer_reuse_cb & reuse);
slot_info_vec_t prepare(const std::vector<llama_ubatch> & ubatches);
slot_info find_slot(const llama_ubatch & ubatch, bool cont) const;
void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
bool update(llama_context * lctx, bool do_shift, const stream_copy_info & sc_info);
Import
#include "llama-kv-cache.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | The model providing layer configuration and device info |
| type_k | ggml_type | Yes | Data type for key cache tensors |
| type_v | ggml_type | Yes | Data type for value cache tensors |
| kv_size | uint32_t | Yes | Total number of KV cache cells |
| n_swa | uint32_t | Yes | Sliding window size (0 for no SWA) |
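As a rough guide to how these inputs translate into memory, the following sketch estimates the combined size of the key and value tensors. It is a back-of-the-envelope calculation, not what the constructor does: `toy_kv_bytes` is a hypothetical helper, `n_layer`, `n_embd_k`, and `n_embd_v` are assumed per-model values (per-token K and V widths per layer), and `bytes_k`/`bytes_v` stand for the element sizes implied by `type_k`/`type_v`. Quantized block types, padding, and per-stream allocation are ignored.

```cpp
#include <cstdint>

// Approximate bytes needed for all K and V tensors of a non-quantized cache.
// All parameters other than kv_size are assumptions supplied by the caller:
//   n_layer            - number of layers that hold a KV cache
//   n_embd_k, n_embd_v - per-token key/value width in elements, per layer
//   bytes_k, bytes_v   - element size of type_k / type_v (e.g. 2 for F16)
uint64_t toy_kv_bytes(uint32_t n_layer, uint32_t kv_size,
                      uint32_t n_embd_k, uint32_t n_embd_v,
                      uint32_t bytes_k, uint32_t bytes_v) {
    const uint64_t per_cell = (uint64_t) n_embd_k * bytes_k
                            + (uint64_t) n_embd_v * bytes_v;
    return (uint64_t) n_layer * kv_size * per_cell;
}
```

For example, 32 layers, 4096 cells, and 1024-element F16 keys and values come to 512 MiB under these assumptions, and halving `kv_size` halves the estimate.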
Outputs
| Name | Type | Description |
|---|---|---|
| slot_info_vec_t | std::vector<slot_info> | Cache slot assignments for ubatches |
| get_k/get_v | ggml_tensor* | Key/value tensor views for graph building |
Usage Examples
// The KV cache is created internally by the memory system.
// Prepare slots for a set of ubatches:
auto sinfos = kv_cache->prepare(ubatches);
// Get key/value tensor views for graph construction
// (ctx is assumed to be the llama_kv_cache_context for the current batch):
ggml_tensor * k = ctx->get_k(ggml_ctx, layer_id);
ggml_tensor * v = ctx->get_v(ggml_ctx, layer_id);
// Sequence operations on the position range [p0, p1); p1 = -1 means no upper bound:
kv_cache->seq_rm(seq_id, 0, -1); // remove all positions of seq_id
kv_cache->seq_add(seq_id, 0, -1, shift); // add shift to every position of seq_id
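The range semantics of the two sequence calls above can be sketched on toy metadata. This is an illustration under stated assumptions, not the library code: `ToySeqCell`, `toy_seq_rm`, and `toy_seq_add` are hypothetical names, ranges are treated as half-open `[p0, p1)` with a negative bound meaning "unbounded", and edge cases such as positions that become negative after a shift are left out.

```cpp
#include <cstdint>
#include <limits>
#include <set>
#include <vector>

// Hypothetical per-cell metadata: a position plus the sequences that use the cell.
struct ToySeqCell {
    int32_t pos = -1;            // -1 marks an empty cell
    std::set<int32_t> seq_ids;   // sequences sharing this cell
};

// Normalize a [p0, p1) range: negative p0 means 0, negative p1 means "no upper bound".
static void toy_norm_range(int32_t & p0, int32_t & p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<int32_t>::max();
}

// Remove seq_id from every cell whose position lies in [p0, p1).
// A cell is freed only when no other sequence still references it.
void toy_seq_rm(std::vector<ToySeqCell> & cells, int32_t seq_id, int32_t p0, int32_t p1) {
    toy_norm_range(p0, p1);
    for (auto & c : cells) {
        if (c.pos >= p0 && c.pos < p1 && c.seq_ids.erase(seq_id) > 0 && c.seq_ids.empty()) {
            c.pos = -1;
        }
    }
}

// Add shift to the position of every cell of seq_id whose position lies in [p0, p1).
void toy_seq_add(std::vector<ToySeqCell> & cells, int32_t seq_id, int32_t p0, int32_t p1, int32_t shift) {
    toy_norm_range(p0, p1);
    for (auto & c : cells) {
        if (c.pos >= p0 && c.pos < p1 && c.seq_ids.count(seq_id) > 0) {
            c.pos += shift;
        }
    }
}
```

Keeping a set of sequence ids per cell is what lets several sequences share a common prefix in this model: removing one sequence leaves the shared cells in place for the others.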