Implementation:Ggml org Llama cpp Memory Header
| Knowledge Sources | |
|---|---|
| Domains | Memory, Abstraction |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Defines the abstract interfaces for LLM memory management, including both the persistent memory store and the per-batch processing context.
Description
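This header declares `llama_memory_i`, the base interface for all memory types (KV cache, recurrent state, hybrid, iSWA). It covers batch initialization (`init_batch`, `init_full`, `init_update`), sequence operations (`seq_rm`, `seq_cp`, `seq_keep`, `seq_add`, `seq_div`), position queries (`seq_pos_min`, `seq_pos_max`), state serialization (`state_write`, `state_read`), and memory breakdown reporting (`memory_breakdown`). It also declares `llama_memory_context_i`, which manages per-batch processing state through `next()`, `apply()`, `get_ubatch()`, and `get_status()`. The `llama_memory_status` enum reports whether an operation succeeded, had nothing to update, or failed during preparation or compute, and `llama_memory_params` holds the K and V cache tensor types plus the `swa_full` flag. Layer filtering and layer reuse are configured through the `layer_filter_cb` and `layer_reuse_cb` callback aliases.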
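The two callback aliases are plain `std::function` types, so a caller can pass lambdas to decide which layers a memory instance handles and which layers share storage. The following is a minimal sketch, not code from llama.cpp: `plan_layers` is a hypothetical helper, and the convention that a negative reuse result means "do not reuse" is an assumption made for illustration.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Same shapes as the aliases declared inside llama_memory_i.
using layer_filter_cb = std::function<bool(int32_t il)>;
using layer_reuse_cb  = std::function<int32_t(int32_t il)>;

// Hypothetical helper (not part of llama.cpp): for each layer, record which
// layer's storage it would use, or -1 if the filter rejects the layer.
// Assumption: a negative reuse result means "allocate own storage".
std::vector<int32_t> plan_layers(int32_t n_layer,
                                 const layer_filter_cb & filter,
                                 const layer_reuse_cb  & reuse) {
    std::vector<int32_t> owner(n_layer, -1);
    for (int32_t il = 0; il < n_layer; ++il) {
        if (filter && !filter(il)) {
            continue; // layer is not handled by this memory instance
        }
        const int32_t src = reuse ? reuse(il) : -1;
        owner[il] = src >= 0 ? src : il;
    }
    return owner;
}
```

For example, `plan_layers(4, [](int32_t il) { return il != 1; }, [](int32_t il) { return il == 3 ? 2 : -1; })` skips layer 1 and lets layer 3 share the storage of layer 2.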
Usage
This is the foundational abstraction layer that every memory type (KV cache, recurrent, hybrid, iSWA) implements. Include this header when writing code that interacts with LLM memory through the generic interface rather than through a concrete memory class.
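To make the per-batch protocol concrete, here is a self-contained sketch of a context implementation and a driver loop. It does not include the real header: `toy_ubatch`, `toy_context_i`, `toy_batch_context`, and `count_tokens` are all hypothetical stand-ins. It assumes that a fresh context already points at its first ubatch, and that `next()` moves on from the current ubatch and returns `false` once none remain.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for llama_ubatch; the real type carries far more than token ids.
struct toy_ubatch {
    std::vector<int> tokens;
};

// Mirrors the shape of llama_memory_context_i (status reporting omitted).
struct toy_context_i {
    virtual ~toy_context_i() = default;
    virtual bool next() = 0;
    virtual bool apply() = 0;
    virtual const toy_ubatch & get_ubatch() const = 0;
};

// Hypothetical implementation: splits a token list into chunks of n_ubatch.
// Assumes a non-empty token list and n_ubatch > 0.
class toy_batch_context : public toy_context_i {
public:
    toy_batch_context(const std::vector<int> & tokens, size_t n_ubatch) {
        for (size_t i = 0; i < tokens.size(); i += n_ubatch) {
            const size_t end = std::min(i + n_ubatch, tokens.size());
            ubatches.push_back(toy_ubatch{std::vector<int>(tokens.begin() + i, tokens.begin() + end)});
        }
    }

    bool next() override { return ++i_cur < ubatches.size(); }
    bool apply() override { ++n_applied; return true; }
    const toy_ubatch & get_ubatch() const override { return ubatches[i_cur]; }

    size_t n_applied = 0; // how many ubatches were applied

private:
    std::vector<toy_ubatch> ubatches;
    size_t i_cur = 0;
};

// Driver: handle the current ubatch first, then advance.
size_t count_tokens(toy_context_i & ctx) {
    size_t n = 0;
    do {
        if (!ctx.apply()) {
            break;
        }
        n += ctx.get_ubatch().tokens.size();
    } while (ctx.next());
    return n;
}
```

The `do { ... } while (ctx.next())` shape matters under the stated assumption: a `while (ctx.next())` loop would advance past the first ubatch before processing anything.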
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-memory.h
- Lines: 1-122
Signature
enum llama_memory_status {
    LLAMA_MEMORY_STATUS_SUCCESS = 0,
    LLAMA_MEMORY_STATUS_NO_UPDATE,
    LLAMA_MEMORY_STATUS_FAILED_PREPARE,
    LLAMA_MEMORY_STATUS_FAILED_COMPUTE,
};

struct llama_memory_params {
    ggml_type type_k;
    ggml_type type_v;
    bool swa_full;
};

struct llama_memory_context_i {
    virtual ~llama_memory_context_i() = default;
    virtual bool next() = 0;
    virtual bool apply() = 0;
    virtual const llama_ubatch & get_ubatch() const = 0;
    virtual llama_memory_status get_status() const = 0;
};

using llama_memory_context_ptr = std::unique_ptr<llama_memory_context_i>;

struct llama_memory_i {
    using layer_filter_cb = std::function<bool(int32_t il)>;
    using layer_reuse_cb = std::function<int32_t(int32_t il)>;

    virtual ~llama_memory_i() = default;

    virtual llama_memory_context_ptr init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) = 0;
    virtual llama_memory_context_ptr init_full() = 0;
    virtual llama_memory_context_ptr init_update(llama_context * lctx, bool optimize) = 0;

    virtual bool get_can_shift() const = 0;
    virtual void clear(bool data) = 0;

    virtual bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    virtual void seq_cp(llama_seq_id src, llama_seq_id dst, llama_pos p0, llama_pos p1) = 0;
    virtual void seq_keep(llama_seq_id seq_id) = 0;
    virtual void seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos shift) = 0;
    virtual void seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d) = 0;

    virtual llama_pos seq_pos_min(llama_seq_id seq_id) const = 0;
    virtual llama_pos seq_pos_max(llama_seq_id seq_id) const = 0;

    virtual std::map<ggml_backend_buffer_type_t, size_t> memory_breakdown() const = 0;

    virtual void state_write(llama_io_write_i & io, llama_seq_id seq_id = -1, llama_state_seq_flags flags = 0) const = 0;
    virtual void state_read(llama_io_read_i & io, llama_seq_id seq_id = -1, llama_state_seq_flags flags = 0) = 0;
};

using llama_memory_ptr = std::unique_ptr<llama_memory_i>;

llama_memory_status llama_memory_status_combine(llama_memory_status s0, llama_memory_status s1);
bool llama_memory_status_is_fail(llama_memory_status status);
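The two free functions are only declared here; their bodies live outside this header. The sketch below shows one plausible reading of their behavior and should not be taken as the actual implementation. The enum is copied from the signature listing above, while `is_fail_sketch` and `combine_sketch` are hypothetical names, and the rule "first failure wins, otherwise report success if either side has an update" is an assumption.

```cpp
// Enum values copied from the signature listing above.
enum llama_memory_status {
    LLAMA_MEMORY_STATUS_SUCCESS = 0,
    LLAMA_MEMORY_STATUS_NO_UPDATE,
    LLAMA_MEMORY_STATUS_FAILED_PREPARE,
    LLAMA_MEMORY_STATUS_FAILED_COMPUTE,
};

// Sketch: a status counts as a failure if it is one of the FAILED_* values.
bool is_fail_sketch(llama_memory_status s) {
    return s == LLAMA_MEMORY_STATUS_FAILED_PREPARE ||
           s == LLAMA_MEMORY_STATUS_FAILED_COMPUTE;
}

// Sketch (assumed semantics): the first failure wins; otherwise the result is
// SUCCESS if either side has something to apply, else NO_UPDATE.
llama_memory_status combine_sketch(llama_memory_status s0, llama_memory_status s1) {
    if (is_fail_sketch(s0)) {
        return s0;
    }
    if (is_fail_sketch(s1)) {
        return s1;
    }
    const bool has_update = s0 == LLAMA_MEMORY_STATUS_SUCCESS ||
                            s1 == LLAMA_MEMORY_STATUS_SUCCESS;
    return has_update ? LLAMA_MEMORY_STATUS_SUCCESS : LLAMA_MEMORY_STATUS_NO_UPDATE;
}
```

A combiner of this kind is useful for composite memory types (for example, hybrid or iSWA) that wrap two child contexts and need to report a single status.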
Import
#include "llama-memory.h"
// Dependencies:
#include "llama.h"
#include <map>
#include <memory>
#include <functional>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| balloc | llama_batch_allocr & | Yes | Batch allocator for init_batch |
| n_ubatch | uint32_t | Yes | Maximum tokens per micro-batch |
| embd_all | bool | Yes | Whether all tokens produce embeddings |
| seq_id | llama_seq_id | Yes | Sequence ID for sequence operations |
| p0 / p1 | llama_pos | Yes | Position range for sequence operations |
| lctx | llama_context * | Yes | Context for init_update |
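The sequence operations all take a sequence ID and a position range. The toy model below illustrates how `seq_rm`, `seq_add`, and `seq_div` could act on a flat list of cells. It is a sketch under stated assumptions, not the behavior of any real memory class: `toy_cell` and `toy_memory` are hypothetical, the range is treated as half-open `[p0, p1)`, and a negative bound is treated as open-ended.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

using toy_seq_id = int32_t;
using toy_pos    = int32_t;

struct toy_cell {
    toy_seq_id seq_id;
    toy_pos    pos;
};

// Assumption: a negative bound means "open-ended", giving the range [p0, p1).
void normalize_range(toy_pos & p0, toy_pos & p1) {
    if (p0 < 0) p0 = 0;
    if (p1 < 0) p1 = std::numeric_limits<toy_pos>::max();
}

struct toy_memory {
    std::vector<toy_cell> cells;

    // Remove the cells of seq_id whose position lies in [p0, p1).
    bool seq_rm(toy_seq_id seq_id, toy_pos p0, toy_pos p1) {
        normalize_range(p0, p1);
        cells.erase(std::remove_if(cells.begin(), cells.end(), [&](const toy_cell & c) {
            return c.seq_id == seq_id && c.pos >= p0 && c.pos < p1;
        }), cells.end());
        return true;
    }

    // Add shift to every position of seq_id in [p0, p1).
    void seq_add(toy_seq_id seq_id, toy_pos p0, toy_pos p1, toy_pos shift) {
        normalize_range(p0, p1);
        for (auto & c : cells) {
            if (c.seq_id == seq_id && c.pos >= p0 && c.pos < p1) c.pos += shift;
        }
    }

    // Integer-divide every position of seq_id in [p0, p1) by d.
    void seq_div(toy_seq_id seq_id, toy_pos p0, toy_pos p1, int d) {
        normalize_range(p0, p1);
        for (auto & c : cells) {
            if (c.seq_id == seq_id && c.pos >= p0 && c.pos < p1) c.pos /= d;
        }
    }
};
```

Under these assumptions, `seq_rm(seq_id, 0, -1)` and `seq_rm(seq_id, -1, -1)` both clear an entire sequence, which matches the way `-1` is used in the usage example on this page.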
Outputs
| Name | Type | Description |
|---|---|---|
| init_batch return | llama_memory_context_ptr | Processing context for batch inference |
| init_full return | llama_memory_context_ptr | Context simulating full cache for buffer allocation |
| init_update return | llama_memory_context_ptr | Context for pending memory updates (shifts, copies) |
| get_status() | llama_memory_status | Status of the current memory operation |
| memory_breakdown() | std::map<ggml_backend_buffer_type_t, size_t> | Memory usage by buffer type |
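Because `memory_breakdown()` returns a map keyed by buffer type, results from several memory objects can be summed key by key. The sketch below shows that idea with stand-in types so it compiles on its own: `toy_breakdown` uses string keys in place of the opaque `ggml_backend_buffer_type_t` handle, and `merge_breakdowns` and `total_bytes` are hypothetical helpers, not llama.cpp functions.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Stand-in for std::map<ggml_backend_buffer_type_t, size_t>; string keys
// replace the opaque buffer-type handle for illustration only.
using toy_breakdown = std::map<std::string, size_t>;

// Hypothetical helper: sum per-buffer-type byte counts from two memories.
toy_breakdown merge_breakdowns(toy_breakdown a, const toy_breakdown & b) {
    for (const auto & kv : b) {
        a[kv.first] += kv.second; // missing keys start at 0
    }
    return a;
}

// Hypothetical helper: total bytes across all buffer types.
size_t total_bytes(const toy_breakdown & mb) {
    size_t total = 0;
    for (const auto & kv : mb) {
        total += kv.second;
    }
    return total;
}
```

A composite memory type that wraps two children could report its own breakdown this way, though whether the real implementations do so is not stated on this page.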
Usage Examples
#include "llama-memory.h"

// Generic memory usage through the interface.
// get_memory(), balloc, n_ubatch, and seq_id are placeholders supplied by the caller.
llama_memory_i * memory = get_memory();

// Initialize batch processing
auto ctx = memory->init_batch(balloc, n_ubatch, /*embd_all=*/false);
if (!ctx || llama_memory_status_is_fail(ctx->get_status())) {
    // handle error and stop here
}

// Process ubatches: handle the current ubatch first, then advance with next()
do {
    if (!ctx->apply()) {
        // handle error
        break;
    }
    const auto & ubatch = ctx->get_ubatch();
    // build and compute graph
} while (ctx->next());

// Sequence management
memory->seq_rm(seq_id, 0, -1); // remove all positions of seq_id
memory->seq_cp(0, 1, 0, -1);   // copy sequence 0 to sequence 1
memory->seq_keep(0);           // keep only sequence 0