Implementation:Ggml org Llama cpp Memory Header

From Leeroopedia
Revision as of 12:41, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Ggml_org_Llama_cpp_Memory_Header.md)
Knowledge Sources
Domains Memory, Abstraction
Last Updated 2026-02-15 00:00 GMT

Overview

Defines the abstract interfaces for LLM memory management, including both the persistent memory store and the per-batch processing context.

Description

This header declares `llama_memory_i`, the base interface for all memory types (KV cache, recurrent state, hybrid, iSWA). It provides methods for batch initialization, sequence operations (`seq_rm`, `seq_cp`, `seq_keep`, `seq_add`, `seq_div`), position queries, state serialization, and memory breakdown reporting. The header also defines `llama_memory_context_i`, which manages per-batch processing state through `next()`, `apply()`, `get_ubatch()`, and `get_status()`. The `llama_memory_status` enum reports whether an operation succeeded, made no update, or failed, and `llama_memory_params` holds cache configuration (the K and V tensor types and the `swa_full` flag). Layer filtering and layer reuse are supported through the `layer_filter_cb` and `layer_reuse_cb` callback typedefs.
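
As an illustration of the layer callbacks, the following minimal sketch copies the two typedefs out of `llama_memory_i` so that it compiles on its own. The helper `count_selected_layers`, its `n_layer` parameter, and the treatment of an unset callback are hypothetical and not part of llama.cpp. The convention that a negative return value from the reuse callback means "do not reuse" is an assumption made for this example.

```cpp
#include <cstdint>
#include <functional>

// Stand-ins for the callback typedefs declared inside llama_memory_i.
using layer_filter_cb = std::function<bool(int32_t il)>;
using layer_reuse_cb  = std::function<int32_t(int32_t il)>;

// Hypothetical helper: count the layers that would get their own memory.
// A layer is skipped if the filter rejects it or if it reuses another layer.
// An unset callback is treated as "accept every layer" / "reuse nothing".
inline int32_t count_selected_layers(int32_t n_layer,
                                     const layer_filter_cb & filter,
                                     const layer_reuse_cb  & reuse) {
    int32_t n = 0;
    for (int32_t il = 0; il < n_layer; ++il) {
        if (filter && !filter(il)) {
            continue; // rejected by the filter
        }
        if (reuse && reuse(il) >= 0) {
            continue; // assumed convention: non-negative means "reuse that layer"
        }
        ++n;
    }
    return n;
}
```

Passing the policies as `std::function` objects keeps the memory implementation independent of how a particular model decides which layers need their own storage.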

Usage

This header is the foundational abstraction layer implemented by every memory type (KV cache, recurrent, hybrid, iSWA). Include it when writing code that interacts with LLM memory through the generic interface rather than through a concrete implementation.

Code Reference

Source Location

Signature

enum llama_memory_status {
    LLAMA_MEMORY_STATUS_SUCCESS = 0,
    LLAMA_MEMORY_STATUS_NO_UPDATE,
    LLAMA_MEMORY_STATUS_FAILED_PREPARE,
    LLAMA_MEMORY_STATUS_FAILED_COMPUTE,
};

struct llama_memory_params {
    ggml_type type_k;
    ggml_type type_v;
    bool swa_full;
};

struct llama_memory_context_i {
    virtual ~llama_memory_context_i() = default;
    virtual bool next() = 0;
    virtual bool apply() = 0;
    virtual const llama_ubatch & get_ubatch() const = 0;
    virtual llama_memory_status get_status() const = 0;
};

using llama_memory_context_ptr = std::unique_ptr<llama_memory_context_i>;

struct llama_memory_i {
    using layer_filter_cb = std::function<bool(int32_t il)>;
    using layer_reuse_cb = std::function<int32_t(int32_t il)>;

    virtual llama_memory_context_ptr init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) = 0;
    virtual llama_memory_context_ptr init_full() = 0;
    virtual llama_memory_context_ptr init_update(llama_context * lctx, bool optimize) = 0;

    virtual bool get_can_shift() const = 0;
    virtual void clear(bool data) = 0;
    virtual bool seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) = 0;
    virtual void seq_cp(llama_seq_id src, llama_seq_id dst, llama_pos p0, llama_pos p1) = 0;
    virtual void seq_keep(llama_seq_id seq_id) = 0;
    virtual void seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos shift) = 0;
    virtual void seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d) = 0;
    virtual llama_pos seq_pos_min(llama_seq_id seq_id) const = 0;
    virtual llama_pos seq_pos_max(llama_seq_id seq_id) const = 0;

    virtual std::map<ggml_backend_buffer_type_t, size_t> memory_breakdown() const = 0;

    virtual void state_write(llama_io_write_i & io, llama_seq_id seq_id = -1, llama_state_seq_flags flags = 0) const = 0;
    virtual void state_read(llama_io_read_i & io, llama_seq_id seq_id = -1, llama_state_seq_flags flags = 0) = 0;
};

using llama_memory_ptr = std::unique_ptr<llama_memory_i>;

llama_memory_status llama_memory_status_combine(llama_memory_status s0, llama_memory_status s1);
bool llama_memory_status_is_fail(llama_memory_status status);
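
The two free functions are only declared in this header. The sketch below shows one plausible reading of their semantics: any `FAILED_*` value is a failure, the first failure wins when combining, and the combined result is `SUCCESS` if either input performed an update. The `_sketch` function names are hypothetical, and the bodies are an assumption for illustration, not the library's implementation.

```cpp
// Copy of the enum from the header so the sketch compiles on its own.
enum llama_memory_status {
    LLAMA_MEMORY_STATUS_SUCCESS = 0,
    LLAMA_MEMORY_STATUS_NO_UPDATE,
    LLAMA_MEMORY_STATUS_FAILED_PREPARE,
    LLAMA_MEMORY_STATUS_FAILED_COMPUTE,
};

// Assumed semantics: only the FAILED_* values count as failures.
inline bool status_is_fail_sketch(llama_memory_status s) {
    return s == LLAMA_MEMORY_STATUS_FAILED_PREPARE ||
           s == LLAMA_MEMORY_STATUS_FAILED_COMPUTE;
}

// Assumed semantics: the first failure is propagated; otherwise the result
// is SUCCESS if at least one input made an update, else NO_UPDATE.
inline llama_memory_status status_combine_sketch(llama_memory_status s0,
                                                 llama_memory_status s1) {
    if (status_is_fail_sketch(s0)) {
        return s0;
    }
    if (status_is_fail_sketch(s1)) {
        return s1;
    }
    const bool has_update = s0 == LLAMA_MEMORY_STATUS_SUCCESS ||
                            s1 == LLAMA_MEMORY_STATUS_SUCCESS;
    return has_update ? LLAMA_MEMORY_STATUS_SUCCESS
                      : LLAMA_MEMORY_STATUS_NO_UPDATE;
}
```

A combining helper of this kind is useful for memory types that wrap several sub-memories, such as hybrid or iSWA, where one status has to summarize the status of each part.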

Import

#include "llama-memory.h"
// Dependencies:
#include "llama.h"
#include <map>
#include <memory>
#include <functional>

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| balloc | llama_batch_allocr & | Yes | Batch allocator for init_batch |
| n_ubatch | uint32_t | Yes | Maximum tokens per micro-batch |
| embd_all | bool | Yes | Whether all tokens produce embeddings |
| seq_id | llama_seq_id | Yes | Sequence ID for sequence operations |
| p0 / p1 | llama_pos | Yes | Position range for sequence operations |
| lctx | llama_context * | Yes | Context for init_update |
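
The position range `p0` / `p1` is worth a small worked example. The sketch below assumes the convention used by the public llama.cpp sequence functions: the range is half-open, `[p0, p1)`, a negative `p0` means "from position 0", and a negative `p1` means "to the end". The helpers `normalize_range` and `in_range` are hypothetical names introduced only for this illustration.

```cpp
#include <cstdint>
#include <limits>
#include <utility>

using llama_pos = int32_t; // stand-in for the typedef in llama.h

// Hypothetical helper: turn a (p0, p1) argument pair into an explicit
// half-open range [first, second), resolving negative "open" bounds.
inline std::pair<llama_pos, llama_pos> normalize_range(llama_pos p0, llama_pos p1) {
    if (p0 < 0) {
        p0 = 0; // open lower bound
    }
    if (p1 < 0) {
        p1 = std::numeric_limits<llama_pos>::max(); // open upper bound
    }
    return {p0, p1};
}

// Hypothetical helper: does a position fall inside the requested range?
inline bool in_range(llama_pos pos, llama_pos p0, llama_pos p1) {
    const auto r = normalize_range(p0, p1);
    return pos >= r.first && pos < r.second;
}
```

Under this convention a call such as `seq_rm(seq_id, 0, -1)` covers every position of the sequence, while `seq_rm(seq_id, 0, 5)` covers positions 0 through 4 only.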

Outputs

| Name | Type | Description |
|------|------|-------------|
| init_batch return | llama_memory_context_ptr | Processing context for batch inference |
| init_full return | llama_memory_context_ptr | Context simulating full cache for buffer allocation |
| init_update return | llama_memory_context_ptr | Context for pending memory updates (shifts, copies) |
| get_status() | llama_memory_status | Status of the current memory operation |
| memory_breakdown() | std::map<ggml_backend_buffer_type_t, size_t> | Memory usage by buffer type |

Usage Examples

#include "llama-memory.h"

// Generic memory usage through the interface
llama_memory_i * memory = get_memory();

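// Initialize batch processing
auto ctx = memory->init_batch(balloc, n_ubatch, false);
if (!ctx || llama_memory_status_is_fail(ctx->get_status())) {
    // handle error
}

// Process ubatches. The context is positioned on its first ubatch after
// init_batch(), and next() advances to the following one, so a do/while
// loop is needed to avoid skipping the first ubatch.
do {
    if (!ctx->apply()) {
        // handle error
        break;
    }
    const auto & ubatch = ctx->get_ubatch();
    // build and compute graph
} while (ctx->next());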

// Sequence management
memory->seq_rm(seq_id, 0, -1);   // remove all positions
memory->seq_cp(0, 1, 0, -1);     // copy sequence 0 to 1
memory->seq_keep(0);              // keep only sequence 0
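
The per-batch iteration protocol can also be tried out without llama.cpp. The sketch below uses a toy context that mirrors `next()`, `apply()`, and `get_ubatch()`. The types `toy_ubatch` and `toy_context` and the function `process_all` are hypothetical stand-ins, and the sketch assumes that a freshly created context is already positioned on its first micro-batch, with `next()` advancing to the following one.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for llama_ubatch: only a token count, for illustration.
struct toy_ubatch {
    int n_tokens;
};

// Hypothetical toy context mirroring the llama_memory_context_i methods.
class toy_context {
public:
    explicit toy_context(std::vector<toy_ubatch> ubatches)
        : ubatches_(std::move(ubatches)) {}

    // Advance to the next micro-batch; false when none are left.
    bool next() { return ++cur_ < ubatches_.size(); }

    // Apply the state for the current micro-batch; false on failure.
    bool apply() { return cur_ < ubatches_.size(); }

    const toy_ubatch & get_ubatch() const { return ubatches_[cur_]; }

private:
    std::vector<toy_ubatch> ubatches_;
    std::size_t cur_ = 0;
};

// Visit every micro-batch once and return the number of tokens processed,
// or -1 if apply() reports a failure.
inline int process_all(toy_context & ctx) {
    int total = 0;
    do {
        if (!ctx.apply()) {
            return -1;
        }
        total += ctx.get_ubatch().n_tokens; // "build and compute graph" goes here
    } while (ctx.next());
    return total;
}
```

With micro-batches of 3, 4, and 1 tokens, `process_all` visits all three. A `while (ctx.next())` loop over the same toy context would skip the first micro-batch, because `next()` advances before the body runs.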
