Implementation: ggml-org/llama.cpp Memory Hybrid Header
| Knowledge Sources | |
|---|---|
| Domains | Memory, Hybrid |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares the hybrid memory class that combines KV cache and recurrent state memory for models with mixed attention and recurrent layers.
Description
`llama_memory_hybrid` implements `llama_memory_i` by composing `llama_kv_cache` (for attention layers) and `llama_memory_recurrent` (for recurrent layers), with configurable layer filter callbacks. `llama_memory_hybrid_context` coordinates the two sub-contexts, providing accessors `get_attn()` and `get_recr()` so graph builders can retrieve the appropriate memory context for each layer type.
Usage
Include this header when implementing or modifying support for hybrid transformer/SSM models like Jamba. It is a key abstraction that enables the unified memory interface to work with architectures mixing attention and recurrent mechanisms.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: src/llama-memory-hybrid.h
- Lines: 1-139
Signature
class llama_memory_hybrid : public llama_memory_i {
public:
    llama_memory_hybrid(
            const llama_model & model,
            ggml_type type_k, ggml_type type_v, bool v_trans,
            uint32_t kv_size, uint32_t n_pad, uint32_t n_swa, llama_swa_type swa_type,
            ggml_type type_r, ggml_type type_s, uint32_t rs_size,
            uint32_t n_seq_max, bool offload, bool unified,
            const layer_filter_cb & filter_attn = nullptr,
            const layer_filter_cb & filter_recr = nullptr);

    llama_memory_context_ptr init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) override;
    llama_memory_context_ptr init_full() override;
    llama_memory_context_ptr init_update(llama_context * lctx, bool optimize) override;

    llama_kv_cache         * get_mem_attn() const;
    llama_memory_recurrent * get_mem_recr() const;
};
class llama_memory_hybrid_context : public llama_memory_context_i {
public:
    bool next()  override;
    bool apply() override;

    llama_memory_status  get_status() const override;
    const llama_ubatch & get_ubatch() const override;

    const llama_kv_cache_context         * get_attn() const;
    const llama_memory_recurrent_context * get_recr() const;
};
Import
#include "llama-memory-hybrid.h"
// Dependencies:
#include "llama-batch.h"
#include "llama-graph.h"
#include "llama-kv-cache.h"
#include "llama-memory.h"
#include "llama-memory-recurrent.h"
#include <memory>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model reference with hparams describing layer types |
| type_k / type_v | ggml_type | Yes | Key/value cache data types for attention layers |
| type_r / type_s | ggml_type | Yes | Recurrent state data types |
| kv_size | uint32_t | Yes | KV cache size for attention layers |
| rs_size | uint32_t | Yes | Recurrent state memory size |
| filter_attn | const layer_filter_cb & | No | Attention layer filter (default: !is_recurrent) |
| filter_recr | const layer_filter_cb & | No | Recurrent layer filter (default: is_recurrent) |
Outputs
| Name | Type | Description |
|---|---|---|
| get_mem_attn() | llama_kv_cache * | Pointer to the composed KV cache for attention layers |
| get_mem_recr() | llama_memory_recurrent * | Pointer to the composed recurrent memory |
| get_attn() | const llama_kv_cache_context * | Attention context for graph building |
| get_recr() | const llama_memory_recurrent_context * | Recurrent context for graph building |
Usage Examples
#include "llama-memory-hybrid.h"
// Access sub-memory components
auto * attn = hybrid_mem->get_mem_attn();
auto * recr = hybrid_mem->get_mem_recr();
// During graph building, access the appropriate context
const auto * attn_ctx = hybrid_ctx->get_attn();
const auto * recr_ctx = hybrid_ctx->get_recr();