Implementation: ggml-org/llama.cpp Memory Hybrid
| Knowledge Sources | |
|---|---|
| Domains | Memory, Hybrid |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements hybrid memory management for models that combine attention-based and recurrent layers (e.g., Jamba).
Description
This implementation composes a `llama_kv_cache` for the attention layers and a `llama_memory_recurrent` for the recurrent layers, using layer-filter callbacks to route each layer to the appropriate memory backend based on `hparams.is_recurrent()`. Batch initialization follows the recurrent splitting pattern (sequential, equal-size ubatches) and then validates slot availability in both sub-memories. All sequence operations (rm, cp, keep, add, div) and state serialization are delegated to both sub-memories. The `llama_memory_hybrid_context` wraps both sub-contexts, coordinating their `next()`/`apply()` calls and combining their status values.
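The layer-routing idea above can be sketched in isolation. In this standalone example the types `hparams_sketch` and `route_layers` are illustrative stand-ins (not the llama.cpp API); only the predicate shape mirrors the default filters, where attention takes `!is_recurrent(il)` and the recurrent backend takes `is_recurrent(il)`:

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Illustrative stand-in for the model hyperparameters' per-layer flag,
// analogous to hparams.is_recurrent(il) in llama.cpp.
struct hparams_sketch {
    std::vector<bool> recurrent;
    bool is_recurrent(uint32_t il) const { return recurrent[il]; }
};

using layer_filter_cb = std::function<bool(uint32_t il)>;

// Split layer indices between the two backends using the default filters:
// the attention cache takes non-recurrent layers, the recurrent memory
// takes recurrent layers. Returns {attention layers, recurrent layers}.
std::pair<std::vector<uint32_t>, std::vector<uint32_t>>
route_layers(const hparams_sketch & hp, uint32_t n_layer) {
    layer_filter_cb filter_attn = [&](uint32_t il) { return !hp.is_recurrent(il); };
    layer_filter_cb filter_recr = [&](uint32_t il) { return  hp.is_recurrent(il); };

    std::vector<uint32_t> attn, recr;
    for (uint32_t il = 0; il < n_layer; ++il) {
        if (filter_attn(il)) attn.push_back(il);
        if (filter_recr(il)) recr.push_back(il);
    }
    return {attn, recr};
}
```

In the real constructor these two callbacks are the optional `filter_attn`/`filter_recr` parameters; passing custom callbacks lets a caller override the default per-layer routing.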
Usage
This module is used internally for hybrid architectures like Jamba that interleave transformer attention and Mamba-style recurrent layers within the same model. It is instantiated during model initialization when the model hparams indicate a hybrid architecture.
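As a rough illustration of the "hparams indicate a hybrid architecture" condition, a model qualifies when its layers include both recurrent and non-recurrent entries. The `layer_types` struct and `is_hybrid` helper below are hypothetical, not llama.cpp functions:

```cpp
#include <vector>

// Hypothetical per-layer type listing; one flag per layer.
struct layer_types {
    std::vector<bool> recurrent;
};

// A model is "hybrid" only if it interleaves both kinds of layers,
// i.e. it has at least one recurrent and at least one attention layer.
bool is_hybrid(const layer_types & lt) {
    bool has_recr = false;
    bool has_attn = false;
    for (bool r : lt.recurrent) {
        if (r) {
            has_recr = true;
        } else {
            has_attn = true;
        }
    }
    return has_recr && has_attn;
}
```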
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: src/llama-memory-hybrid.cpp
- Lines: 1-268
Signature
// Constructor
llama_memory_hybrid::llama_memory_hybrid(
const llama_model & model,
ggml_type type_k, ggml_type type_v, bool v_trans,
uint32_t kv_size, uint32_t n_pad, uint32_t n_swa, llama_swa_type swa_type,
ggml_type type_r, ggml_type type_s, uint32_t rs_size,
uint32_t n_seq_max, bool offload, bool unified,
const layer_filter_cb & filter_attn, const layer_filter_cb & filter_recr);
// Memory interface
llama_memory_context_ptr llama_memory_hybrid::init_batch(
llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all);
llama_memory_context_ptr llama_memory_hybrid::init_full();
llama_memory_context_ptr llama_memory_hybrid::init_update(
llama_context * lctx, bool optimize);
Import
#include "llama-memory-hybrid.h"
#include "llama-impl.h"
#include "llama-model.h"
#include "llama-context.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model reference with hparams for layer type detection |
| type_k / type_v | ggml_type | Yes | Key/value cache data types for attention layers |
| type_r / type_s | ggml_type | Yes | Recurrent state data types (r and s tensors) |
| kv_size | uint32_t | Yes | KV cache size for attention layers |
| rs_size | uint32_t | Yes | Recurrent state memory size |
| n_swa | uint32_t | Yes | Sliding window attention size |
| swa_type | llama_swa_type | Yes | Type of sliding window attention |
| filter_attn | const layer_filter_cb & | No | Callback for attention layer filtering (default: !is_recurrent) |
| filter_recr | const layer_filter_cb & | No | Callback for recurrent layer filtering (default: is_recurrent) |
Outputs
| Name | Type | Description |
|---|---|---|
| init_batch return | llama_memory_context_ptr | Context coordinating both attention and recurrent sub-contexts |
| get_mem_attn() | llama_kv_cache * | Pointer to the attention KV cache |
| get_mem_recr() | llama_memory_recurrent * | Pointer to the recurrent state memory |
Usage Examples
#include "llama-memory-hybrid.h"
// Created internally during model initialization for Jamba-style models
auto mem = std::make_unique<llama_memory_hybrid>(
model, type_k, type_v, v_trans,
kv_size, n_pad, n_swa, swa_type,
type_r, type_s, rs_size,
n_seq_max, offload, unified);
// Batch processing
auto ctx = mem->init_batch(balloc, n_ubatch, embd_all);
while (ctx->next()) {
    ctx->apply();
    // graph building accesses ctx->get_attn() and ctx->get_recr()
}
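The Description notes that the hybrid context combines the status values of its two sub-contexts. A minimal sketch of one plausible combining rule is below; the enum names and the `combine_status` function are illustrative, not the llama.cpp API. The idea is that a failure in either sub-memory fails the whole operation, and success requires at least one side to have done work:

```cpp
// Illustrative status values for a memory operation on one sub-context.
enum mem_status {
    MEM_STATUS_SUCCESS,   // the sub-memory applied an update
    MEM_STATUS_NO_UPDATE, // nothing to do for this sub-memory
    MEM_STATUS_FAILED,    // the sub-memory could not satisfy the request
};

// Combine the statuses of the attention and recurrent sub-contexts:
// any failure dominates; otherwise success if either side updated.
mem_status combine_status(mem_status s_attn, mem_status s_recr) {
    if (s_attn == MEM_STATUS_FAILED || s_recr == MEM_STATUS_FAILED) {
        return MEM_STATUS_FAILED;
    }
    if (s_attn == MEM_STATUS_SUCCESS || s_recr == MEM_STATUS_SUCCESS) {
        return MEM_STATUS_SUCCESS;
    }
    return MEM_STATUS_NO_UPDATE;
}
```

Under this rule, `init_batch` can report failure to the caller as soon as either the KV-cache slot search or the recurrent slot validation fails, which matches the "validates slots in both sub-memories" behavior described above.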