Implementation: ggml-org/llama.cpp Memory Hybrid iSWA
| Knowledge Sources | Details |
|---|---|
| Domains | Memory, Hybrid |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements hybrid memory management combining iSWA (interleaved Sliding Window Attention) KV cache with recurrent state memory.
Description
This implementation composes `llama_kv_cache_iswa` (which maintains separate base and SWA caches) and `llama_memory_recurrent`, routing layers via filter callbacks based on `hparams.is_recurrent()`. It follows the same batch splitting and coordination pattern as `llama_memory_hybrid` but uses the iSWA cache variant, which tracks separate slot info vectors for base and SWA layers. The context class coordinates both sub-contexts and combines their status values.
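The status-combining step can be pictured with a small self-contained sketch. The enum values and combine rule below are a simplified stand-in for llama.cpp's `llama_memory_status` handling, not the actual implementation:

```cpp
#include <cassert>

// Simplified stand-in for llama.cpp's memory status values (assumption:
// the real enum distinguishes more failure modes than shown here).
enum class mem_status {
    success,
    no_update, // nothing to do for this sub-memory
    failed,
};

// Sketch of the coordination rule: a failure in either sub-context fails
// the whole hybrid context, and pending work in either keeps it at success.
static mem_status combine_status(mem_status attn, mem_status recr) {
    if (attn == mem_status::failed || recr == mem_status::failed) {
        return mem_status::failed;
    }
    if (attn == mem_status::success || recr == mem_status::success) {
        return mem_status::success;
    }
    return mem_status::no_update;
}
```

In this sketch the hybrid context stays usable only while both the iSWA and recurrent sub-contexts report a non-failing status.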
Usage
This module is used internally by models that combine sliding window attention with recurrent layers (e.g., Cohere2, Gemma2-style models with recurrent components). It provides the most feature-rich hybrid memory configuration in llama.cpp.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: src/llama-memory-hybrid-iswa.cpp
- Lines: 1-275
Signature
```cpp
// Constructor
llama_memory_hybrid_iswa::llama_memory_hybrid_iswa(
        const llama_model & model,
        ggml_type type_k, ggml_type type_v, bool v_trans, bool swa_full,
        uint32_t kv_size, uint32_t n_ubatch, uint32_t n_pad,
        ggml_type type_r, ggml_type type_s, uint32_t rs_size,
        uint32_t n_seq_max, bool offload, bool unified,
        const layer_filter_cb & filter_attn, const layer_filter_cb & filter_recr);

// Memory interface methods
llama_memory_context_ptr llama_memory_hybrid_iswa::init_batch(
        llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all);
llama_memory_context_ptr llama_memory_hybrid_iswa::init_full();
llama_memory_context_ptr llama_memory_hybrid_iswa::init_update(
        llama_context * lctx, bool optimize);
```
Import
```cpp
#include "llama-memory-hybrid-iswa.h"
#include "llama-impl.h"
#include "llama-model.h"
#include "llama-context.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model with hparams describing layer types |
| type_k / type_v | ggml_type | Yes | Key/value cache data types for attention layers |
| type_r / type_s | ggml_type | Yes | Recurrent state data types (r and s tensors) |
| kv_size | uint32_t | Yes | Size of the KV cache for attention layers |
| rs_size | uint32_t | Yes | Size of the recurrent state memory |
| filter_attn | const layer_filter_cb & | No | Filter callback for attention layers (defaults to !is_recurrent) |
| filter_recr | const layer_filter_cb & | No | Filter callback for recurrent layers (defaults to is_recurrent) |
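The default filters in the table above can be illustrated with a minimal mock. The `mock_hparams` type below is hypothetical; in llama.cpp the callbacks consult `hparams.is_recurrent()` on the real model:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical mock of the relevant hparams query; the real struct lives
// in llama.cpp and knows which layers of the model are recurrent.
struct mock_hparams {
    std::vector<bool> recurrent; // per-layer flag
    bool is_recurrent(uint32_t il) const { return recurrent[il]; }
};

using layer_filter_cb = std::function<bool(uint32_t il)>;

// Sketch of the default routing: the iSWA attention cache takes every
// layer that is not recurrent, the recurrent memory takes the rest.
inline layer_filter_cb default_filter_attn(const mock_hparams & hp) {
    return [&hp](uint32_t il) { return !hp.is_recurrent(il); };
}
inline layer_filter_cb default_filter_recr(const mock_hparams & hp) {
    return [&hp](uint32_t il) { return hp.is_recurrent(il); };
}
```

Passing explicit `filter_attn` / `filter_recr` callbacks would override this default partition, which is why the table marks them as optional.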
Outputs
| Name | Type | Description |
|---|---|---|
| init_batch return | llama_memory_context_ptr | Context coordinating both iSWA attention and recurrent memory |
| get_mem_attn() | llama_kv_cache_iswa * | Access to the iSWA attention cache |
| get_mem_recr() | llama_memory_recurrent * | Access to the recurrent memory |
Usage Examples
```cpp
#include "llama-memory-hybrid-iswa.h"

// Created internally during model initialization for hybrid iSWA models
auto mem = std::make_unique<llama_memory_hybrid_iswa>(
        model, type_k, type_v, v_trans, swa_full,
        kv_size, n_ubatch, n_pad,
        type_r, type_s, rs_size,
        n_seq_max, offload, unified);

// Batch processing
auto ctx = mem->init_batch(balloc, n_ubatch, embd_all);
while (ctx->next()) {
    ctx->apply();
    // graph building uses ctx->get_attn() and ctx->get_recr()
}
```
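The `next()` / `apply()` loop above follows the general memory-context contract: `init_batch` pre-splits the batch into ubatches, and each iteration commits one ubatch to both sub-memories. A toy, self-contained mock of that iteration shape (the names and internals here are illustrative, not llama.cpp's actual classes):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative mock of the iteration contract: the batch is pre-split
// into ubatches, next() reports whether one is still pending, and
// apply() commits the current ubatch and advances to the next.
struct mock_memory_context {
    explicit mock_memory_context(std::vector<int> ub)
        : ubatches(std::move(ub)) {}

    bool next() { return i < ubatches.size(); } // a ubatch remains?
    void apply() { ++applied; ++i; }            // commit and advance

    std::vector<int> ubatches; // stand-in for the pre-split ubatches
    std::size_t i = 0;
    int applied = 0;
};
```

In the real hybrid iSWA context, `apply()` would forward to both the iSWA attention sub-context and the recurrent sub-context so the two stay on the same ubatch.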