
Implementation:Ggml org Llama cpp Memory Hybrid ISWA

From Leeroopedia
Knowledge Sources
Domains Memory, Hybrid
Last Updated 2026-02-15 00:00 GMT

Overview

Implements hybrid memory management that combines an iSWA (interleaved Sliding Window Attention) KV cache with recurrent state memory.

Description

This implementation composes `llama_kv_cache_iswa` (which maintains separate base and SWA caches) and `llama_memory_recurrent`, routing layers via filter callbacks based on `hparams.is_recurrent()`. It follows the same batch splitting and coordination pattern as `llama_memory_hybrid` but uses the iSWA cache variant, which tracks separate slot info vectors for base and SWA layers. The context class coordinates both sub-contexts and combines their status values.

Usage

This module is used internally by models that combine sliding window attention with recurrent layers (e.g., Cohere2, Gemma2-style models with recurrent components). It provides the most feature-rich hybrid memory configuration in llama.cpp.

Code Reference

Source Location

Signature

// Constructor
llama_memory_hybrid_iswa::llama_memory_hybrid_iswa(
    const llama_model & model,
    ggml_type type_k, ggml_type type_v, bool v_trans, bool swa_full,
    uint32_t kv_size, uint32_t n_ubatch, uint32_t n_pad,
    ggml_type type_r, ggml_type type_s, uint32_t rs_size,
    uint32_t n_seq_max, bool offload, bool unified,
    const layer_filter_cb & filter_attn, const layer_filter_cb & filter_recr);

// Memory interface methods
llama_memory_context_ptr llama_memory_hybrid_iswa::init_batch(
    llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all);
llama_memory_context_ptr llama_memory_hybrid_iswa::init_full();
llama_memory_context_ptr llama_memory_hybrid_iswa::init_update(
    llama_context * lctx, bool optimize);

Import

#include "llama-memory-hybrid-iswa.h"
#include "llama-impl.h"
#include "llama-model.h"
#include "llama-context.h"

I/O Contract

Inputs

Name | Type | Required | Description
model | const llama_model & | Yes | Model with hparams describing layer types
type_k / type_v | ggml_type | Yes | Key/value cache data types for attention layers
type_r / type_s | ggml_type | Yes | Recurrent state data types (r and s tensors)
kv_size | uint32_t | Yes | Size of the KV cache for attention layers
rs_size | uint32_t | Yes | Size of the recurrent state memory
filter_attn | const layer_filter_cb & | No | Filter callback for attention layers (defaults to !is_recurrent)
filter_recr | const layer_filter_cb & | No | Filter callback for recurrent layers (defaults to is_recurrent)

Outputs

Name | Type | Description
init_batch return | llama_memory_context_ptr | Context coordinating both iSWA attention and recurrent memory
get_mem_attn() | llama_kv_cache_iswa * | Access to the iSWA attention cache
get_mem_recr() | llama_memory_recurrent * | Access to the recurrent memory

Usage Examples

#include "llama-memory-hybrid-iswa.h"

// Created internally during model initialization for hybrid iSWA models
auto mem = std::make_unique<llama_memory_hybrid_iswa>(
    model, type_k, type_v, v_trans, swa_full,
    kv_size, n_ubatch, n_pad,
    type_r, type_s, rs_size,
    n_seq_max, offload, unified);

// Batch processing
auto ctx = mem->init_batch(balloc, n_ubatch, embd_all);
while (ctx->next()) {
    ctx->apply();
    // graph building uses ctx->get_attn() and ctx->get_recr()
}

Related Pages
