Implementation: ggml-org/llama.cpp Memory Hybrid iSWA
| Knowledge Sources | Details |
|---|---|
| Domains | Memory, Hybrid |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements hybrid memory management combining iSWA (interleaved Sliding Window Attention) KV cache with recurrent state memory.
Description
This implementation composes `llama_kv_cache_iswa` (which maintains separate base and SWA caches) and `llama_memory_recurrent`, routing layers via filter callbacks based on `hparams.is_recurrent()`. It follows the same batch splitting and coordination pattern as `llama_memory_hybrid` but uses the iSWA cache variant, which tracks separate slot info vectors for base and SWA layers. The context class coordinates both sub-contexts and combines their status values.
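The status-combining step can be pictured with a small self-contained sketch. The enum values and combine rule below are a simplified stand-in for llama.cpp's `llama_memory_status` handling, not the actual implementation:

```cpp
#include <cassert>

// Simplified stand-in for llama.cpp's memory status values (assumption:
// the real enum distinguishes more failure modes than shown here).
enum class mem_status {
    success,
    no_update, // nothing to do for this sub-memory
    failed,
};

// Sketch of the coordination rule: a failure in either sub-context fails
// the whole hybrid context, and pending work in either keeps it at success.
static mem_status combine_status(mem_status attn, mem_status recr) {
    if (attn == mem_status::failed || recr == mem_status::failed) {
        return mem_status::failed;
    }
    if (attn == mem_status::success || recr == mem_status::success) {
        return mem_status::success;
    }
    return mem_status::no_update;
}
```

In this sketch the hybrid context stays usable only while both the iSWA and recurrent sub-contexts report a non-failing status.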
Usage
This module is used internally by models that combine sliding window attention with recurrent layers (e.g., Cohere2, Gemma2-style models with recurrent components). It provides the most feature-rich hybrid memory configuration in llama.cpp.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: src/llama-memory-hybrid-iswa.cpp
- Lines: 1-275
Signature
```cpp
// Constructor
llama_memory_hybrid_iswa::llama_memory_hybrid_iswa(
        const llama_model & model,
        ggml_type type_k, ggml_type type_v, bool v_trans, bool swa_full,
        uint32_t kv_size, uint32_t n_ubatch, uint32_t n_pad,
        ggml_type type_r, ggml_type type_s, uint32_t rs_size,
        uint32_t n_seq_max, bool offload, bool unified,
        const layer_filter_cb & filter_attn, const layer_filter_cb & filter_recr);

// Memory interface methods
llama_memory_context_ptr llama_memory_hybrid_iswa::init_batch(
        llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all);
llama_memory_context_ptr llama_memory_hybrid_iswa::init_full();
llama_memory_context_ptr llama_memory_hybrid_iswa::init_update(
        llama_context * lctx, bool optimize);
```
Import
```cpp
#include "llama-memory-hybrid-iswa.h"
#include "llama-impl.h"
#include "llama-model.h"
#include "llama-context.h"
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model with hparams describing layer types |
| type_k / type_v | ggml_type | Yes | Key/value cache data types for attention layers |
| type_r / type_s | ggml_type | Yes | Recurrent state data types (r and s tensors) |
| kv_size | uint32_t | Yes | Size of the KV cache for attention layers |
| rs_size | uint32_t | Yes | Size of the recurrent state memory |
| filter_attn | const layer_filter_cb & | No | Filter callback for attention layers (defaults to !is_recurrent) |
| filter_recr | const layer_filter_cb & | No | Filter callback for recurrent layers (defaults to is_recurrent) |
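The default filters in the table above can be illustrated with a minimal mock. The `mock_hparams` type below is hypothetical; in llama.cpp the callbacks consult `hparams.is_recurrent()` on the real model:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical mock of the relevant hparams query; the real struct lives
// in llama.cpp and knows which layers of the model are recurrent.
struct mock_hparams {
    std::vector<bool> recurrent; // per-layer flag
    bool is_recurrent(uint32_t il) const { return recurrent[il]; }
};

using layer_filter_cb = std::function<bool(uint32_t il)>;

// Sketch of the default routing: the iSWA attention cache takes every
// layer that is not recurrent, the recurrent memory takes the rest.
inline layer_filter_cb default_filter_attn(const mock_hparams & hp) {
    return [&hp](uint32_t il) { return !hp.is_recurrent(il); };
}
inline layer_filter_cb default_filter_recr(const mock_hparams & hp) {
    return [&hp](uint32_t il) { return hp.is_recurrent(il); };
}
```

Passing explicit `filter_attn` / `filter_recr` callbacks would override this default partition, which is why the table marks them as optional.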
Outputs
| Name | Type | Description |
|---|---|---|
| init_batch return | llama_memory_context_ptr | Context coordinating both iSWA attention and recurrent memory |
| get_mem_attn() | llama_kv_cache_iswa * | Access to the iSWA attention cache |
| get_mem_recr() | llama_memory_recurrent * | Access to the recurrent memory |
Usage Examples
```cpp
#include "llama-memory-hybrid-iswa.h"

// Created internally during model initialization for hybrid iSWA models
auto mem = std::make_unique<llama_memory_hybrid_iswa>(
        model, type_k, type_v, v_trans, swa_full,
        kv_size, n_ubatch, n_pad,
        type_r, type_s, rs_size,
        n_seq_max, offload, unified);

// Batch processing
auto ctx = mem->init_batch(balloc, n_ubatch, embd_all);
while (ctx->next()) {
    ctx->apply();
    // graph building uses ctx->get_attn() and ctx->get_recr()
}
```
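The `next()` / `apply()` loop above follows the general memory-context contract: `init_batch` pre-splits the batch into ubatches, and each iteration commits one ubatch to both sub-memories. A toy, self-contained mock of that iteration shape (the names and internals here are illustrative, not llama.cpp's actual classes):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative mock of the iteration contract: the batch is pre-split
// into ubatches, next() reports whether one is still pending, and
// apply() commits the current ubatch and advances to the next.
struct mock_memory_context {
    explicit mock_memory_context(std::vector<int> ub)
        : ubatches(std::move(ub)) {}

    bool next() { return i < ubatches.size(); } // a ubatch remains?
    void apply() { ++applied; ++i; }            // commit and advance

    std::vector<int> ubatches; // stand-in for the pre-split ubatches
    std::size_t i = 0;
    int applied = 0;
};
```

In the real hybrid iSWA context, `apply()` would forward to both the iSWA attention sub-context and the recurrent sub-context so the two stay on the same ubatch.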