Implementation: ggml-org/llama.cpp Memory Hybrid iSWA Header
| Knowledge Sources | |
|---|---|
| Domains | Memory, Hybrid |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares the hybrid memory class combining iSWA (interleaved Sliding Window Attention) cache with recurrent state memory.
Description
`llama_memory_hybrid_iswa` implements `llama_memory_i` by composing `llama_kv_cache_iswa` (for attention layers with SWA support) and `llama_memory_recurrent` (for recurrent layers). The context class `llama_memory_hybrid_iswa_context` tracks separate slot info vectors for base and SWA attention layers alongside the recurrent context, providing `get_attn()` and `get_recr()` accessors for graph builders to access the appropriate memory for each layer.
Usage
Include this header when working with hybrid architectures that combine sliding window attention, full attention, and recurrent mechanisms across different layers. It supports the most complex hybrid memory configurations in llama.cpp.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: src/llama-memory-hybrid-iswa.h
- Lines: 1-140
Signature
```cpp
class llama_memory_hybrid_iswa : public llama_memory_i {
public:
    llama_memory_hybrid_iswa(
            const llama_model & model,
            ggml_type type_k, ggml_type type_v, bool v_trans, bool swa_full,
            uint32_t kv_size, uint32_t n_ubatch, uint32_t n_pad,
            ggml_type type_r, ggml_type type_s, uint32_t rs_size,
            uint32_t n_seq_max, bool offload, bool unified,
            const layer_filter_cb & filter_attn = nullptr,
            const layer_filter_cb & filter_recr = nullptr);

    llama_memory_context_ptr init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) override;
    llama_memory_context_ptr init_full() override;
    llama_memory_context_ptr init_update(llama_context * lctx, bool optimize) override;

    llama_kv_cache_iswa *    get_mem_attn() const;
    llama_memory_recurrent * get_mem_recr() const;
};

class llama_memory_hybrid_iswa_context : public llama_memory_context_i {
public:
    bool next()  override;
    bool apply() override;

    llama_memory_status  get_status() const override;
    const llama_ubatch & get_ubatch() const override;

    const llama_kv_cache_iswa_context *    get_attn() const;
    const llama_memory_recurrent_context * get_recr() const;
};
```
Import
```cpp
#include "llama-memory-hybrid-iswa.h"

// Dependencies:
#include "llama-batch.h"
#include "llama-graph.h"
#include "llama-kv-cache-iswa.h"
#include "llama-memory.h"
#include "llama-memory-recurrent.h"

#include <memory>
#include <vector>
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model with hparams describing layer types (attention vs recurrent) |
| type_k / type_v | ggml_type | Yes | Key/value cache data types for attention layers |
| type_r / type_s | ggml_type | Yes | Recurrent state data types |
| kv_size | uint32_t | Yes | Size of the iSWA KV cache |
| rs_size | uint32_t | Yes | Size of the recurrent state memory |
| filter_attn | const layer_filter_cb & | No | Layer filter for attention (default: !is_recurrent) |
| filter_recr | const layer_filter_cb & | No | Layer filter for recurrent (default: is_recurrent) |
Outputs
| Name | Type | Description |
|---|---|---|
| get_mem_attn() | llama_kv_cache_iswa * | Pointer to the composed iSWA attention cache |
| get_mem_recr() | llama_memory_recurrent * | Pointer to the composed recurrent memory |
| get_attn() | const llama_kv_cache_iswa_context * | Attention context for graph building |
| get_recr() | const llama_memory_recurrent_context * | Recurrent context for graph building |
Usage Examples
```cpp
#include "llama-memory-hybrid-iswa.h"

// Typically created internally by the model initialization code
llama_memory_hybrid_iswa mem(model, type_k, type_v, v_trans, swa_full,
                             kv_size, n_ubatch, n_pad, type_r, type_s, rs_size,
                             n_seq_max, offload, unified);

// Access sub-memories for inspection
auto * attn_cache = mem.get_mem_attn();
auto * recr_mem   = mem.get_mem_recr();
```