Implementation:Ggml org Llama cpp KV Cache ISWA Header
| Knowledge Sources | |
|---|---|
| Domains | KV_Cache, Memory |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares the `llama_kv_cache_iswa` and `llama_kv_cache_iswa_context` classes for dual-cache Interleaved Sliding Window Attention (ISWA) management.
Description
`llama_kv_cache_iswa` implements `llama_memory_i` and owns two `unique_ptr<llama_kv_cache>` instances: `kv_base` for full-attention layers and `kv_swa` for sliding-window-attention layers. The `get_base()` and `get_swa()` accessors expose them. `llama_kv_cache_iswa_context` wraps two `llama_memory_context_ptr` instances, one per cache. It manages batch iteration by forwarding `next()` and `apply()` calls to both, and it exposes the underlying contexts for graph building.
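The ownership and routing idea can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: `toy_kv_cache`, `toy_kv_cache_iswa`, and the `is_swa` predicate are stand-ins invented for illustration and are not the llama.cpp types. The sketch only assumes that each layer is assigned to exactly one of the two caches.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a single KV cache; not the llama.cpp class.
struct toy_kv_cache {
    std::vector<int32_t> layers; // indices of the layers served by this cache
};

// Hypothetical stand-in for the dual-cache wrapper.
class toy_kv_cache_iswa {
public:
    // is_swa(il) is an assumed predicate: true if layer il uses sliding window attention.
    toy_kv_cache_iswa(int32_t n_layer, const std::function<bool(int32_t)> & is_swa)
        : kv_base(std::make_unique<toy_kv_cache>()),
          kv_swa (std::make_unique<toy_kv_cache>()) {
        // Route every layer to exactly one of the two caches.
        for (int32_t il = 0; il < n_layer; ++il) {
            (is_swa(il) ? kv_swa : kv_base)->layers.push_back(il);
        }
    }

    toy_kv_cache * get_base() const { return kv_base.get(); }
    toy_kv_cache * get_swa()  const { return kv_swa.get();  }

private:
    std::unique_ptr<toy_kv_cache> kv_base; // full-attention layers
    std::unique_ptr<toy_kv_cache> kv_swa;  // sliding-window layers
};
```

For example, constructing `toy_kv_cache_iswa cache(6, [](int32_t il) { return il % 2 == 0; })` places the even layers in the sliding-window cache and the odd layers in the base cache.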
Usage
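Include this header when implementing or working with models that interleave full-attention and sliding-window-attention layers (e.g., Gemma2, Cohere2). Because `llama_kv_cache_iswa` implements `llama_memory_i`, the rest of the codebase drives both caches through the standard memory interface and only needs the `get_base()`/`get_swa()` accessors where the two caches must be distinguished, such as graph building.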
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-kv-cache-iswa.h
- Lines: 1-137
Signature
class llama_kv_cache_iswa : public llama_memory_i {
public:
    llama_kv_cache_iswa(
        const llama_model & model, ggml_type type_k, ggml_type type_v,
        bool v_trans, bool offload, bool swa_full, bool unified,
        uint32_t kv_size, uint32_t n_seq_max, uint32_t n_ubatch,
        uint32_t n_pad, const layer_filter_cb & filter, const layer_reuse_cb & reuse);

    llama_memory_context_ptr init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) override;
    llama_memory_context_ptr init_full() override;
    llama_memory_context_ptr init_update(llama_context * lctx, bool optimize) override;

    llama_kv_cache * get_base() const;
    llama_kv_cache * get_swa() const;
};

class llama_kv_cache_iswa_context : public llama_memory_context_i {
public:
    bool next() override;
    bool apply() override;
    llama_memory_status get_status() const override;
    const llama_ubatch & get_ubatch() const override;

    const llama_kv_cache_context * get_base() const;
    const llama_kv_cache_context * get_swa() const;
};
Import
#include "llama-kv-cache-iswa.h"
// Dependencies:
#include "llama-kv-cache.h"
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model reference for layer configuration |
| type_k | ggml_type | Yes | Data type for key cache tensors |
| type_v | ggml_type | Yes | Data type for value cache tensors |
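| v_trans | bool | Yes | Whether to transpose value cache |
| offload | bool | Yes | Whether to offload the cache tensors to the device backend |
| swa_full | bool | Yes | Whether to use full-size SWA cache |
| unified | bool | Yes | Whether all sequences share one unified KV buffer instead of separate per-sequence streams |
| kv_size | uint32_t | Yes | Size of the KV cache, in cells |
| n_seq_max | uint32_t | Yes | Maximum number of sequences |
| n_ubatch | uint32_t | Yes | Micro-batch size |
| n_pad | uint32_t | Yes | Padding granularity for cache sizes |
| filter | const layer_filter_cb & | Yes | Callback to filter which layers use which cache |
| reuse | const layer_reuse_cb & | Yes | Callback selecting layers that reuse another layer's cache |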
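To make the role of `swa_full`, `kv_size`, `n_ubatch`, and `n_pad` more concrete, here is one plausible way a sliding-window cache size could be derived from them. This is an illustrative assumption, not the rule used by the header: the helper names `pad_up` and `toy_swa_size` are hypothetical, and so is the window length parameter `n_swa`, which does not appear in the constructor signature above. The idea is that a sliding-window layer only needs to keep roughly one window of past tokens plus the micro-batch currently being processed, unless a full-size cache is requested.

```cpp
#include <algorithm>
#include <cstdint>

// Round x up to the next multiple of pad (pad must be > 0).
static uint32_t pad_up(uint32_t x, uint32_t pad) {
    return (x + pad - 1) / pad * pad;
}

// Hypothetical sizing rule (an assumption for illustration only):
//  - swa_full:  the SWA cache is as large as the base cache
//  - otherwise: one attention window plus one micro-batch, padded,
//               and never larger than the base cache
static uint32_t toy_swa_size(uint32_t kv_size, uint32_t n_swa, uint32_t n_ubatch,
                             uint32_t n_pad, bool swa_full) {
    if (swa_full) {
        return kv_size;
    }
    return std::min(kv_size, pad_up(n_swa + n_ubatch, n_pad));
}
```

Under this assumed rule, a base cache of 8192 cells with a 1024-token window, a 512-token micro-batch, and 256-cell padding would need only 1536 cells for the sliding-window layers, which is the memory saving that motivates keeping two caches.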
Outputs
| Name | Type | Description |
|---|---|---|
| get_base() | llama_kv_cache * | Pointer to the full-attention KV cache |
| get_swa() | llama_kv_cache * | Pointer to the sliding window attention KV cache |
| init_batch return | llama_memory_context_ptr | Context for batch processing with dual caches |
Usage Examples
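#include "llama-kv-cache-iswa.h"

// Access the dual caches
llama_kv_cache * base = iswa_cache->get_base();
llama_kv_cache * swa  = iswa_cache->get_swa();

// Initialize batch processing
auto ctx = iswa_cache->init_batch(balloc, n_ubatch, embd_all);
if (ctx->get_status() == LLAMA_MEMORY_STATUS_SUCCESS) {
    // init_batch() returns the generic memory context; downcast to reach get_base()/get_swa()
    const auto * ictx = static_cast<const llama_kv_cache_iswa_context *>(ctx.get());
    // The context starts on the first ubatch, so apply before advancing with next()
    do {
        ctx->apply();
        const auto & ubatch = ctx->get_ubatch();
        // build compute graph using ictx->get_base() and ictx->get_swa()
    } while (ctx->next());
}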
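The forwarding behaviour of the dual context can be sketched in isolation. The types below (`toy_ctx`, `toy_iswa_ctx`) and the helper `toy_process_all` are hypothetical stand-ins, not the llama.cpp classes. The sketch assumes that each sub-context starts positioned on its first micro-batch and that `next()` returns false once no micro-batch remains, so the driver applies before it advances.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-cache context: walks a fixed list of micro-batch sizes.
struct toy_ctx {
    std::vector<int> ubatches;   // tokens per micro-batch
    std::size_t      i_cur = 0;  // index of the current micro-batch
    int              n_applied = 0;

    bool next()  { return ++i_cur < ubatches.size(); } // false when exhausted
    bool apply() { ++n_applied; return true; }
};

// Hypothetical dual context: forwards every call to both sub-contexts.
struct toy_iswa_ctx {
    toy_ctx base;
    toy_ctx swa;

    bool next() {
        // Advance both unconditionally so they stay in lockstep.
        const bool more_base = base.next();
        const bool more_swa  = swa.next();
        return more_base && more_swa;
    }
    bool apply() {
        bool ok = base.apply();
        ok = swa.apply() && ok;
        return ok;
    }
    int get_ubatch() const { return base.ubatches[base.i_cur]; }
};

// Process every micro-batch; returns the total number of tokens seen.
inline int toy_process_all(toy_iswa_ctx & ctx) {
    int n_tokens = 0;
    do {
        ctx.apply();                   // position both caches for this micro-batch
        n_tokens += ctx.get_ubatch();  // a real driver would build the graph here
    } while (ctx.next());
    return n_tokens;
}
```

Advancing both sub-contexts before combining the results keeps them on the same micro-batch even when one of them reports that it is exhausted.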