Implementation:Ggml org Llama cpp KV Cache ISWA
| Knowledge Sources | |
|---|---|
| Domains | KV_Cache, Memory |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements the `llama_kv_cache_iswa` class that manages two separate KV caches for models using Interleaved Sliding Window Attention (ISWA).
Description
The constructor creates two `llama_kv_cache` instances: `kv_base` for non-SWA layers (full context) and `kv_swa` for SWA layers (smaller, window-limited). Layer filter callbacks route each layer to the appropriate cache according to `hparams.is_swa(il)`. The SWA cache size is derived from the sliding window size and padded to a multiple of 256 for performance. The `llama_memory_i` operations (`seq_rm`, `seq_cp`, `seq_keep`, `seq_add`, `seq_div`, `clear`, `state_write`/`state_read`) are forwarded to both caches. The `llama_kv_cache_iswa_context` class coordinates batch processing by tracking slot infos and ubatches for both caches together.
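To make the sizing rule concrete, the following is a minimal sketch of how a window-limited cache size could be derived and padded to a multiple of 256. The helper names `pad_up` and `swa_cache_size` are hypothetical and not part of llama.cpp, and the formula itself (window scaled by the number of sequences for a unified cache, plus one micro-batch of headroom, capped at the base size) is an assumption. Check the source file for the exact expression.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n (n must be non-zero).
static uint32_t pad_up(uint32_t x, uint32_t n) {
    assert(n > 0);
    return ((x + n - 1) / n) * n;
}

// Hypothetical helper: size (in cells) of the SWA cache.
// Assumption: the window is scaled by the number of sequences when the cache
// is unified, one micro-batch of headroom is added, and the result is capped
// at the base size before being padded to a multiple of 256.
static uint32_t swa_cache_size(uint32_t size_base, uint32_t n_swa,
                               uint32_t n_seq_max, uint32_t n_ubatch,
                               bool unified, bool swa_full) {
    if (swa_full) {
        return size_base; // assumed: a full-size SWA cache mirrors the base cache
    }
    const uint32_t n_streams = unified ? n_seq_max : 1;
    const uint32_t wanted    = n_swa * n_streams + n_ubatch;
    return pad_up(std::min(size_base, wanted), 256);
}
```

Under these assumptions, a 32768-cell base cache with a 4096-token window and a 512-token micro-batch would need only 4608 cells for the SWA layers, which is where the memory saving comes from.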
Usage
Use this module for models that alternate between full-attention and sliding-window-attention layers, such as Gemma 2 and Cohere 2. It is selected automatically when the model's hyperparameters indicate an ISWA layer configuration.
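As an illustration of how alternating layers end up in two caches, here is a small sketch that splits layer indices with a predicate and an optional user filter. The `is_swa` pattern below is a hypothetical stand-in for `hparams.is_swa(il)` (real models define their own patterns), and `layer_split` and `split_layers` are illustrative names, not llama.cpp APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

using layer_filter_cb = std::function<bool(int32_t il)>;

// Hypothetical stand-in for hparams.is_swa(il): within each group of
// n_pattern layers, the last one is full attention and the rest use SWA.
static bool is_swa(int32_t il, int32_t n_pattern) {
    assert(n_pattern > 0);
    return il % n_pattern < n_pattern - 1;
}

struct layer_split {
    std::vector<int32_t> base; // layers stored in the full-context cache
    std::vector<int32_t> swa;  // layers stored in the window-limited cache
};

// Route every layer to exactly one of the two lists. A layer rejected by the
// optional user filter is excluded from both.
static layer_split split_layers(int32_t n_layer, int32_t n_pattern,
                                const layer_filter_cb & filter) {
    layer_split out;
    for (int32_t il = 0; il < n_layer; ++il) {
        if (filter && !filter(il)) {
            continue;
        }
        (is_swa(il, n_pattern) ? out.swa : out.base).push_back(il);
    }
    return out;
}
```

Passing `nullptr` as the filter yields an empty `std::function`, so every layer is considered. With six layers and `n_pattern = 2`, the even layers go to the SWA list and the odd layers to the base list.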
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: src/llama-kv-cache-iswa.cpp
- Lines: 1-330
Signature
llama_kv_cache_iswa::llama_kv_cache_iswa(
    const llama_model & model,
    ggml_type type_k, ggml_type type_v,
    bool v_trans, bool offload, bool swa_full, bool unified,
    uint32_t kv_size, uint32_t n_seq_max, uint32_t n_ubatch, uint32_t n_pad,
    const layer_filter_cb & filter, const layer_reuse_cb & reuse);
void llama_kv_cache_iswa::clear(bool data);
bool llama_kv_cache_iswa::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_keep(llama_seq_id seq_id);
void llama_kv_cache_iswa::seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos delta);
void llama_kv_cache_iswa::seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d);
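The delegation behind these member functions can be sketched with two mock caches. Everything here (`mock_cache`, `mock_iswa`, `seq_id_t`, `pos_t`) is a hypothetical stand-in used only to show the forwarding pattern. In particular, combining the two `seq_rm` results with a non-short-circuiting `&` is an assumption about how the results are merged.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

using seq_id_t = int32_t; // stand-in for llama_seq_id
using pos_t    = int32_t; // stand-in for llama_pos

// Hypothetical mock of a single KV cache; it only records calls.
struct mock_cache {
    bool accept_rm = true; // value that seq_rm reports back
    int  n_calls   = 0;    // number of operations received

    bool seq_rm(seq_id_t, pos_t, pos_t) { ++n_calls; return accept_rm; }
    void seq_keep(seq_id_t)             { ++n_calls; }
};

// Wrapper that forwards every operation to both caches.
struct mock_iswa {
    std::unique_ptr<mock_cache> kv_base = std::make_unique<mock_cache>();
    std::unique_ptr<mock_cache> kv_swa  = std::make_unique<mock_cache>();

    bool seq_rm(seq_id_t id, pos_t p0, pos_t p1) {
        assert(kv_base && kv_swa);
        bool res = true;
        // '&' instead of '&&' so the second cache is updated even when the
        // first one reports failure.
        res = res & kv_base->seq_rm(id, p0, p1);
        res = res & kv_swa ->seq_rm(id, p0, p1);
        return res;
    }

    void seq_keep(seq_id_t id) {
        assert(kv_base && kv_swa);
        kv_base->seq_keep(id);
        kv_swa ->seq_keep(id);
    }
};
```

The point of the pattern is that both caches always see the same sequence operations, so their views of which positions belong to which sequence stay aligned.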
Import
#include "llama-kv-cache-iswa.h"
#include "llama-impl.h"
#include "llama-batch.h"
#include "llama-model.h"
I/O Contract
Inputs
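| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model whose hyperparameters define which layers are SWA |
| type_k | ggml_type | Yes | Data type for key tensors |
| type_v | ggml_type | Yes | Data type for value tensors |
| v_trans | bool | Yes | Whether value tensors are stored transposed |
| offload | bool | Yes | Whether the cache buffers are offloaded to the device |
| swa_full | bool | Yes | Whether the SWA cache uses the full base size instead of the window-limited size |
| unified | bool | Yes | Whether all sequences share one unified cache buffer |
| kv_size | uint32_t | Yes | Total KV cache size in cells |
| n_seq_max | uint32_t | Yes | Maximum number of sequences |
| n_ubatch | uint32_t | Yes | Maximum micro-batch size |
| n_pad | uint32_t | Yes | Padding applied to the cache size |
| filter | const layer_filter_cb & | No | Optional callback to filter which layers are included |
| reuse | const layer_reuse_cb & | No | Optional callback selecting layers that reuse the cache memory of another layer |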
Outputs
| Name | Type | Description |
|---|---|---|
| kv_base | std::unique_ptr<llama_kv_cache> | KV cache for full-attention (non-SWA) layers |
| kv_swa | std::unique_ptr<llama_kv_cache> | KV cache for sliding-window-attention layers with reduced size |
Usage Examples
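// Construction for an ISWA model; model, kv_size, n_seq_max, n_ubatch and
// n_pad are assumed to be defined by the caller
auto kv_iswa = std::make_unique<llama_kv_cache_iswa>(
    model, GGML_TYPE_F16, GGML_TYPE_F16,
    /*v_trans=*/true, /*offload=*/true, /*swa_full=*/false, /*unified=*/false,
    kv_size, n_seq_max, n_ubatch, n_pad,
    /*filter=*/nullptr, /*reuse=*/nullptr);
// Each call below is forwarded to both kv_base and kv_swa
kv_iswa->clear(true);            // true: clear the data buffers as well as the metadata
kv_iswa->seq_rm(seq_id, p0, p1); // remove positions [p0, p1) of seq_id
kv_iswa->seq_keep(seq_id);       // drop every sequence except seq_id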