Implementation:Ggml org Llama cpp KV Cache ISWA

Knowledge Sources	Ggml_org_Llama_cpp
Domains	KV_Cache, Memory
Last Updated	2026-02-15 00:00 GMT

Overview

Implements the `llama_kv_cache_iswa` class that manages two separate KV caches for models using Interleaved Sliding Window Attention (ISWA).

Description

This file creates two `llama_kv_cache` instances: `kv_base` for non-SWA layers (full context) and `kv_swa` for SWA layers (smaller, window-limited). Layer filter callbacks assign each layer to one of the two caches according to `hparams.is_swa(il)`. The size of the SWA cache is derived from the sliding-window size and rounded up to a multiple of 256 for performance. Every `llama_memory_i` operation (`seq_rm`, `seq_cp`, `seq_keep`, `seq_add`, `seq_div`, `clear`, `state_write`/`state_read`) is forwarded to both caches. The companion `llama_kv_cache_iswa_context` class coordinates batch processing by tracking slot infos and ubatches for both caches at once.
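
The layer routing and the SWA sizing described above can be illustrated with a small standalone sketch. It does not use the real llama.cpp types: `toy_hparams`, `make_layer_filter`, `pad_up` and `swa_cache_size` are hypothetical names, and the sizing rule (one window plus one micro-batch, capped at the base size, then padded to 256) is an assumption made for illustration rather than a statement of the exact formula in the source file.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the model hyperparameters; only the per-layer SWA flag is modeled.
struct toy_hparams {
    std::vector<bool> swa_layers;
    bool is_swa(int32_t il) const { return swa_layers[static_cast<std::size_t>(il)]; }
};

using layer_filter_cb = std::function<bool(int32_t)>;

// Chain an optional user filter with the SWA test.
// want_swa == false selects layers for the base cache, true selects layers for the SWA cache.
// hp is captured by reference and must outlive the returned callback.
static layer_filter_cb make_layer_filter(const toy_hparams & hp, layer_filter_cb user, bool want_swa) {
    return [&hp, user = std::move(user), want_swa](int32_t il) {
        if (user && !user(il)) {
            return false; // the user filter excluded this layer from both caches
        }
        return hp.is_swa(il) == want_swa;
    };
}

// Round x up to the next multiple of n (n > 0).
static uint32_t pad_up(uint32_t x, uint32_t n) {
    return (x + n - 1) / n * n;
}

// Assumed sizing rule: window + one micro-batch, capped at the base size, padded to a multiple of 256.
static uint32_t swa_cache_size(uint32_t size_base, uint32_t n_swa, uint32_t n_ubatch) {
    return pad_up(std::min(size_base, n_swa + n_ubatch), 256);
}
```

Under these assumptions, a base cache of 8192 cells with a 1000-token window and a 512-token micro-batch gives an SWA cache of 1536 cells, which shows why the second cache can be much smaller than the first.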

Usage

Use this module for models that alternate between full-attention and sliding-window-attention layers, such as Gemma 2 and Cohere 2. It is selected automatically when the model's hyperparameters indicate an ISWA layer configuration.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-kv-cache-iswa.cpp
Lines: 1-330

Signature

llama_kv_cache_iswa::llama_kv_cache_iswa(
    const llama_model & model,
    ggml_type type_k, ggml_type type_v,
    bool v_trans, bool offload, bool swa_full, bool unified,
    uint32_t kv_size, uint32_t n_seq_max, uint32_t n_ubatch, uint32_t n_pad,
    const layer_filter_cb & filter, const layer_reuse_cb & reuse);

void llama_kv_cache_iswa::clear(bool data);
bool llama_kv_cache_iswa::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_keep(llama_seq_id seq_id);
void llama_kv_cache_iswa::seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos delta);
void llama_kv_cache_iswa::seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d);
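
The sequence operations listed above all follow the same delegation pattern: the wrapper forwards the call to both underlying caches. The sketch below shows that pattern with toy types. `toy_cache` and `toy_iswa` are hypothetical and are not the real `llama_kv_cache` classes. Treating a negative `p0` or `p1` as an open-ended bound mirrors the convention documented for the public sequence API, but its use here is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

using seq_id_t = int32_t; // stand-in for llama_seq_id
using pos_t    = int32_t; // stand-in for llama_pos

// Hypothetical minimal cache: a flat list of (sequence, position) cells.
struct toy_cache {
    struct cell { seq_id_t seq; pos_t pos; };
    std::vector<cell> cells;

    // Remove the cells of `seq` with positions in [p0, p1); negative bounds are open-ended.
    bool seq_rm(seq_id_t seq, pos_t p0, pos_t p1) {
        if (p0 < 0) { p0 = 0; }
        if (p1 < 0) { p1 = std::numeric_limits<pos_t>::max(); }
        cells.erase(std::remove_if(cells.begin(), cells.end(),
                        [&](const cell & c) { return c.seq == seq && c.pos >= p0 && c.pos < p1; }),
                    cells.end());
        return true;
    }

    void clear() { cells.clear(); }
};

// Wrapper that owns two caches and forwards every operation to both.
struct toy_iswa {
    std::unique_ptr<toy_cache> kv_base = std::make_unique<toy_cache>();
    std::unique_ptr<toy_cache> kv_swa  = std::make_unique<toy_cache>();

    bool seq_rm(seq_id_t seq, pos_t p0, pos_t p1) {
        bool ok = true;
        // Use &= rather than && so the second cache is updated even if the first call fails.
        ok &= kv_base->seq_rm(seq, p0, p1);
        ok &= kv_swa ->seq_rm(seq, p0, p1);
        return ok;
    }

    void clear() {
        kv_base->clear();
        kv_swa ->clear();
    }
};
```

Combining the two results without short-circuiting keeps both caches consistent with each other even when one of them reports a failure.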

Import

#include "llama-kv-cache-iswa.h"
#include "llama-impl.h"
#include "llama-batch.h"
#include "llama-model.h"

I/O Contract

Inputs

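Name	Type	Required	Description
model	const llama_model &	Yes	Model whose hyperparameters define which layers are SWA
type_k	ggml_type	Yes	Data type for key tensors
type_v	ggml_type	Yes	Data type for value tensors
v_trans	bool	Yes	Whether the value tensors are stored transposed
offload	bool	Yes	Whether the cache buffers are offloaded to the device backend
swa_full	bool	Yes	Whether the SWA cache is allocated at the full base size instead of the reduced window size
unified	bool	Yes	Whether a single unified buffer is shared across sequences
kv_size	uint32_t	Yes	Total KV cache size in cells
n_seq_max	uint32_t	Yes	Maximum number of sequences
n_ubatch	uint32_t	Yes	Micro-batch size
n_pad	uint32_t	Yes	Padding applied to the cache size
filter	const layer_filter_cb &	No	Optional callback to filter which layers are included
reuse	const layer_reuse_cb &	No	Optional callback that lets a layer reuse the cache of another layer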

Outputs

Name	Type	Description
kv_base	std::unique_ptr<llama_kv_cache>	KV cache for full-attention (non-SWA) layers
kv_swa	std::unique_ptr<llama_kv_cache>	KV cache for sliding-window-attention layers with reduced size

Usage Examples

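// Construction for an ISWA model
// (model, kv_size, n_seq_max, n_ubatch and n_pad are supplied by the caller)
auto kv_iswa = std::make_unique<llama_kv_cache_iswa>(
    model, GGML_TYPE_F16, GGML_TYPE_F16,
    /*v_trans=*/true, /*offload=*/true, /*swa_full=*/false, /*unified=*/false,
    kv_size, n_seq_max, n_ubatch, n_pad,
    /*filter=*/nullptr, /*reuse=*/nullptr);

// Every operation is forwarded to both kv_base and kv_swa
kv_iswa->clear(/*data=*/true);
kv_iswa->seq_rm(seq_id, p0, p1);   // returns bool
kv_iswa->seq_keep(seq_id);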

Related Pages

Principle:Ggml_org_Llama_cpp_KVCacheManagement
