Implementation:Ggml org Llama cpp KV Cache ISWA

From Leeroopedia
Knowledge Sources
Domains KV_Cache, Memory
Last Updated 2026-02-15 00:00 GMT

Overview

Implements the `llama_kv_cache_iswa` class that manages two separate KV caches for models using Interleaved Sliding Window Attention (ISWA).

Description

This file creates two `llama_kv_cache` instances: `kv_base` for non-SWA layers (full context) and `kv_swa` for SWA layers (smaller, window-limited). Layer filter callbacks direct each layer to the appropriate cache based on `hparams.is_swa(il)`. The SWA cache size is computed from the sliding window size, padded to 256 for performance. All `llama_memory_i` operations (seq_rm, seq_cp, seq_keep, seq_add, seq_div, clear, state_write/read) are delegated to both caches. The `llama_kv_cache_iswa_context` coordinates batch processing by managing slot infos and ubatches for both caches simultaneously.
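The SWA cache sizing described above can be sketched as follows. This is a minimal illustration, not the actual llama.cpp code: the helper names `pad_to` and `swa_cache_size` are hypothetical, and the exact formula (window size times the number of sequences when `unified` is set, plus one micro-batch, capped at the base size, then rounded up to a multiple of 256) is an assumption based on the description and on the constructor parameters.

```cpp
#include <algorithm>
#include <cstdint>

// Round x up to the next multiple of n (n > 0).
constexpr uint32_t pad_to(uint32_t x, uint32_t n) {
    return ((x + n - 1) / n) * n;
}

// Hypothetical helper: number of cells to allocate for the SWA cache.
// Assumption: the window must hold n_swa positions (per sequence when the
// buffer is unified) plus one micro-batch of new tokens.
constexpr uint32_t swa_cache_size(uint32_t size_base, uint32_t n_swa,
                                  uint32_t n_seq_max, uint32_t n_ubatch,
                                  bool unified, bool swa_full) {
    if (swa_full) {
        // Full-size SWA cache: same size as the base cache.
        return size_base;
    }
    const uint32_t need = n_swa * (unified ? n_seq_max : 1) + n_ubatch;
    // Cap at the base size, then pad to 256 as stated in the description.
    return pad_to(std::min(size_base, need), 256);
}
```

Under these assumptions, a base cache of 8192 cells with a 4096-token window and a 512-token micro-batch gives an SWA cache of 4608 cells, while setting `swa_full` makes both caches 8192 cells.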

Usage

Use this module for models that alternate between full-attention and sliding-window-attention layers, such as Gemma 2 and Cohere 2. It is selected automatically when the model's hyperparameters indicate an ISWA layer configuration.

Code Reference

Source Location

Signature

llama_kv_cache_iswa::llama_kv_cache_iswa(
    const llama_model & model,
    ggml_type type_k, ggml_type type_v,
    bool v_trans, bool offload, bool swa_full, bool unified,
    uint32_t kv_size, uint32_t n_seq_max, uint32_t n_ubatch, uint32_t n_pad,
    const layer_filter_cb & filter, const layer_reuse_cb & reuse);

void llama_kv_cache_iswa::clear(bool data);
bool llama_kv_cache_iswa::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_keep(llama_seq_id seq_id);
void llama_kv_cache_iswa::seq_add(llama_seq_id seq_id, llama_pos p0, llama_pos p1, llama_pos delta);
void llama_kv_cache_iswa::seq_div(llama_seq_id seq_id, llama_pos p0, llama_pos p1, int d);
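The layer routing and the delegation of sequence operations can be illustrated with a self-contained toy model. The types `toy_cache` and `toy_iswa` are hypothetical stand-ins, not the real `llama_kv_cache` classes; the sketch only assumes that each layer goes to exactly one cache according to an `is_swa` predicate and that an operation such as `seq_rm` succeeds only if it succeeds on both caches.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <vector>

// Toy stand-in for a single KV cache (not the real llama_kv_cache).
struct toy_cache {
    struct cell { int32_t seq; int32_t pos; };

    std::vector<uint32_t> layers; // layer indices owned by this cache
    std::vector<cell>     cells;  // occupied cells

    // Remove cells of `seq` with p0 <= pos < p1; a negative p1 means "to the end".
    bool seq_rm(int32_t seq, int32_t p0, int32_t p1) {
        cells.erase(std::remove_if(cells.begin(), cells.end(),
            [&](const cell & c) {
                return c.seq == seq && c.pos >= p0 && (p1 < 0 || c.pos < p1);
            }), cells.end());
        return true;
    }

    void clear() { cells.clear(); }
};

// Toy composite that mirrors the two-cache layout.
struct toy_iswa {
    toy_cache base; // full-attention layers
    toy_cache swa;  // sliding-window layers

    // Route each layer to exactly one cache using the predicate.
    toy_iswa(uint32_t n_layer, const std::function<bool(uint32_t)> & is_swa) {
        for (uint32_t il = 0; il < n_layer; ++il) {
            (is_swa(il) ? swa : base).layers.push_back(il);
        }
    }

    // Apply the operation to both caches and combine the results.
    bool seq_rm(int32_t seq, int32_t p0, int32_t p1) {
        bool res = true;
        res = res & base.seq_rm(seq, p0, p1);
        res = res & swa.seq_rm(seq, p0, p1);
        return res;
    }

    void clear() {
        base.clear();
        swa.clear();
    }
};
```

Constructing `toy_iswa kv(4, [](uint32_t il) { return il % 2 == 0; })` places layers 0 and 2 in `kv.swa` and layers 1 and 3 in `kv.base`. Both results are combined with a non-short-circuiting `&` so that the second cache is always updated even if the first call reports failure.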

Import

#include "llama-kv-cache-iswa.h"
#include "llama-impl.h"
#include "llama-batch.h"
#include "llama-model.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| model | `const llama_model &` | Yes | Model with hyperparameters defining which layers are SWA |
| type_k | `ggml_type` | Yes | Data type for key tensors |
| type_v | `ggml_type` | Yes | Data type for value tensors |
| kv_size | `uint32_t` | Yes | Total KV cache size in cells |
| n_seq_max | `uint32_t` | Yes | Maximum number of sequences |
| filter | `const layer_filter_cb &` | No | Optional callback to filter which layers are included |

Outputs

| Name | Type | Description |
|------|------|-------------|
| kv_base | `std::unique_ptr<llama_kv_cache>` | KV cache for full-attention (non-SWA) layers |
| kv_swa | `std::unique_ptr<llama_kv_cache>` | KV cache for sliding-window-attention layers with reduced size |

Usage Examples

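// Construction with an ISWA model
auto kv_iswa = std::make_unique<llama_kv_cache_iswa>(
    model, GGML_TYPE_F16, GGML_TYPE_F16,   // model, type_k, type_v
    true, true, false, false,              // v_trans, offload, swa_full, unified
    kv_size, n_seq_max, n_ubatch, n_pad,
    nullptr, nullptr);                     // no layer filter, no layer reuse callback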

// All operations delegate to both caches
kv_iswa->clear(true);
kv_iswa->seq_rm(seq_id, p0, p1);
kv_iswa->seq_keep(seq_id);

Related Pages
