Implementation:Ggml org Llama cpp KV Cache ISWA Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	KV_Cache, Memory
Last Updated	2026-02-15 00:00 GMT

Overview

Declares the `llama_kv_cache_iswa` and `llama_kv_cache_iswa_context` classes for dual-cache Interleaved Sliding Window Attention (ISWA) management.

Description

`llama_kv_cache_iswa` implements `llama_memory_i` and owns two `unique_ptr<llama_kv_cache>` instances: `kv_base` for full-attention layers and `kv_swa` for sliding-window attention layers. The `get_base()` and `get_swa()` accessors return raw pointers to them. `llama_kv_cache_iswa_context` wraps two `llama_memory_context_ptr` instances (one per cache), manages micro-batch iteration by forwarding `next()` and `apply()` to both, and exposes the underlying contexts for graph building.
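The forwarding behaviour described above can be illustrated with a small, self-contained sketch. It does not use the real llama.cpp types: `mock_cache_context`, `mock_iswa_context`, and `run_all` are hypothetical stand-ins, and the per-cache context is assumed to start on its first micro-batch and to report exhaustion through `next()`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for a per-cache memory context: it walks a fixed
// number of micro-batches and counts how many were applied.
struct mock_cache_context {
    explicit mock_cache_context(std::size_t n_ubatches) : n_ubatches(n_ubatches) {
        assert(n_ubatches > 0);
    }

    // Advance to the next micro-batch; returns false when none remain.
    bool next() { return ++i_cur < n_ubatches; }

    // Commit the current micro-batch to this cache.
    bool apply() { ++n_applied; return true; }

    std::size_t n_ubatches;
    std::size_t i_cur     = 0;
    std::size_t n_applied = 0;
};

// Hypothetical composite context: owns one context per cache and keeps the
// two in lockstep by forwarding every call to both.
class mock_iswa_context {
public:
    explicit mock_iswa_context(std::size_t n_ubatches)
        : ctx_base(std::make_unique<mock_cache_context>(n_ubatches)),
          ctx_swa (std::make_unique<mock_cache_context>(n_ubatches)) {}

    bool next() {
        // Advance both contexts unconditionally so they never drift apart.
        const bool more_base = ctx_base->next();
        const bool more_swa  = ctx_swa->next();
        return more_base && more_swa;
    }

    bool apply() {
        // Apply to both caches; report failure if either one fails.
        const bool ok_base = ctx_base->apply();
        const bool ok_swa  = ctx_swa->apply();
        return ok_base && ok_swa;
    }

    const mock_cache_context * get_base() const { return ctx_base.get(); }
    const mock_cache_context * get_swa()  const { return ctx_swa.get();  }

private:
    std::unique_ptr<mock_cache_context> ctx_base;
    std::unique_ptr<mock_cache_context> ctx_swa;
};

// Drive the composite over every micro-batch; returns how many were applied.
inline std::size_t run_all(mock_iswa_context & ctx) {
    std::size_t n = 0;
    do {
        if (!ctx.apply()) {
            break;
        }
        ++n;
    } while (ctx.next());
    return n;
}
```

Calling `run_all` on a `mock_iswa_context` built with three micro-batches applies each of them to both mock caches. The real class may track its position differently; the point of the sketch is only that callers see one context while two caches stay synchronized.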

Usage

Include this header when implementing or working with models that interleave full and sliding-window attention layers (e.g., Gemma2, Cohere2). The rest of the codebase can then treat the pair of caches as a single memory object through the standard `llama_memory_i` interface.
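How interleaved layers might be divided between the two caches can be sketched without the real model types. The snippet below is an illustration under assumptions: `mock_hparams`, `layer_filter`, `layer_split`, and `split_layers` are hypothetical names, and the rule shown (apply the caller's filter first, then route each remaining layer by its attention type) is one plausible reading of how the `filter` callback in the constructor interacts with the per-layer configuration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical callback type: returns true if layer `il` should be cached.
using layer_filter = std::function<bool(int32_t)>;

// Hypothetical per-layer flags: true marks a sliding-window attention layer.
struct mock_hparams {
    std::vector<bool> swa_layers;

    bool is_swa(int32_t il) const {
        assert(il >= 0 && static_cast<std::size_t>(il) < swa_layers.size());
        return swa_layers[static_cast<std::size_t>(il)];
    }
};

// Layer indices assigned to each of the two caches.
struct layer_split {
    std::vector<int32_t> base; // full-attention layers
    std::vector<int32_t> swa;  // sliding-window attention layers
};

// Partition the layers: skip those rejected by the (optional) user filter,
// then route the rest by attention type.
inline layer_split split_layers(const mock_hparams & hp, const layer_filter & user_filter) {
    layer_split out;
    const int32_t n_layer = static_cast<int32_t>(hp.swa_layers.size());
    for (int32_t il = 0; il < n_layer; ++il) {
        if (user_filter && !user_filter(il)) {
            continue;
        }
        (hp.is_swa(il) ? out.swa : out.base).push_back(il);
    }
    return out;
}
```

With an empty filter every layer lands in exactly one of the two lists, which is the property a dual-cache design relies on: no layer is stored twice and none is silently dropped.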

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-kv-cache-iswa.h
Lines: 1-137

Signature

class llama_kv_cache_iswa : public llama_memory_i {
public:
    llama_kv_cache_iswa(
        const llama_model & model, ggml_type type_k, ggml_type type_v,
        bool v_trans, bool offload, bool swa_full, bool unified,
        uint32_t kv_size, uint32_t n_seq_max, uint32_t n_ubatch,
        uint32_t n_pad, const layer_filter_cb & filter, const layer_reuse_cb & reuse);

    llama_memory_context_ptr init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) override;
    llama_memory_context_ptr init_full() override;
    llama_memory_context_ptr init_update(llama_context * lctx, bool optimize) override;

    llama_kv_cache * get_base() const;
    llama_kv_cache * get_swa() const;
};

class llama_kv_cache_iswa_context : public llama_memory_context_i {
public:
    bool next() override;
    bool apply() override;
    llama_memory_status get_status() const override;
    const llama_ubatch & get_ubatch() const override;

    const llama_kv_cache_context * get_base() const;
    const llama_kv_cache_context * get_swa() const;
};

Import

#include "llama-kv-cache-iswa.h"
// Dependencies:
#include "llama-kv-cache.h"
#include <vector>

I/O Contract

Inputs

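Name	Type	Required	Description
model	const llama_model &	Yes	Model reference for layer configuration
type_k	ggml_type	Yes	Data type for key cache tensors
type_v	ggml_type	Yes	Data type for value cache tensors
v_trans	bool	Yes	Whether to store the value cache transposed
offload	bool	Yes	Whether to offload cache buffers to the device
swa_full	bool	Yes	Whether to give the SWA cache the same size as the base cache instead of a window-sized one
unified	bool	Yes	Whether to use a single unified buffer shared across sequences
kv_size	uint32_t	Yes	Size of the KV cache
n_seq_max	uint32_t	Yes	Maximum number of sequences
n_ubatch	uint32_t	Yes	Micro-batch size
n_pad	uint32_t	Yes	Padding granularity for cache sizes
filter	const layer_filter_cb &	Yes	Callback that selects which layers are cached
reuse	const layer_reuse_cb &	Yes	Callback that lets a layer reuse the cache of another layer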
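The size-related inputs interact, and a small sketch may help make that concrete. The formula below is an assumption, not a quotation of the implementation: it supposes the SWA cache needs room for one attention window per sequence sharing the buffer plus one micro-batch, rounded up to the padding granularity and never larger than the base cache. `n_swa` (the window length) is a hypothetical extra parameter that does not appear in the constructor, `unified` is taken to mean one buffer shared across sequences, and `pad_up` and `swa_cache_size` are illustrative names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n (n must be positive).
inline uint32_t pad_up(uint32_t x, uint32_t n) {
    assert(n > 0);
    return ((x + n - 1) / n) * n;
}

// Assumed sizing rule for the sliding-window cache (see the caveats above).
inline uint32_t swa_cache_size(
        uint32_t kv_size,   // size of the base cache
        uint32_t n_swa,     // hypothetical: attention window length
        uint32_t n_seq_max,
        uint32_t n_ubatch,
        uint32_t n_pad,
        bool     swa_full,
        bool     unified) {
    if (swa_full) {
        // Full-size mode: the SWA cache mirrors the base cache.
        return kv_size;
    }
    // With a unified buffer, every sequence needs its own window in it.
    const uint32_t n_seqs = unified ? n_seq_max : 1;
    const uint32_t wanted = pad_up(n_swa * n_seqs + n_ubatch, n_pad);
    return std::min(kv_size, wanted);
}
```

Under this rule a short window yields a much smaller SWA cache than the base cache, which is the memory saving the dual-cache design aims for, while `swa_full` trades that saving away for a cache that keeps the whole context.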

Outputs

Name	Type	Description
get_base()	llama_kv_cache *	Pointer to the full-attention KV cache
get_swa()	llama_kv_cache *	Pointer to the sliding window attention KV cache
init_batch return	llama_memory_context_ptr	Context for batch processing with dual caches

Usage Examples

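#include "llama-kv-cache-iswa.h"

// Access the dual caches (iswa_cache is a llama_kv_cache_iswa *)
llama_kv_cache * base = iswa_cache->get_base();
llama_kv_cache * swa  = iswa_cache->get_swa();

// Initialize batch processing and check that the batch fits in both caches
llama_memory_context_ptr mctx = iswa_cache->init_batch(balloc, n_ubatch, embd_all);
if (!mctx || mctx->get_status() != LLAMA_MEMORY_STATUS_SUCCESS) {
    // handle the failure before touching the context
}

// get_base()/get_swa() are declared on the ISWA context, not on the generic interface
const auto * ictx = static_cast<const llama_kv_cache_iswa_context *>(mctx.get());

// The context starts on the first ubatch, so process it before calling next()
do {
    mctx->apply();
    const auto & ubatch = mctx->get_ubatch();
    // build compute graph using ictx->get_base() and ictx->get_swa()
} while (mctx->next());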

Related Pages

Principle:Ggml_org_Llama_cpp_KVCache
