Implementation:Ggml org Llama cpp KV Cache ISWA Header

From Leeroopedia
Knowledge Sources
Domains KV_Cache, Memory
Last Updated 2026-02-15 00:00 GMT

Overview

Declares the `llama_kv_cache_iswa` and `llama_kv_cache_iswa_context` classes for dual-cache Interleaved Sliding Window Attention (ISWA) management.

Description

`llama_kv_cache_iswa` implements `llama_memory_i` and owns two `unique_ptr<llama_kv_cache>` instances: `kv_base` for the full-attention layers and `kv_swa` for the sliding-window-attention layers. The `get_base()` and `get_swa()` accessors expose them. `llama_kv_cache_iswa_context` implements `llama_memory_context_i` and wraps two `llama_memory_context_ptr` instances, one per cache. It drives batch iteration by forwarding `next()` and `apply()` to both wrapped contexts, and it exposes those contexts for graph building.
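The ownership layout described above can be sketched in isolation. The following is a minimal, self-contained illustration and not the llama.cpp implementation: `toy_kv_cache`, `toy_kv_cache_iswa`, the `is_swa` predicate, and `make_alternating` are hypothetical names, and the "even layers use SWA" pattern is an assumption chosen only to make the routing visible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for llama_kv_cache: it only records which layers it owns.
struct toy_kv_cache {
    using layer_filter = std::function<bool(int32_t il)>;

    toy_kv_cache(int32_t n_layer, const layer_filter & filter) {
        for (int32_t il = 0; il < n_layer; ++il) {
            if (filter(il)) {
                layers.push_back(il);
            }
        }
    }

    std::vector<int32_t> layers;
};

// Hypothetical stand-in for llama_kv_cache_iswa: owns one cache per attention kind.
class toy_kv_cache_iswa {
public:
    // is_swa(il) is an assumed predicate: true if layer il uses sliding window attention
    toy_kv_cache_iswa(int32_t n_layer, const std::function<bool(int32_t)> & is_swa)
        : kv_base(std::make_unique<toy_kv_cache>(n_layer, [&](int32_t il) { return !is_swa(il); })),
          kv_swa (std::make_unique<toy_kv_cache>(n_layer, [&](int32_t il) { return  is_swa(il); })) {
        // every layer must land in exactly one of the two caches
        assert(kv_base->layers.size() + kv_swa->layers.size() == static_cast<std::size_t>(n_layer));
    }

    toy_kv_cache * get_base() const { return kv_base.get(); }
    toy_kv_cache * get_swa()  const { return kv_swa.get();  }

private:
    std::unique_ptr<toy_kv_cache> kv_base; // full attention layers
    std::unique_ptr<toy_kv_cache> kv_swa;  // sliding window attention layers
};

// Illustrative pattern (an assumption): even layers use SWA, odd layers use full attention.
inline toy_kv_cache_iswa make_alternating(int32_t n_layer) {
    return toy_kv_cache_iswa(n_layer, [](int32_t il) { return il % 2 == 0; });
}
```

Splitting the layers with two complementary filters keeps each underlying cache unaware of the other, which is why the wrapper can reuse a single cache class for both roles.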

Usage

Include this header when implementing or working with models that interleave full and sliding window attention layers (e.g., Gemma2, Cohere2). The two caches are presented to the rest of the codebase as a single memory object behind the standard `llama_memory_i` interface.

Code Reference

Source Location

Signature

class llama_kv_cache_iswa : public llama_memory_i {
public:
    llama_kv_cache_iswa(
        const llama_model & model, ggml_type type_k, ggml_type type_v,
        bool v_trans, bool offload, bool swa_full, bool unified,
        uint32_t kv_size, uint32_t n_seq_max, uint32_t n_ubatch,
        uint32_t n_pad, const layer_filter_cb & filter, const layer_reuse_cb & reuse);

    llama_memory_context_ptr init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all) override;
    llama_memory_context_ptr init_full() override;
    llama_memory_context_ptr init_update(llama_context * lctx, bool optimize) override;

    llama_kv_cache * get_base() const;
    llama_kv_cache * get_swa() const;
};

class llama_kv_cache_iswa_context : public llama_memory_context_i {
public:
    bool next() override;
    bool apply() override;
    llama_memory_status get_status() const override;
    const llama_ubatch & get_ubatch() const override;

    const llama_kv_cache_context * get_base() const;
    const llama_kv_cache_context * get_swa() const;
};
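The forwarding behaviour of the context class can be sketched without the llama.cpp headers. This is a simplified illustration under stated assumptions: `toy_context_i`, `toy_cache_context`, `toy_iswa_context`, and `run_all` are hypothetical names, and the sketch assumes that a context starts positioned on its first micro-batch, with `next()` advancing and returning false once none remain.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>

// Hypothetical minimal interface, loosely modelled on the next()/apply() pair above.
struct toy_context_i {
    virtual ~toy_context_i() = default;
    virtual bool next()  = 0; // advance to the next micro-batch; false when none remain
    virtual bool apply() = 0; // commit the current micro-batch to the cache; false on failure
};

// Toy per-cache context: counts how many micro-batches were applied.
struct toy_cache_context : toy_context_i {
    explicit toy_cache_context(std::size_t n) : n_ubatches(n) {}

    bool next()  override { return ++i_cur < n_ubatches; }
    bool apply() override { ++n_applied; return true; }

    std::size_t n_ubatches;
    std::size_t i_cur     = 0;
    std::size_t n_applied = 0;
};

// Composite context: forwards every call to both sub-contexts so they stay in lockstep.
class toy_iswa_context : public toy_context_i {
public:
    toy_iswa_context(std::unique_ptr<toy_cache_context> base, std::unique_ptr<toy_cache_context> swa)
        : ctx_base(std::move(base)), ctx_swa(std::move(swa)) {
        assert(ctx_base && ctx_swa);
    }

    bool next() override {
        // call both unconditionally (no short-circuit) so neither context falls behind
        const bool more_base = ctx_base->next();
        const bool more_swa  = ctx_swa->next();
        return more_base && more_swa;
    }

    bool apply() override {
        const bool ok_base = ctx_base->apply();
        const bool ok_swa  = ctx_swa->apply();
        return ok_base && ok_swa;
    }

    const toy_cache_context * get_base() const { return ctx_base.get(); }
    const toy_cache_context * get_swa()  const { return ctx_swa.get();  }

private:
    std::unique_ptr<toy_cache_context> ctx_base;
    std::unique_ptr<toy_cache_context> ctx_swa;
};

// Drives a composite context over n micro-batches (n >= 1) and returns
// how many were applied to the base cache and to the SWA cache.
inline std::pair<std::size_t, std::size_t> run_all(std::size_t n) {
    toy_iswa_context ctx(std::make_unique<toy_cache_context>(n),
                         std::make_unique<toy_cache_context>(n));
    do {
        ctx.apply();
    } while (ctx.next());
    return { ctx.get_base()->n_applied, ctx.get_swa()->n_applied };
}
```

The driver uses a do/while loop because, under the assumption above, the first micro-batch is already current when the context is created. Calling `next()` before the first `apply()` would skip it.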

Import

#include "llama-kv-cache-iswa.h"
// Dependencies:
#include "llama-kv-cache.h"
#include <vector>

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| model | const llama_model & | Yes | Model reference for layer configuration |
| type_k | ggml_type | Yes | Data type for key cache tensors |
| type_v | ggml_type | Yes | Data type for value cache tensors |
| v_trans | bool | Yes | Whether to store the value cache transposed |
| swa_full | bool | Yes | Whether to make the SWA cache as large as the base cache |
| unified | bool | Yes | Whether all sequences share one unified KV buffer instead of one stream per sequence |
| kv_size | uint32_t | Yes | Size of the KV cache, in cells |
| filter | const layer_filter_cb & | Yes | Callback that selects which layers a cache holds |
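How `kv_size`, `swa_full`, `unified`, `n_seq_max`, `n_ubatch`, and `n_pad` might interact when sizing the SWA cache can be illustrated with a small helper. This is a sketch of an assumed sizing rule, not a statement of what llama.cpp does: `swa_cache_size`, `pad_up`, and the `n_swa` window-size parameter are hypothetical, and the exact formula in the library may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n (n > 0).
inline uint32_t pad_up(uint32_t x, uint32_t n) {
    assert(n > 0);
    return (x + n - 1) / n * n;
}

// Assumed sizing rule, for illustration only:
//  - with swa_full, the SWA cache is as large as the base cache;
//  - otherwise it holds one window per sequence sharing the buffer plus one
//    micro-batch, padded to n_pad and never larger than the base cache.
// n_swa (the sliding window length) is a hypothetical parameter; in practice
// it would come from the model configuration.
inline uint32_t swa_cache_size(uint32_t kv_size, uint32_t n_swa, uint32_t n_seq_max,
                               uint32_t n_ubatch, uint32_t n_pad,
                               bool swa_full, bool unified) {
    if (swa_full) {
        return kv_size;
    }
    const uint32_t n_seq_shared = unified ? n_seq_max : 1;
    return std::min(kv_size, pad_up(n_swa * n_seq_shared + n_ubatch, n_pad));
}
```

Under this assumed rule, a model with a short window needs far fewer SWA cells than base cells, which is the memory saving that motivates keeping two separate caches.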

Outputs

| Name | Type | Description |
|------|------|-------------|
| get_base() | llama_kv_cache * | Pointer to the full-attention KV cache |
| get_swa() | llama_kv_cache * | Pointer to the sliding window attention KV cache |
| init_batch return | llama_memory_context_ptr | Context for batch processing with dual caches |

Usage Examples

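#include "llama-kv-cache-iswa.h"

// iswa_cache is assumed to point to an existing llama_kv_cache_iswa

// Access the dual caches
llama_kv_cache * base = iswa_cache->get_base();
llama_kv_cache * swa  = iswa_cache->get_swa();

// Initialize batch processing
auto ctx = iswa_cache->init_batch(balloc, n_ubatch, embd_all);
if (ctx && ctx->get_status() == LLAMA_MEMORY_STATUS_SUCCESS) {
    // init_batch returns the generic context; cast to reach the per-cache accessors
    const auto * ictx = static_cast<const llama_kv_cache_iswa_context *>(ctx.get());

    // the context starts on the first ubatch, so process it before calling next()
    do {
        if (!ctx->apply()) {
            break; // applying the ubatch to the caches failed
        }
        const auto & ubatch = ctx->get_ubatch();
        // build compute graph using ictx->get_base() and ictx->get_swa()
    } while (ctx->next());
}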
