Implementation:Ollama Ollama Llama KV Cache ISWA

From Leeroopedia
Knowledge Sources
Domains LLM Inference, Memory Management
Last Updated 2025-02-15 00:00 GMT

Overview

Implements the Interleaved Sliding Window Attention (ISWA) KV cache, which manages two separate KV caches for SWA and non-SWA model layers.

Description

The llama_kv_cache_iswa class owns two llama_kv_cache instances: kv_base for non-SWA (dense attention) layers and kv_swa for SWA layers. Layer filter callbacks route each layer to the appropriate cache. Unless swa_full is set, the SWA cache is sized from the sliding window size plus n_ubatch, rounded up to the padding granularity and capped at the base cache size, so SWA layers hold far fewer cells than the full context whenever the window is shorter than kv_size. The llama_memory_i sequence methods (seq_rm, seq_cp, seq_keep, seq_add, seq_div, clear, etc.) delegate to both underlying caches.
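The routing by layer filter callbacks can be illustrated with a small standalone sketch. It does not use the real llama.cpp types: toy_kv_cache, toy_kv_cache_iswa, and the is_swa predicate are hypothetical stand-ins, and the sketch assumes the two filters are simple complements of each other.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a layer filter callback: returns true when the
// layer with index il should be stored in the cache being constructed.
using toy_layer_filter_cb = std::function<bool(int32_t il)>;

// Toy cache that only records which layer indices it owns.
struct toy_kv_cache {
    std::vector<int32_t> layers;

    toy_kv_cache(int32_t n_layer, const toy_layer_filter_cb & filter) {
        assert(n_layer >= 0);
        for (int32_t il = 0; il < n_layer; ++il) {
            if (filter(il)) {
                layers.push_back(il);
            }
        }
    }
};

// Toy ISWA wrapper: builds one cache per attention kind by passing
// complementary filters derived from a single is_swa predicate (assumption).
struct toy_kv_cache_iswa {
    toy_kv_cache kv_base; // dense (non-SWA) layers
    toy_kv_cache kv_swa;  // sliding window layers

    toy_kv_cache_iswa(int32_t n_layer, const std::function<bool(int32_t)> & is_swa)
        : kv_base(n_layer, [&](int32_t il) { return !is_swa(il); }),
          kv_swa (n_layer, [&](int32_t il) { return  is_swa(il); }) {}
};
```

With six layers and a predicate that marks even layers as SWA, kv_swa would own layers 0, 2, 4 and kv_base would own layers 1, 3, 5. Every layer lands in exactly one of the two caches.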

Usage

Used for models that mix dense and sliding window attention layers, such as Gemma 2 and Mistral. The ISWA cache automatically manages two separate caches so that SWA layers use a smaller cache while dense layers retain full attention history.
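To make the memory saving concrete, the following sketch computes a SWA cache size under an assumed sizing rule: window size plus micro-batch size, rounded up to a multiple of n_pad and capped at kv_size. The helper names pad_to and swa_cache_size and the n_swa parameter are hypothetical, and the sketch assumes a single sequence; the actual rule in the source file may include further factors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n (n must be non-zero).
uint32_t pad_to(uint32_t x, uint32_t n) {
    assert(n > 0);
    return ((x + n - 1) / n) * n;
}

// Assumed sizing rule for the SWA cache (illustrative, single sequence):
//   - swa_full: same size as the base cache
//   - otherwise: window + micro-batch, padded, never larger than the base cache
uint32_t swa_cache_size(uint32_t kv_size, uint32_t n_swa, uint32_t n_ubatch,
                        uint32_t n_pad, bool swa_full) {
    if (swa_full) {
        return kv_size;
    }
    return std::min(kv_size, pad_to(n_swa + n_ubatch, n_pad));
}
```

Under these assumptions, a 32768-cell base cache with a 4096-token window, n_ubatch of 512, and n_pad of 256 gives a 4608-cell SWA cache, roughly one seventh of the base size for each SWA layer. When the window exceeds kv_size, the cap makes both caches the same size.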

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/src/llama-kv-cache-iswa.cpp
  • Lines: 1-328

Signature

llama_kv_cache_iswa::llama_kv_cache_iswa(
        const llama_model & model,
                ggml_type   type_k,
                ggml_type   type_v,
                     bool   v_trans,
                     bool   offload,
                     bool   swa_full,
                     bool   unified,
                 uint32_t   kv_size,
                 uint32_t   n_seq_max,
                 uint32_t   n_ubatch,
                 uint32_t   n_pad,
    const layer_filter_cb & filter,
    const  layer_reuse_cb & reuse);

void llama_kv_cache_iswa::clear(bool data);
bool llama_kv_cache_iswa::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
llama_memory_context_ptr llama_kv_cache_iswa::init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all);

Import

#include "llama-kv-cache-iswa.h"

I/O Contract

Inputs

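Name Type Required Description
model const llama_model & Yes The model whose layers determine SWA vs non-SWA routing
type_k ggml_type Yes Data type for key tensors
type_v ggml_type Yes Data type for value tensors
v_trans bool Yes Whether value tensors are transposed
offload bool Yes Whether KV buffers are offloaded to the backend device
swa_full bool Yes Use full-size SWA cache (same size as base)
unified bool Yes Whether all sequences share one unified KV buffer
kv_size uint32_t Yes Total KV cache size in cells
n_seq_max uint32_t Yes Maximum number of sequences
n_ubatch uint32_t Yes Micro-batch size, added to the window when sizing the SWA cache
n_pad uint32_t Yes Padding granularity for cache sizes
filter const layer_filter_cb & Yes Callback selecting which layers receive a cache
reuse const layer_reuse_cb & Yes Callback mapping a layer to another layer whose cache it reuses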

Outputs

Name Type Description
kv_base llama_kv_cache* The non-SWA (dense) KV cache, returned by get_base()
kv_swa llama_kv_cache* The SWA (sliding window) KV cache, returned by get_swa()

Usage Examples

// ISWA cache is created internally by llama_model::create_memory()
// Access the sub-caches:
llama_kv_cache * base = iswa_cache->get_base();
llama_kv_cache * swa  = iswa_cache->get_swa();

// All sequence operations delegate to both caches
iswa_cache->seq_rm(seq_id, p0, p1);
iswa_cache->clear(true);
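The delegation pattern in the example above can be sketched without the llama.cpp headers. The types toy_cache and toy_iswa are hypothetical, cells are modeled as (sequence id, position) pairs, and the wrapper's seq_rm is assumed to succeed only when both sub-caches succeed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using toy_seq_id = int32_t;
using toy_pos    = int32_t;

// Toy cache: each cell is a (sequence id, position) pair.
struct toy_cache {
    std::vector<std::pair<toy_seq_id, toy_pos>> cells;

    // Remove the cells of seq_id with position in [p0, p1).
    bool seq_rm(toy_seq_id seq_id, toy_pos p0, toy_pos p1) {
        assert(p0 <= p1);
        cells.erase(std::remove_if(cells.begin(), cells.end(),
            [&](const std::pair<toy_seq_id, toy_pos> & c) {
                return c.first == seq_id && c.second >= p0 && c.second < p1;
            }), cells.end());
        return true;
    }

    void clear() { cells.clear(); }
};

// Toy ISWA wrapper: every operation is forwarded to both sub-caches.
struct toy_iswa {
    toy_cache kv_base;
    toy_cache kv_swa;

    bool seq_rm(toy_seq_id seq_id, toy_pos p0, toy_pos p1) {
        bool res = true;
        res = res & kv_base.seq_rm(seq_id, p0, p1);
        res = res & kv_swa .seq_rm(seq_id, p0, p1);
        return res; // true only if both caches accepted the removal
    }

    void clear() {
        kv_base.clear();
        kv_swa .clear();
    }
};
```

Because the SWA cache only keeps recent positions, the same seq_rm call may remove a different number of cells from each sub-cache, which is why the wrapper forwards the call rather than mirroring state.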
