Implementation:Ollama Ollama Llama KV Cache ISWA

Knowledge Sources	Ollama
Domains	LLM Inference, Memory Management
Last Updated	2025-02-15 00:00 GMT

Overview

Implements the Interleaved Sliding Window Attention (ISWA) KV cache, which manages two separate KV caches for SWA and non-SWA model layers.

Description

The llama_kv_cache_iswa class owns two llama_kv_cache instances: kv_base for non-SWA (dense attention) layers and kv_swa for SWA layers. Layer filter callbacks decide which of the two caches each layer is routed to. The SWA cache is sized from the sliding window size plus room for one ubatch, rounded up to the padding, so at long context lengths it can be much smaller than the base cache; when swa_full is set it is instead given the same size as the base cache. All llama_memory_i interface methods (seq_rm, seq_cp, seq_keep, seq_add, seq_div, clear, etc.) delegate to both underlying caches.
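
The routing step described above can be sketched in isolation. This is a minimal illustration, not the code from llama-kv-cache-iswa.cpp: layer_filter, layer_split, and split_layers are hypothetical names, and the "even layers are SWA" pattern used in the note below is an assumption made only for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical alias; stands in for the layer filter callback type.
using layer_filter = std::function<bool(int32_t)>;

// Layer indices assigned to each of the two caches.
struct layer_split {
    std::vector<int32_t> base; // dense (non-SWA) layers
    std::vector<int32_t> swa;  // sliding-window layers
};

// Route every layer to exactly one cache. An optional user filter can
// exclude layers from both caches before the SWA check is applied.
static layer_split split_layers(int32_t n_layer,
                                const layer_filter & is_swa,
                                const layer_filter & user_filter = nullptr) {
    assert(is_swa && "an SWA predicate is required");

    layer_split out;
    for (int32_t il = 0; il < n_layer; ++il) {
        if (user_filter && !user_filter(il)) {
            continue; // layer is not cached at all
        }
        (is_swa(il) ? out.swa : out.base).push_back(il);
    }
    return out;
}
```

Calling split_layers(6, [](int32_t il) { return il % 2 == 0; }) puts layers 0, 2, 4 in the SWA list and layers 1, 3, 5 in the base list. Each list corresponds to the set of layers one underlying cache would allocate tensors for.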

Usage

Used for models that interleave dense and sliding window attention layers, such as Gemma 2 and Gemma 3. The ISWA cache manages the two underlying caches automatically, so SWA layers use a smaller cache while dense layers retain the full attention history.
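
To give a feel for the size difference, the sketch below computes a cell count for the SWA cache. The formula is an assumption modeled on the description (window size plus one ubatch, rounded up to n_pad, never larger than kv_size); the exact expression in the source file may differ between versions. pad_to and swa_cache_cells are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n.
static uint32_t pad_to(uint32_t x, uint32_t n) {
    assert(n > 0 && "padding must be positive");
    return ((x + n - 1) / n) * n;
}

// Assumed sizing rule for the SWA cache, in cells.
//   kv_size   : size of the base (dense) cache
//   n_swa     : sliding window size of the model
//   n_seq_max : maximum number of sequences
//   unified   : true if all sequences share one buffer
static uint32_t swa_cache_cells(uint32_t kv_size, uint32_t n_swa,
                                uint32_t n_seq_max, uint32_t n_ubatch,
                                uint32_t n_pad, bool unified, bool swa_full) {
    if (swa_full) {
        return kv_size; // full-size SWA cache: same as the base cache
    }
    // Assumption: a shared buffer must hold one window per sequence.
    const uint32_t n_windows = unified ? n_seq_max : 1;
    return std::min(kv_size, pad_to(n_swa * n_windows + n_ubatch, n_pad));
}
```

Under these assumptions, kv_size = 32768, n_swa = 4096, n_seq_max = 1, n_ubatch = 512, and n_pad = 256 give an SWA cache of 4608 cells, roughly 14% of the base cache. With swa_full set, the function returns 32768.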

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/src/llama-kv-cache-iswa.cpp
Lines: 1-328

Signature

llama_kv_cache_iswa::llama_kv_cache_iswa(
        const llama_model & model,
                ggml_type   type_k,
                ggml_type   type_v,
                     bool   v_trans,
                     bool   offload,
                     bool   swa_full,
                     bool   unified,
                 uint32_t   kv_size,
                 uint32_t   n_seq_max,
                 uint32_t   n_ubatch,
                 uint32_t   n_pad,
    const layer_filter_cb & filter,
    const  layer_reuse_cb & reuse);

void llama_kv_cache_iswa::clear(bool data);
bool llama_kv_cache_iswa::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
llama_memory_context_ptr llama_kv_cache_iswa::init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all);

Import

#include "llama-kv-cache-iswa.h"

I/O Contract

Inputs

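Name	Type	Required	Description
model	const llama_model &	Yes	The model whose layers determine SWA vs non-SWA routing
type_k	ggml_type	Yes	Data type for key tensors
type_v	ggml_type	Yes	Data type for value tensors
v_trans	bool	Yes	Whether value tensors are transposed
offload	bool	Yes	Whether cache buffers are placed on the device assigned to each layer
swa_full	bool	Yes	Use full-size SWA cache (same size as base)
unified	bool	Yes	Whether all sequences share a single KV buffer
kv_size	uint32_t	Yes	Size of the base (non-SWA) KV cache in cells
n_seq_max	uint32_t	Yes	Maximum number of sequences
n_ubatch	uint32_t	Yes	Micro-batch size, used when sizing the SWA cache
n_pad	uint32_t	Yes	Padding multiple for cache sizes
filter	const layer_filter_cb &	Yes	Callback restricting which layers are cached (may be empty)
reuse	const layer_reuse_cb &	Yes	Callback letting a layer reuse another layer's cache (may be empty)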

Outputs

Name	Type	Description
kv_base	llama_kv_cache*	The non-SWA (dense) KV cache
kv_swa	llama_kv_cache*	The SWA (sliding window) KV cache

Usage Examples

// The ISWA cache is created internally by llama_model::create_memory()
// Access the sub-caches:
llama_kv_cache * base = iswa_cache->get_base();
llama_kv_cache * swa  = iswa_cache->get_swa();

// All sequence operations delegate to both caches
iswa_cache->seq_rm(seq_id, p0, p1);
iswa_cache->clear(true);
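
The delegation pattern behind these calls can be modeled with a small stand-in. toy_cache and toy_iswa_cache below are hypothetical types written for illustration only; they mimic the shape of seq_rm and clear but none of the real tensor handling. Treating a negative bound as "open" is an assumption about the range convention.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical stand-in for llama_kv_cache: a flat list of (seq_id, pos) cells.
struct toy_cache {
    std::vector<std::pair<int32_t, int32_t>> cells;

    void add(int32_t seq_id, int32_t pos) {
        assert(pos >= 0 && "positions are non-negative in this toy model");
        cells.emplace_back(seq_id, pos);
    }

    // Remove the cells of seq_id with pos in [p0, p1).
    // Assumption: a negative bound means "unbounded" on that side.
    bool seq_rm(int32_t seq_id, int32_t p0, int32_t p1) {
        if (p0 < 0) { p0 = 0; }
        if (p1 < 0) { p1 = std::numeric_limits<int32_t>::max(); }
        cells.erase(std::remove_if(cells.begin(), cells.end(),
            [&](const std::pair<int32_t, int32_t> & c) {
                return c.first == seq_id && c.second >= p0 && c.second < p1;
            }), cells.end());
        return true;
    }

    void clear() { cells.clear(); }
};

// Hypothetical stand-in for llama_kv_cache_iswa: every call goes to both caches.
struct toy_iswa_cache {
    toy_cache base; // dense layers
    toy_cache swa;  // sliding-window layers

    // Always call both caches; report success only if both succeed.
    bool seq_rm(int32_t seq_id, int32_t p0, int32_t p1) {
        bool ok = true;
        ok = base.seq_rm(seq_id, p0, p1) && ok;
        ok = swa.seq_rm(seq_id, p0, p1) && ok;
        return ok;
    }

    void clear() {
        base.clear();
        swa.clear();
    }
};
```

The point of the sketch is that both caches are always updated, even if the first call fails, so the two never disagree about which sequences and positions exist. Since seq_rm returns a bool, callers should check the result rather than discard it.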

Related Pages

Principle:Ollama_Ollama_LLM_Memory_Architecture
