Implementation:Ollama Ollama Llama KV Cache ISWA
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Memory Management |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the Interleaved Sliding Window Attention (ISWA) KV cache, which manages two separate KV caches for SWA and non-SWA model layers.
Description
The llama_kv_cache_iswa class creates two llama_kv_cache instances: one for non-SWA (dense attention) layers and one for SWA layers. Layer filter callbacks route each layer to the appropriate cache. The SWA cache is sized from the sliding window size plus the ubatch size, rounded up to the cache padding, so it usually holds far fewer cells than the full-context base cache. The llama_memory_i interface methods (seq_rm, seq_cp, seq_keep, seq_add, seq_div, clear, and so on) delegate to both underlying caches.
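The layer routing described above can be sketched with two chained filters. This is a minimal illustration, not the code from llama-kv-cache-iswa.cpp: toy_hparams, make_base_filter, make_swa_filter, and select_layers are hypothetical names, and the per-layer SWA flags are assumed to be available as a simple boolean vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the model hyperparameters: one flag per layer.
struct toy_hparams {
    std::vector<bool> swa_layers; // true if layer il uses sliding window attention

    bool is_swa(int32_t il) const {
        assert(il >= 0 && static_cast<std::size_t>(il) < swa_layers.size());
        return swa_layers[static_cast<std::size_t>(il)];
    }
};

// Same shape as the callback type in the constructor signature (assumed).
using layer_filter_cb = std::function<bool(int32_t il)>;

// Filter for the base (dense) cache: the optional user filter must accept
// the layer, and the layer must NOT be an SWA layer.
// Note: hp is captured by reference and must outlive the returned filter.
layer_filter_cb make_base_filter(const toy_hparams & hp, layer_filter_cb user) {
    return [&hp, user](int32_t il) {
        if (user && !user(il)) {
            return false;
        }
        return !hp.is_swa(il);
    };
}

// Filter for the SWA cache: same chaining, but the layer must be SWA.
layer_filter_cb make_swa_filter(const toy_hparams & hp, layer_filter_cb user) {
    return [&hp, user](int32_t il) {
        if (user && !user(il)) {
            return false;
        }
        return hp.is_swa(il);
    };
}

// Collect the indices of all layers that a filter accepts.
std::vector<int32_t> select_layers(int32_t n_layer, const layer_filter_cb & f) {
    std::vector<int32_t> out;
    for (int32_t il = 0; il < n_layer; ++il) {
        if (f(il)) {
            out.push_back(il);
        }
    }
    return out;
}
```

With these two filters every layer accepted by the user filter lands in exactly one of the two caches, which is the property the ISWA design relies on.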
Usage
Used for models that mix dense and sliding window attention layers, such as Gemma 2 and Mistral. The ISWA cache automatically manages two separate caches so that SWA layers use a smaller cache while dense layers retain full attention history.
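To make the memory saving concrete, the following sketch computes the two cache sizes under a simplified rule: the SWA cache holds the window plus one ubatch, rounded up to the padding and capped at the base size, unless swa_full is set. The names pad_up, iswa_sizes, and plan_iswa_sizes are hypothetical, and the sketch deliberately ignores n_seq_max and unified, which the real constructor may also take into account.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of pad.
uint32_t pad_up(uint32_t x, uint32_t pad) {
    assert(pad > 0);
    return (x + pad - 1) / pad * pad;
}

// Sizes (in cells) of the two underlying caches.
struct iswa_sizes {
    uint32_t base; // dense layers: full context
    uint32_t swa;  // SWA layers: roughly one window
};

// Simplified sizing rule (an assumption, see the text above):
//   base = kv_size
//   swa  = min(base, pad_up(n_swa + n_ubatch, n_pad)), or base if swa_full
iswa_sizes plan_iswa_sizes(uint32_t kv_size, uint32_t n_swa, uint32_t n_ubatch,
                           uint32_t n_pad, bool swa_full) {
    const uint32_t size_base = kv_size;
    uint32_t size_swa = std::min(size_base, pad_up(n_swa + n_ubatch, n_pad));
    if (swa_full) {
        size_swa = size_base; // full-size SWA cache: no memory saving
    }
    return {size_base, size_swa};
}
```

For example, with kv_size = 32768, a 4096-position window, n_ubatch = 512, and n_pad = 256, this rule gives an SWA cache of 4608 cells against 32768 cells for the base cache.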
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-kv-cache-iswa.cpp
- Lines: 1-328
Signature
llama_kv_cache_iswa::llama_kv_cache_iswa(
    const llama_model & model,
    ggml_type type_k,
    ggml_type type_v,
    bool v_trans,
    bool offload,
    bool swa_full,
    bool unified,
    uint32_t kv_size,
    uint32_t n_seq_max,
    uint32_t n_ubatch,
    uint32_t n_pad,
    const layer_filter_cb & filter,
    const layer_reuse_cb & reuse);
void llama_kv_cache_iswa::clear(bool data);
bool llama_kv_cache_iswa::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
void llama_kv_cache_iswa::seq_cp(llama_seq_id seq_id_src, llama_seq_id seq_id_dst, llama_pos p0, llama_pos p1);
llama_memory_context_ptr llama_kv_cache_iswa::init_batch(llama_batch_allocr & balloc, uint32_t n_ubatch, bool embd_all);
Import
#include "llama-kv-cache-iswa.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | The model whose layers determine SWA vs non-SWA routing |
| type_k | ggml_type | Yes | Data type for key tensors |
| type_v | ggml_type | Yes | Data type for value tensors |
| v_trans | bool | Yes | Whether value tensors are transposed |
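| offload | bool | Yes | Whether the cache buffers may be offloaded to a device backend |
| swa_full | bool | Yes | Use a full-size SWA cache (same size as the base cache) |
| unified | bool | Yes | Whether all sequences share one unified cache buffer |
| kv_size | uint32_t | Yes | Total KV cache size in cells |
| n_seq_max | uint32_t | Yes | Maximum number of sequences |
| n_ubatch | uint32_t | Yes | Micro-batch size, used when sizing the SWA cache |
| n_pad | uint32_t | Yes | Padding granularity for the cache size |
| filter | const layer_filter_cb & | Yes | Layer filter callback selecting which layers get a cache |
| reuse | const layer_reuse_cb & | Yes | Layer reuse callback |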
Outputs
| Name | Type | Description |
|---|---|---|
| kv_base | llama_kv_cache* | The non-SWA (dense) KV cache |
| kv_swa | llama_kv_cache* | The SWA (sliding window) KV cache |
Usage Examples
// The ISWA cache is created internally by llama_model::create_memory().
// Given a llama_kv_cache_iswa * iswa_cache, access the sub-caches:
llama_kv_cache * base = iswa_cache->get_base();
llama_kv_cache * swa = iswa_cache->get_swa();
// All sequence operations delegate to both caches
iswa_cache->seq_rm(seq_id, p0, p1);
iswa_cache->clear(true);
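
The delegation pattern itself can be tried out in isolation with a toy model. The sketch below is an assumption-laden illustration, not the real classes: toy_cache and toy_iswa are hypothetical types that track stored positions in a std::set, and clear() drops the bool data parameter of the real method.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>

using seq_id_t = int32_t; // hypothetical stand-in for llama_seq_id
using pos_t    = int32_t; // hypothetical stand-in for llama_pos

// Toy cache: remembers which positions are stored for each sequence.
struct toy_cache {
    std::map<seq_id_t, std::set<pos_t>> cells;

    void add(seq_id_t seq, pos_t p) { cells[seq].insert(p); }

    // Remove positions in [p0, p1) for one sequence.
    // A negative p1 means "up to the end of the sequence".
    bool seq_rm(seq_id_t seq, pos_t p0, pos_t p1) {
        auto it = cells.find(seq);
        if (it == cells.end() || (p1 >= 0 && p1 <= p0)) {
            return true; // nothing to remove
        }
        auto & s = it->second;
        auto first = s.lower_bound(p0);
        auto last  = p1 < 0 ? s.end() : s.lower_bound(p1);
        s.erase(first, last);
        return true;
    }

    void clear() { cells.clear(); }

    std::size_t count(seq_id_t seq) const {
        auto it = cells.find(seq);
        return it == cells.end() ? 0 : it->second.size();
    }
};

// Toy ISWA wrapper: every operation is forwarded to both caches.
struct toy_iswa {
    toy_cache base; // dense layers
    toy_cache swa;  // sliding window layers

    bool seq_rm(seq_id_t seq, pos_t p0, pos_t p1) {
        bool res = true;
        // Non-short-circuit '&' so both caches are always updated.
        res = res & base.seq_rm(seq, p0, p1);
        res = res & swa.seq_rm(seq, p0, p1);
        return res;
    }

    void clear() {
        base.clear();
        swa.clear();
    }
};
```

Using the bitwise & rather than && in seq_rm is a deliberate choice in this sketch: the second cache must be updated even if the first call reports failure, otherwise the two caches could drift out of sync.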