Principle:Ollama Ollama KVCache Encoder Caching

From Leeroopedia
Knowledge Sources
Domains KV Cache, Encoder-Decoder
Last Updated 2025-02-15 00:00 GMT

Overview

Encoder KV Caching is the principle of caching the key-value projections produced from the encoder output of encoder-decoder transformer models. Unlike decoder-only (causal) caching, where the cache grows with each generated token, encoder caches are computed once during the encoding phase and remain static throughout decoding, because the encoder output does not change during generation.

Core Concepts

Encoder-Decoder Architecture

Encoder-decoder models (such as T5, BART, Whisper, and mBART) consist of two distinct transformer stacks: an encoder that processes the input sequence (text, audio, or other modalities) into a sequence of hidden representations, and a decoder that autoregressively generates the output sequence while attending to the encoder's output. The decoder uses cross-attention layers to query the encoder's output, where the keys and values come from the encoder and the queries come from the decoder. This cross-attention mechanism is the primary consumer of the encoder KV cache.
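The split between decoder-side queries and encoder-side keys and values can be made concrete with a minimal single-head sketch in plain Python. The helper names (`project_encoder_kv`, `cross_attend`) and the identity projection matrices are assumptions chosen for illustration, not part of any particular model or library.

```python
import math


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project_encoder_kv(encoder_states, w_k, w_v):
    """Run once per input: project every encoder hidden state to a key and a value."""
    keys = [matvec(w_k, h) for h in encoder_states]
    values = [matvec(w_v, h) for h in encoder_states]
    return keys, values


def cross_attend(decoder_state, w_q, keys, values):
    """Run once per decoder step: the query comes from the decoder, K/V from the encoder."""
    query = matvec(w_q, decoder_state)
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)  # one weight per encoder position, no causal mask
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Hypothetical toy data: three encoder positions, hidden size 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys, values = project_encoder_kv(encoder_states, identity, identity)
context, weights = cross_attend([0.5, -0.5], identity, keys, values)
```

Only `cross_attend` runs inside the generation loop; `keys` and `values` are the quantities an encoder KV cache would hold between steps.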

Static vs. Dynamic Caching

The fundamental distinction between encoder and decoder caching is that encoder caches are static: they are computed once when the input is encoded and remain unchanged for the entire decoding process. Decoder caches are dynamic: they grow by one entry per layer per generation step. This distinction has important implications for memory management. Encoder cache memory can be allocated once, at a size fixed by the encoder sequence length, and is fully utilized immediately, whereas decoder cache memory is typically reserved up to the maximum output length and fills gradually. Because encoder caches never change size or evict entries, they do not require defragmentation or ring-buffer management.
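A rough way to see the difference is to count live cache bytes as decoding proceeds. The sketch below assumes the common accounting of one key and one value vector per layer, position, and KV head; the model shape constants are hypothetical and chosen only to make the arithmetic concrete.

```python
def kv_cache_bytes(layers, positions, kv_heads, head_dim, bytes_per_element=2):
    """Bytes for keys plus values across all layers (the leading 2 covers K and V)."""
    return 2 * layers * positions * kv_heads * head_dim * bytes_per_element


def bytes_in_use(step, encoder_len, layers, kv_heads, head_dim):
    """Live cache bytes after `step` decoder tokens have been generated."""
    encoder = kv_cache_bytes(layers, encoder_len, kv_heads, head_dim)  # constant
    decoder = kv_cache_bytes(layers, step, kv_heads, head_dim)         # grows with step
    return encoder, decoder


# Hypothetical model shape (not taken from any specific model).
LAYERS, KV_HEADS, HEAD_DIM, ENCODER_LEN = 6, 8, 64, 1500

usage = [
    bytes_in_use(step, ENCODER_LEN, LAYERS, KV_HEADS, HEAD_DIM)
    for step in (0, 1, 100)
]
for step, (encoder, decoder) in zip((0, 1, 100), usage):
    print(f"step {step:>3}: encoder={encoder} bytes, decoder={decoder} bytes")
```

Under these assumptions the encoder figure is identical at every step, while the decoder figure increases by a fixed amount per generated token.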

Cross-Attention Cache Lookup

During each decoder step, cross-attention layers look up keys and values from the encoder cache rather than from the decoder's own previous outputs. This lookup is indexed by encoder position (not decoder position), and every decoder token attends to every encoder position (there is no causal mask on the encoder side). The cross-attention pattern means that the encoder cache is read many times (once per decoder step per cross-attention layer) but written only once, making read performance the primary optimization target.
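The write-once, read-many access pattern can be illustrated with a small store that counts its own reads and writes. `EncoderKVStore` is a hypothetical class written for this example, and the key and value rows are placeholder numbers standing in for real projections.

```python
class EncoderKVStore:
    """Holds per-layer encoder keys/values; written once, then read on every decoder step."""

    def __init__(self, num_layers):
        self._keys = [None] * num_layers
        self._values = [None] * num_layers
        self.writes = 0
        self.reads = 0

    def put(self, layer, keys, values):
        # A second write to the same layer would violate the static-cache assumption.
        if self._keys[layer] is not None:
            raise RuntimeError(f"layer {layer} is already populated")
        self._keys[layer] = keys
        self._values[layer] = values
        self.writes += 1

    def get(self, layer):
        if self._keys[layer] is None:
            raise KeyError(f"layer {layer} has not been encoded yet")
        self.reads += 1
        return self._keys[layer], self._values[layer]


NUM_LAYERS, ENCODER_LEN, DECODER_STEPS = 2, 4, 5
store = EncoderKVStore(NUM_LAYERS)

# Encoding phase: one write per cross-attention layer.
for layer in range(NUM_LAYERS):
    keys = [[float(layer), float(pos)] for pos in range(ENCODER_LEN)]
    values = [[float(pos), float(layer)] for pos in range(ENCODER_LEN)]
    store.put(layer, keys, values)

# Decoding phase: every step reads every layer and sees all encoder positions.
positions_seen = []
for step in range(DECODER_STEPS):
    for layer in range(NUM_LAYERS):
        keys, values = store.get(layer)
        positions_seen.append(len(keys))

print(store.writes, store.reads)
```

The number of rows returned does not depend on the decoder step, which reflects the fact that the lookup is indexed by encoder position and is not causally masked.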

Multi-Modal Encoder Caching

In multi-modal models (e.g., vision-language models that combine an image encoder with a text decoder), the encoder cache may hold representations from non-textual modalities. An image encoder might produce a fixed-length sequence of visual tokens whose key-value projections are cached and attended to by the text decoder. The cache must accommodate different representation sizes and potentially different precision requirements for different modalities.
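As a rough illustration of how representation size and precision vary by modality, the sketch below describes each encoder cache with a small spec object. It assumes a ViT-style image encoder that splits a square image into non-overlapping square patches; `EncoderCacheSpec`, `vit_patch_count`, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderCacheSpec:
    """Describes one modality's encoder cache: how many positions, at what precision."""
    modality: str
    num_positions: int
    bytes_per_element: int

    def nbytes(self, layers, kv_heads, head_dim):
        # The leading 2 accounts for storing both keys and values.
        return (2 * layers * self.num_positions * kv_heads * head_dim
                * self.bytes_per_element)


def vit_patch_count(image_size, patch_size):
    """Patch tokens for a square image cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


# Hypothetical specs: a 16-bit image cache and a 32-bit audio cache.
image_spec = EncoderCacheSpec("image", vit_patch_count(224, 14), bytes_per_element=2)
audio_spec = EncoderCacheSpec("audio", 1500, bytes_per_element=4)

for spec in (image_spec, audio_spec):
    print(spec.modality, spec.num_positions, spec.nbytes(layers=4, kv_heads=8, head_dim=64))
```

Keeping the position count and element size in the spec, instead of hard-coding them, is one way a cache could serve several modalities with the same allocation logic.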

Implementation Notes

In the Ollama codebase, encoder KV caching is implemented for encoder-decoder models such as Whisper (speech recognition) and T5-family models. The encoder cache allocates memory for the full encoder output during the encoding phase and makes it available to the decoder's cross-attention layers throughout generation. The implementation stores encoder key-value projections in contiguous tensors indexed by layer and position. Since encoder caches are static, the implementation does not include defragmentation or eviction logic, but it does support cleanup when the associated sequence completes. The encoder cache is managed through the same abstract cache interface as causal caches, allowing the inference pipeline to handle encoder-decoder models without special-casing.
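The idea of a shared cache interface can be sketched as follows. This is an illustrative Python model only: the names `KVCache`, `EncoderCache`, `CausalCache`, and `finish_sequence` are assumptions made for this example and are not Ollama's actual Go types or method signatures.

```python
from abc import ABC, abstractmethod


class KVCache(ABC):
    """Hypothetical interface shared by encoder and causal caches."""

    @abstractmethod
    def put(self, seq_id, layer, keys, values):
        """Store key/value rows for one layer of one sequence."""

    @abstractmethod
    def get(self, seq_id, layer):
        """Return (keys, values) for one layer of one sequence."""

    @abstractmethod
    def remove(self, seq_id):
        """Release everything held for a finished sequence."""


class _DictBackedCache(KVCache):
    """Shared storage and cleanup; subclasses differ only in how they write."""

    def __init__(self):
        self._entries = {}

    def get(self, seq_id, layer):
        return self._entries[(seq_id, layer)]

    def remove(self, seq_id):
        for key in [k for k in self._entries if k[0] == seq_id]:
            del self._entries[key]


class EncoderCache(_DictBackedCache):
    def put(self, seq_id, layer, keys, values):
        # Static: each (sequence, layer) entry is written exactly once.
        if (seq_id, layer) in self._entries:
            raise RuntimeError("encoder entries are static once written")
        self._entries[(seq_id, layer)] = (list(keys), list(values))


class CausalCache(_DictBackedCache):
    def put(self, seq_id, layer, keys, values):
        # Dynamic: each call appends the rows for newly generated tokens.
        old_keys, old_values = self._entries.get((seq_id, layer), ([], []))
        self._entries[(seq_id, layer)] = (old_keys + list(keys),
                                          old_values + list(values))


def finish_sequence(caches, seq_id):
    """Pipeline-side cleanup that does not need to know which cache type it holds."""
    for cache in caches:
        cache.remove(seq_id)


encoder_cache, causal_cache = EncoderCache(), CausalCache()
encoder_cache.put(seq_id=0, layer=0, keys=[[1.0], [2.0]], values=[[3.0], [4.0]])
causal_cache.put(seq_id=0, layer=0, keys=[[5.0]], values=[[6.0]])
causal_cache.put(seq_id=0, layer=0, keys=[[7.0]], values=[[8.0]])
```

Calling `finish_sequence([encoder_cache, causal_cache], 0)` releases both caches through the same method, which is the kind of uniform handling a common interface makes possible.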
