Principle:Ollama Ollama KVCache Encoder Caching
| Knowledge Sources | |
|---|---|
| Domains | KV Cache, Encoder-Decoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Encoder KV caching is the principle of caching the key-value projections computed from the encoder output of an encoder-decoder transformer model. Unlike decoder-only (causal) caching, where the cache grows with each generated token, an encoder cache is computed once during the encoding phase and stays static for the entire decoding process, because the encoder output does not change during generation.
Core Concepts
Encoder-Decoder Architecture
Encoder-decoder models (such as T5, BART, Whisper, and mBART) consist of two distinct transformer stacks: an encoder that processes the input sequence (text, audio, or other modalities) into a sequence of hidden representations, and a decoder that autoregressively generates the output sequence while attending to the encoder's output. The decoder uses cross-attention layers to query the encoder's output, where the keys and values come from the encoder and the queries come from the decoder. This cross-attention mechanism is the primary consumer of the encoder KV cache.
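The data flow described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not any model's real implementation: the helper names (`project_encoder`, `cross_attention`) and the identity projection matrices are assumptions chosen to keep the arithmetic easy to follow. The point is that keys and values are projected from the encoder output once, while each decoder step supplies only a new query.

```python
import math

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project_encoder(enc_out, w_k, w_v):
    """Project encoder hidden states to keys and values (done once)."""
    keys = [matvec(w_k, h) for h in enc_out]
    values = [matvec(w_v, h) for h in enc_out]
    return keys, values

def cross_attention(dec_hidden, enc_keys, enc_values, w_q):
    """One decoder position attends over all cached encoder positions."""
    q = matvec(w_q, dec_hidden)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in enc_keys]
    weights = softmax(scores)
    dim = len(enc_values[0])
    out = [sum(w * v[d] for w, v in zip(weights, enc_values)) for d in range(dim)]
    return out, weights

# Hypothetical toy inputs: three encoder positions, hidden size 2.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
enc_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Encoder side: computed once and reused for every decoder step.
enc_keys, enc_values = project_encoder(enc_out, IDENTITY, IDENTITY)

# Decoder side: each step brings a new query, the keys/values are unchanged.
out_a, weights_a = cross_attention([1.0, 0.0], enc_keys, enc_values, IDENTITY)
out_b, weights_b = cross_attention([0.0, 1.0], enc_keys, enc_values, IDENTITY)
```

Both decoder steps reuse the same `enc_keys` and `enc_values`; only the query differs, which is what makes the encoder side cacheable.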
Static vs. Dynamic Caching
The fundamental distinction between encoder and decoder caching is that encoder caches are static: they are computed once when the input is encoded and remain unchanged for the entire decoding process. Decoder caches are dynamic: they grow by one entry per layer per generation step. This distinction has important implications for memory management. Encoder cache memory can be allocated once, at exactly the size of the encoder output, and is fully utilized immediately. Decoder cache memory is typically reserved up front for the maximum output length and fills gradually as tokens are generated. Because encoder caches are never appended to or partially evicted, they also never require defragmentation or ring-buffer management.
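To make the utilization difference concrete, the following sketch tracks how full each cache is after every decoder step. It assumes a decoder cache pre-allocated to a fixed `max_out_len`; the function name and parameters are illustrative only.

```python
def cache_utilization(enc_len, max_out_len, steps):
    """Return (step, encoder_fill, decoder_fill) for each decoder step.

    Fill values are fractions of the allocated slots that hold data.
    """
    rows = []
    for step in range(1, steps + 1):
        enc_used = enc_len                  # written once, before decoding starts
        dec_used = min(step, max_out_len)   # one new entry per generated token
        rows.append((step, enc_used / enc_len, dec_used / max_out_len))
    return rows

# Hypothetical sizes: 6 encoder positions, room for 4 output tokens.
for step, enc_fill, dec_fill in cache_utilization(enc_len=6, max_out_len=4, steps=4):
    print(f"step {step}: encoder {enc_fill:.0%} full, decoder {dec_fill:.0%} full")
```

The encoder column is constant from the first step, while the decoder column only reaches full occupancy at the last step.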
Cross-Attention Cache Lookup
During each decoder step, cross-attention layers look up keys and values from the encoder cache rather than from the decoder's own previous outputs. This lookup is indexed by encoder position (not decoder position), and every decoder token attends to every encoder position (there is no causal mask on the encoder side). The cross-attention pattern means that the encoder cache is read many times (once per decoder step per cross-attention layer) but written only once, making read performance the primary optimization target.
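The two properties in this paragraph, the absence of a causal mask and the read-heavy access pattern, can be illustrated with a short sketch. The function names are hypothetical, and the masks are plain nested lists of booleans where `True` means "may attend".

```python
def causal_self_attention_mask(dec_len):
    """Decoder self-attention: position i sees positions 0..i only."""
    return [[j <= i for j in range(dec_len)] for i in range(dec_len)]

def cross_attention_mask(dec_len, enc_len):
    """Cross-attention: every decoder position sees every encoder position."""
    return [[True] * enc_len for _ in range(dec_len)]

def cache_accesses(num_cross_layers, num_decoder_steps):
    """Count encoder-cache writes and reads over one generation."""
    return {
        "writes": num_cross_layers,                     # once per layer, at encode time
        "reads": num_cross_layers * num_decoder_steps,  # once per layer per step
    }

self_mask = causal_self_attention_mask(3)
cross_mask = cross_attention_mask(dec_len=3, enc_len=4)
accesses = cache_accesses(num_cross_layers=2, num_decoder_steps=5)
```

Note that the cross-attention mask has one column per encoder position, not per decoder position, which matches how the cache is indexed.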
Multi-Modal Encoder Caching
In multi-modal models (e.g., vision-language models that combine an image encoder with a text decoder), the encoder cache may hold representations from non-textual modalities. An image encoder might produce a fixed-length sequence of visual tokens whose key-value projections are cached and attended to by the text decoder. The cache must accommodate different representation sizes and potentially different precision requirements for different modalities.
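A rough size estimate shows how sequence length and precision drive the memory cost for different modalities. The formula below is the usual count of key and value elements per cross-attention layer; every number in the example (layer count, token counts, head sizes) is a made-up placeholder rather than the configuration of any particular model.

```python
# Bytes per element for a few common tensor precisions.
BYTES_PER_ELEMENT = {"f32": 4, "f16": 2, "bf16": 2}

def encoder_cache_bytes(num_cross_layers, num_encoder_tokens,
                        num_kv_heads, head_dim, dtype):
    """Estimate encoder KV cache size: one K and one V tensor per layer."""
    elements_per_tensor = num_encoder_tokens * num_kv_heads * head_dim
    return 2 * num_cross_layers * elements_per_tensor * BYTES_PER_ELEMENT[dtype]

# Hypothetical image encoder producing 576 visual tokens.
image_f16 = encoder_cache_bytes(8, 576, 8, 64, "f16")
image_f32 = encoder_cache_bytes(8, 576, 8, 64, "f32")

# Hypothetical audio encoder producing 1500 frames.
audio_f16 = encoder_cache_bytes(8, 1500, 8, 64, "f16")

for name, size in [("image f16", image_f16), ("image f32", image_f32),
                   ("audio f16", audio_f16)]:
    print(f"{name}: {size / 2**20:.2f} MiB")
```

Because the size is known as soon as the encoder has run, the whole allocation can be made in one step.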
Implementation Notes
In the Ollama codebase, encoder KV caching is implemented for encoder-decoder models such as Whisper (speech recognition) and T5-family models. The encoder cache allocates memory for the full encoder output during the encoding phase and makes it available to the decoder's cross-attention layers throughout generation. The implementation stores encoder key-value projections in contiguous tensors indexed by layer and position. Since encoder caches are static, the implementation does not include defragmentation or eviction logic, but it does support cleanup when the associated sequence completes. The encoder cache is managed through the same abstract cache interface as causal caches, allowing the inference pipeline to handle encoder-decoder models without special-casing.
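The idea of a shared cache interface can be sketched as follows. This is a Python illustration of the design described above, not Ollama's actual API: the class names, method names, and `seq_id` keying are all hypothetical. It shows a write-once encoder cache and an append-only causal cache behind the same interface, with cleanup when a sequence completes.

```python
from abc import ABC, abstractmethod

class KVCache(ABC):
    """Hypothetical shared interface (illustrative, not Ollama's API)."""

    def __init__(self):
        self._store = {}  # (seq_id, layer) -> (keys, values)

    @abstractmethod
    def put(self, seq_id, layer, keys, values):
        """Store keys and values for one layer of one sequence."""

    def get(self, seq_id, layer):
        """Return the cached (keys, values) pair for a layer."""
        return self._store[(seq_id, layer)]

    def remove(self, seq_id):
        """Release everything held for a finished sequence."""
        for key in [k for k in self._store if k[0] == seq_id]:
            del self._store[key]

class EncoderKVCache(KVCache):
    """Static cache: each layer is written once, then only read."""

    def put(self, seq_id, layer, keys, values):
        if (seq_id, layer) in self._store:
            raise RuntimeError("encoder cache already populated for this layer")
        self._store[(seq_id, layer)] = (list(keys), list(values))

class CausalKVCache(KVCache):
    """Dynamic cache: each call appends entries for newly generated tokens."""

    def put(self, seq_id, layer, keys, values):
        k, v = self._store.setdefault((seq_id, layer), ([], []))
        k.extend(keys)
        v.extend(values)

def release_sequence(caches, seq_id):
    """The pipeline can clean up both cache kinds without special-casing."""
    for cache in caches:
        cache.remove(seq_id)
```

A caller that only depends on `KVCache` can populate, query, and release either cache type through the same three methods.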