Principle:Ollama Ollama LLM Memory Architecture
| Knowledge Sources | |
|---|---|
| Domains | Memory Architecture, State Management |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
LLM Memory Architecture is the principle of managing the runtime state required by language model inference, encompassing both attention-based KV caches and recurrent hidden states. Different model architectures require fundamentally different memory patterns, and the memory architecture must abstract over these differences while optimizing for each pattern's unique access characteristics.
Core Concepts
Attention-Based Memory
Standard transformer models store their inference state as key-value (KV) caches in the attention layers. This memory grows linearly with sequence length: each new token adds one key vector and one value vector per layer per attention head. For a model with 32 layers, 32 heads, and 128-dimensional heads, each token adds 32 * 32 * 128 * 2 * 2 bytes = 512 KiB to the cache, where the first factor of 2 counts the key and value vectors and the second is the 2 bytes per FP16 element. A 4096-token context thus requires 2 GiB of KV cache memory per sequence. This linear growth is the primary memory bottleneck for long-context inference.
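The arithmetic above can be checked with a short calculation. This is a minimal sketch: the function names are illustrative and not taken from any codebase, and it assumes plain multi-head attention with one KV head per query head.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes added to the KV cache by one token.

    The factor of 2 counts one key vector plus one value vector;
    bytes_per_elem defaults to 2 for FP16.
    """
    return n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem


def kv_bytes_total(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Total KV cache size for one sequence of n_tokens tokens."""
    return n_tokens * kv_bytes_per_token(
        n_layers, n_kv_heads, head_dim, bytes_per_elem
    )


# The example model from the text: 32 layers, 32 heads, 128-dim heads, FP16.
per_token = kv_bytes_per_token(32, 32, 128)
total = kv_bytes_total(4096, 32, 32, 128)

print(per_token // 1024, "KiB per token")
print(total / 2**30, "GiB for a 4096-token context")
```

Because the total is a simple product, doubling the context length doubles the cache size for each sequence.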
Recurrent State Memory
Recurrent models (Mamba, RWKV, Griffin, xLSTM) maintain fixed-size hidden states that are updated at each timestep rather than appending to a growing cache. The state size is independent of sequence length, determined only by the model's architecture (state dimension, number of layers). This constant memory footprint enables these models to handle arbitrarily long sequences without increasing memory usage. However, recurrent states are typically not shareable between sequences and must be maintained independently for each active inference request.
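The constant footprint can be illustrated with a toy recurrent state. This sketch assumes a simple exponential-decay update chosen only for illustration; real architectures such as Mamba or RWKV use more elaborate update rules, and the class and method names here are hypothetical.

```python
class RecurrentState:
    """Toy fixed-size hidden state: one vector per layer, updated in place."""

    def __init__(self, n_layers, state_dim):
        self.layers = [[0.0] * state_dim for _ in range(n_layers)]

    def update(self, layer, x, decay=0.9):
        # Illustrative rule: h_t = decay * h_{t-1} + (1 - decay) * x_t.
        # The state is overwritten, never appended to.
        h = self.layers[layer]
        for i in range(len(h)):
            h[i] = decay * h[i] + (1.0 - decay) * x[i]

    def num_elements(self):
        return sum(len(h) for h in self.layers)


state = RecurrentState(n_layers=4, state_dim=8)
size_before = state.num_elements()

# Process 1000 tokens; every token touches every layer's state.
for t in range(1000):
    token_features = [float(t % 7)] * 8
    for layer in range(4):
        state.update(layer, token_features)

size_after = state.num_elements()
print(size_before, size_after)  # same value: the state does not grow
```

Each active request would hold its own `RecurrentState` instance, which matches the point above that these states are maintained independently per sequence.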
Hybrid Memory Management
Hybrid models combine attention layers with recurrent layers, requiring the memory architecture to manage both growing KV caches and fixed-size recurrent states simultaneously. The memory manager must allocate and track both types of state per sequence, coordinate their lifecycles (both must be created when a sequence starts and freed when it ends), and handle the different performance characteristics of each type. Hybrid memory is more complex than either pure attention or pure recurrent memory because the two state types have fundamentally different growth patterns, access patterns, and optimization strategies.
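A minimal sketch of such a manager follows. The class, its methods, and the layer layout are hypothetical and chosen for illustration. The sketch stores one opaque vector per write rather than separate key and value tensors, to keep the two growth patterns easy to compare.

```python
class HybridMemory:
    """Toy per-sequence store for a model mixing attention and recurrent layers."""

    def __init__(self, layer_kinds, state_dim):
        self.layer_kinds = layer_kinds  # e.g. ["attention", "recurrent", ...]
        self.state_dim = state_dim
        self.kv = {}     # seq_id -> {layer index -> growing list of entries}
        self.state = {}  # seq_id -> {layer index -> fixed-size vector}

    def start_sequence(self, seq_id):
        # Both kinds of state are created together when a sequence starts.
        kinds = list(enumerate(self.layer_kinds))
        self.kv[seq_id] = {i: [] for i, k in kinds if k == "attention"}
        self.state[seq_id] = {
            i: [0.0] * self.state_dim for i, k in kinds if k == "recurrent"
        }

    def write(self, seq_id, layer, vector):
        # Route by layer kind: attention appends, recurrent overwrites.
        if self.layer_kinds[layer] == "attention":
            self.kv[seq_id][layer].append(list(vector))
        else:
            self.state[seq_id][layer] = list(vector)

    def footprint(self, seq_id):
        kv_entries = sum(len(v) for v in self.kv[seq_id].values())
        state_elems = sum(len(v) for v in self.state[seq_id].values())
        return kv_entries, state_elems

    def end_sequence(self, seq_id):
        # Both kinds of state are freed together when a sequence ends.
        del self.kv[seq_id]
        del self.state[seq_id]


mem = HybridMemory(["attention", "recurrent", "recurrent", "attention"], state_dim=4)
mem.start_sequence("a")
for t in range(3):
    for layer in range(4):
        mem.write("a", layer, [float(t)] * 4)

fp = mem.footprint("a")  # KV entries grow with tokens; state size stays fixed
print(fp)
mem.end_sequence("a")
```

After three tokens the two attention layers hold six entries in total, while the two recurrent layers still hold eight state elements regardless of how many tokens were processed.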
Memory Lifecycle
The memory lifecycle encompasses allocation, population, usage, and deallocation of inference state. Allocation occurs when a new inference request begins, reserving space for KV cache entries and/or recurrent state. Population occurs during the prefill phase when the prompt is processed, filling the cache with initial state. Usage occurs during the decode phase, with each generation step reading existing state and writing new state. Deallocation occurs when the request completes or is evicted. In multi-tenant serving environments, the memory lifecycle must support concurrent sequences with independent lifecycles sharing a common memory pool.
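The four phases can be sketched against a shared pool that is budgeted in token slots. This is an illustrative model with hypothetical names, not a description of any particular server: a production pool would manage real buffers and an eviction policy.

```python
class CachePool:
    """Toy shared pool that tracks how many token slots each sequence holds."""

    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.used = {}  # seq_id -> number of token slots currently held

    def free_slots(self):
        return self.capacity - sum(self.used.values())

    def allocate(self, seq_id):
        # Allocation: register a new request with an empty cache.
        if seq_id in self.used:
            raise ValueError(f"sequence {seq_id!r} already allocated")
        self.used[seq_id] = 0

    def append(self, seq_id, n_tokens):
        # Population (prefill) and usage (decode) both add entries.
        if n_tokens > self.free_slots():
            raise MemoryError("pool exhausted")
        self.used[seq_id] += n_tokens

    def release(self, seq_id):
        # Deallocation: return every slot held by this sequence to the pool.
        return self.used.pop(seq_id)


pool = CachePool(capacity_tokens=128)

pool.allocate("a")
pool.append("a", 100)      # prefill: the whole prompt at once
for _ in range(5):
    pool.append("a", 1)    # decode: one token per generation step

pool.allocate("b")
try:
    pool.append("b", 30)   # does not fit while "a" holds 105 slots
    admitted = True
except MemoryError:
    admitted = False

freed = pool.release("a")  # request "a" completes
pool.append("b", 30)       # now there is room for "b"
print(admitted, freed, pool.free_slots())
```

The two sequences have independent lifecycles but compete for the same budget, which is why completion or eviction of one request directly determines whether another can proceed.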
Memory Optimization Techniques
Several techniques reduce the memory footprint of LLM state. KV cache quantization stores cached values in lower precision (FP8, INT8, INT4) at the cost of some accuracy degradation, which tends to grow as precision drops. Grouped-query attention (GQA) reduces the number of KV heads relative to query heads, shrinking the cache in proportion to that ratio. Token pruning removes cached entries for tokens deemed unimportant by attention patterns. Paged attention manages cache memory in fixed-size pages (similar to virtual memory), which largely eliminates fragmentation by limiting waste to the unused slots in each sequence's last page. These techniques can be combined to achieve significant memory savings, enabling longer contexts and higher concurrency on the same hardware.
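A rough calculation shows how these savings combine. This sketch reuses the 32-layer, 128-dimensional-head example model. The choice of 8 KV heads and a 16-token page is an assumption for illustration, and the quantized figure ignores the small overhead of scale factors that real quantization schemes store.

```python
import math


def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # Factor of 2: one key vector plus one value vector.
    return n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem


baseline = kv_bytes_per_token(32, 32, 128, 2)  # full multi-head attention, FP16
gqa = kv_bytes_per_token(32, 8, 128, 2)        # GQA with 8 KV heads (assumed)
gqa_int8 = kv_bytes_per_token(32, 8, 128, 1)   # GQA plus 1-byte elements


def pages_needed(n_tokens, page_size=16):
    """Pages a sequence occupies when the cache is split into fixed-size pages."""
    return math.ceil(n_tokens / page_size)


def wasted_slots(n_tokens, page_size=16):
    """Unused slots, all of them in the sequence's last page."""
    return pages_needed(n_tokens, page_size) * page_size - n_tokens


print(baseline // gqa, "x smaller with GQA")
print(baseline // gqa_int8, "x smaller with GQA and INT8")
print(pages_needed(100), "pages,", wasted_slots(100), "unused slots")
```

With paging, the unused space per sequence is always less than one page, independent of how long the sequence grows.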
Implementation Notes
In the Ollama codebase, memory architecture is managed through an abstract memory interface that supports both attention-based and recurrent state patterns. The causal memory implementation manages standard KV caches for pure transformer models, handling allocation, growth, defragmentation, and deallocation. The recurrent memory implementation manages fixed-size state buffers for SSM-based models. The hybrid memory implementation composes both, routing layer-level operations to the appropriate sub-system based on whether each layer is an attention layer or a recurrent layer. Memory allocation decisions are informed by the hardware discovery system, which determines available VRAM on each device and guides the scheduler in deciding how many concurrent sequences can be served.
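The link between available VRAM and concurrency can be sketched as a back-of-the-envelope estimate. This is not Ollama's actual scheduling logic: the function, its parameters, and the example figures are hypothetical, and a real scheduler would also account for compute buffers and multi-device splits.

```python
def max_concurrent_sequences(free_vram_bytes, weights_bytes, ctx_len,
                             kv_bytes_per_token, reserve_bytes=0):
    """Estimate how many full-context KV caches fit beside the model weights.

    All inputs are assumptions supplied by the caller; reserve_bytes stands in
    for scratch buffers and other overhead.
    """
    available = free_vram_bytes - weights_bytes - reserve_bytes
    if available <= 0:
        return 0
    return available // (ctx_len * kv_bytes_per_token)


GiB = 2**30

# Hypothetical device: 24 GiB free, 14 GiB of weights, 1 GiB held in reserve,
# 4096-token contexts at 512 KiB of KV cache per token.
n_seqs = max_concurrent_sequences(
    free_vram_bytes=24 * GiB,
    weights_bytes=14 * GiB,
    ctx_len=4096,
    kv_bytes_per_token=512 * 1024,
    reserve_bytes=1 * GiB,
)
print(n_seqs)
```

For a purely recurrent model the per-sequence cost would be a constant state size instead of `ctx_len * kv_bytes_per_token`, so the same budget would typically admit more sequences.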