Implementation:Ollama Ollama Llama Memory Recurrent
| Knowledge Sources | |
|---|---|
| Domains | LLM Inference, Memory Management |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the recurrent-state memory for recurrent and SSM-based models (such as Mamba and RWKV). It manages the per-layer recurrent state tensors and the per-layer short-convolution state tensors.
Description
The constructor allocates per-layer R (recurrent state) and S (short convolution state) tensors, choosing a backend buffer type for each layer. find_slot places a batch into memory cells according to its sequence IDs and evicts slots when the memory is full. The class also maintains cell metadata (sequence IDs, positions, source tracking), which is used to route state during inference. The llama_memory_recurrent_context class manages batch-level state and prepares graph inputs.
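The slot placement described above can be pictured with a toy model. This is a simplified sketch, not the code in llama-memory-recurrent.cpp: `ToyCell`, `ToyRecurrentMemory`, and the single-sequence `find_slot`/`seq_rm` signatures are hypothetical, each cell is assumed to hold exactly one sequence, and the sketch reports failure when no cell is free instead of evicting.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for one memory cell: it records which
// sequence owns the cell and the last position written for that sequence.
struct ToyCell {
    int32_t seq_id = -1; // -1 marks an empty cell
    int32_t pos    = -1; // last position stored in this cell
};

// Toy recurrent memory with one cell per active sequence (an assumption).
struct ToyRecurrentMemory {
    std::vector<ToyCell> cells;

    explicit ToyRecurrentMemory(uint32_t mem_size) : cells(mem_size) {}

    // Return the index of the cell used for seq_id, or -1 if none is available.
    int32_t find_slot(int32_t seq_id, int32_t pos) {
        assert(seq_id >= 0);
        int32_t first_empty = -1;
        for (std::size_t i = 0; i < cells.size(); ++i) {
            if (cells[i].seq_id == seq_id) {
                cells[i].pos = pos; // the sequence already owns a cell: reuse it
                return static_cast<int32_t>(i);
            }
            if (cells[i].seq_id < 0 && first_empty < 0) {
                first_empty = static_cast<int32_t>(i);
            }
        }
        if (first_empty < 0) {
            return -1; // memory is full; this sketch does not evict
        }
        cells[first_empty].seq_id = seq_id;
        cells[first_empty].pos    = pos;
        return first_empty;
    }

    // Release the cell owned by seq_id; returns true if a cell was freed.
    bool seq_rm(int32_t seq_id) {
        for (auto & c : cells) {
            if (c.seq_id == seq_id) {
                c = ToyCell{};
                return true;
            }
        }
        return false;
    }
};
```

In this sketch a sequence keeps the same cell across steps, so its state stays in place while only the stored position advances. A new sequence is rejected once every cell is owned, until `seq_rm` frees one.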
Usage
Used for recurrent and SSM architectures (Mamba, RWKV, Griffin) that carry a running hidden state instead of a KV cache. The state size is constant in the sequence length, O(1), whereas an attention KV cache grows as O(n) with the number of tokens n. This makes these models memory-efficient for very long contexts.
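The difference in growth can be made concrete with a small element count. This is an illustrative sketch only: `ToyDims` and both functions are hypothetical, the dimensions are made up, and the KV formula assumes plain attention with one K and one V vector of width `n_embd` per token and layer.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model dimensions used only for this illustration.
struct ToyDims {
    std::size_t n_layer; // number of layers
    std::size_t n_embd;  // per-token width of K (and of V) in an attention layer
    std::size_t n_state; // recurrent + convolution state elements per layer
};

// Elements held by a KV cache after n_tokens: one K and one V vector
// per layer per token, so the total grows linearly with n_tokens.
std::size_t kv_cache_elems(const ToyDims & d, std::size_t n_tokens) {
    return 2 * d.n_layer * d.n_embd * n_tokens;
}

// Elements held by recurrent state for one sequence: one fixed-size state
// per layer, independent of how many tokens have been processed.
std::size_t recurrent_state_elems(const ToyDims & d, std::size_t /*n_tokens*/) {
    return d.n_layer * d.n_state;
}
```

With these toy formulas, multiplying the context length by 100 multiplies the KV cache size by 100 and leaves the recurrent state size unchanged.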
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-memory-recurrent.cpp
- Lines: 1-1167
Signature
llama_memory_recurrent::llama_memory_recurrent(
    const llama_model & model,
    ggml_type type_r,
    ggml_type type_s,
    bool offload,
    uint32_t mem_size,
    uint32_t n_seq_max,
    const layer_filter_cb & filter);
void llama_memory_recurrent::clear(bool data);
bool llama_memory_recurrent::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
bool llama_memory_recurrent::find_slot(const llama_ubatch & ubatch);
bool llama_memory_recurrent::prepare(const std::vector<llama_ubatch> & ubatches);
Import
#include "llama-memory-recurrent.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model & | Yes | Model providing layer config and device info |
| type_r | ggml_type | Yes | Data type for recurrent state tensors |
| type_s | ggml_type | Yes | Data type for convolution state tensors |
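| offload | bool | Yes | Whether state tensors are placed in each layer's device buffer instead of host memory |
| mem_size | uint32_t | Yes | Total number of memory cells |
| n_seq_max | uint32_t | Yes | Maximum number of concurrent sequences |
| filter | const layer_filter_cb & | Yes | Callback selecting which layers receive state tensors |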
Outputs
| Name | Type | Description |
|---|---|---|
| r_l | std::vector<ggml_tensor*> | Per-layer recurrent state tensors |
| s_l | std::vector<ggml_tensor*> | Per-layer convolution state tensors |
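How the two per-layer vectors relate to the constructor inputs can be sketched as follows. This is a hedged illustration, not the library code: `ToyTensor`, `toy_layer_filter`, `ToyLayerStates`, and `make_layer_states` are hypothetical names, and the sizing rule (per-cell state size times `mem_size`) and the handling of an empty filter (all layers included) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a state tensor; only its element count is kept.
struct ToyTensor {
    std::size_t n_elems;
};

// Hypothetical layer filter: returns true if layer il should get state tensors.
using toy_layer_filter = std::function<bool(int32_t)>;

struct ToyLayerStates {
    std::vector<std::unique_ptr<ToyTensor>> r_l; // per-layer recurrent state
    std::vector<std::unique_ptr<ToyTensor>> s_l; // per-layer convolution state
};

// Allocate one R and one S tensor per selected layer. Each tensor holds the
// state of all mem_size cells, so its size is (per-cell size) * mem_size.
ToyLayerStates make_layer_states(int32_t n_layer,
                                 std::size_t n_r_per_cell,
                                 std::size_t n_s_per_cell,
                                 uint32_t mem_size,
                                 const toy_layer_filter & filter) {
    assert(n_layer >= 0);
    ToyLayerStates out;
    out.r_l.resize(static_cast<std::size_t>(n_layer));
    out.s_l.resize(static_cast<std::size_t>(n_layer));
    for (int32_t il = 0; il < n_layer; ++il) {
        if (filter && !filter(il)) {
            continue; // filtered-out layers keep null entries
        }
        out.r_l[il].reset(new ToyTensor{n_r_per_cell * mem_size});
        out.s_l[il].reset(new ToyTensor{n_s_per_cell * mem_size});
    }
    return out;
}
```

Keeping the vectors indexed by layer number, with null entries for skipped layers, lets callers look up state by `il` directly without a separate index map.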
Usage Examples
// Created internally for Mamba/RWKV models
auto mem = std::make_unique<llama_memory_recurrent>(
    model, type_r, type_s, offload, mem_size, n_seq_max, filter);
// Prepare ubatches and check the result before using the memory
bool ok = mem->prepare(ubatches);
// Access the state tensors of layer il. Here ctx is the
// llama_memory_recurrent_context for the current batch.
ggml_tensor * r = ctx->get_r_l(il);
ggml_tensor * s = ctx->get_s_l(il);