Implementation:Ollama Ollama Llama Memory Recurrent

Knowledge Sources	Ollama
Domains	LLM Inference, Memory Management
Last Updated	2025-02-15 00:00 GMT

Overview

Implements the recurrent state memory system for recurrent and SSM-based models (Mamba, RWKV), managing the per-layer R (rolling state, such as a short-convolution or token-shift state) and S (recurrent state, such as the SSM state) cache tensors.

Description

The constructor allocates per-layer R (rolling state, such as Mamba's short-convolution state) and S (recurrent state, such as the SSM state) tensors, choosing a backend buffer type for each layer. find_slot places a batch into memory cells based on its sequence IDs, evicting slots when the memory is full. The class keeps per-cell metadata (sequence IDs, positions, source tracking) so that states can be routed to the right sequences during inference. The llama_memory_recurrent_context class manages batch-level state and prepares graph inputs.
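The slot placement described above can be sketched with a toy cell table. This is a simplified illustration, not the code in llama-memory-recurrent.cpp: toy_cell and toy_find_slot are hypothetical names, each call places a single sequence rather than a whole ubatch, and eviction is left out, so the function simply reports failure when no cell is free.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, simplified cell metadata (not the real llama.cpp struct).
struct toy_cell {
    int32_t           pos = -1;  // last position written, -1 while empty
    std::set<int32_t> seq_ids;   // sequences that currently own this cell

    bool is_empty() const { return seq_ids.empty(); }
};

// Returns the index of the cell assigned to seq_id, or -1 if no cell is free.
int32_t toy_find_slot(std::vector<toy_cell> & cells, int32_t seq_id, int32_t pos) {
    assert(seq_id >= 0 && pos >= 0);

    // 1. A sequence that already owns a cell keeps using it.
    for (std::size_t i = 0; i < cells.size(); ++i) {
        if (cells[i].seq_ids.count(seq_id) > 0) {
            cells[i].pos = pos;
            return static_cast<int32_t>(i);
        }
    }

    // 2. Otherwise claim the first empty cell.
    for (std::size_t i = 0; i < cells.size(); ++i) {
        if (cells[i].is_empty()) {
            cells[i].seq_ids.insert(seq_id);
            cells[i].pos = pos;
            return static_cast<int32_t>(i);
        }
    }

    // 3. Every cell is taken. Eviction is deliberately left out of this sketch.
    return -1;
}
```

With two cells, sequences 7 and 9 each claim one cell, a later call for sequence 7 returns its original cell with an updated position, and a third sequence is rejected.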

Usage

Used for recurrent/SSM architectures (Mamba, RWKV, Griffin) that carry a running hidden state instead of a KV cache. The state kept per sequence has a fixed size, O(1) in the number of tokens processed, whereas an attention KV cache grows as O(n) with context length n. This makes these models memory-efficient for very long contexts.
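The scaling difference can be made concrete with a small size calculation. The sketch below is illustrative only: toy_dims and both functions are hypothetical, and the dimensions are placeholders rather than values taken from any real model.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-layer sizes; placeholders, not real model values.
struct toy_dims {
    uint64_t n_layer;         // number of layers holding state
    uint64_t n_embd_kv;       // K plus V elements stored per token, per layer
    uint64_t n_embd_state;    // R plus S elements stored per sequence, per layer
    uint64_t bytes_per_elem;  // element size of the chosen data type
};

// Attention: one K/V entry per token, so the cache grows linearly with n_tokens.
uint64_t kv_cache_bytes(const toy_dims & d, uint64_t n_tokens) {
    assert(d.bytes_per_elem > 0);
    return d.n_layer * d.n_embd_kv * d.bytes_per_elem * n_tokens;
}

// Recurrent: one fixed-size state per sequence, independent of n_tokens.
uint64_t recurrent_state_bytes(const toy_dims & d, uint64_t n_tokens) {
    assert(d.bytes_per_elem > 0);
    (void) n_tokens;  // the token count does not affect the state size
    return d.n_layer * d.n_embd_state * d.bytes_per_elem;
}
```

For short contexts the recurrent state can be the larger of the two. The advantage appears as the context grows, because only the KV cache figure scales with the token count.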

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/src/llama-memory-recurrent.cpp
Lines: 1-1167

Signature

llama_memory_recurrent::llama_memory_recurrent(
        const llama_model & model,
                ggml_type   type_r,
                ggml_type   type_s,
                     bool   offload,
                 uint32_t   mem_size,
                 uint32_t   n_seq_max,
    const layer_filter_cb & filter);

void llama_memory_recurrent::clear(bool data);
bool llama_memory_recurrent::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1);
bool llama_memory_recurrent::find_slot(const llama_ubatch & ubatch);
bool llama_memory_recurrent::prepare(const std::vector<llama_ubatch> & ubatches);

Import

#include "llama-memory-recurrent.h"

I/O Contract

Inputs

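Name	Type	Required	Description
model	const llama_model &	Yes	Model providing layer config and device info
type_r	ggml_type	Yes	Data type for the R (rolling/convolution state) tensors
type_s	ggml_type	Yes	Data type for the S (recurrent/SSM state) tensors
offload	bool	Yes	Whether state tensors may be placed in device (offloaded) buffers
mem_size	uint32_t	Yes	Total number of memory cells
n_seq_max	uint32_t	Yes	Maximum concurrent sequences
filter	const layer_filter_cb &	Yes	Callback selecting which layers receive state tensors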

Outputs

Name	Type	Description
r_l	std::vector<ggml_tensor*>	Per-layer R (rolling/convolution state) tensors
s_l	std::vector<ggml_tensor*>	Per-layer S (recurrent/SSM state) tensors
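How the per-layer vectors relate to mem_size and the layer filter can be sketched without ggml. Everything here is a stand-in: toy_tensor, toy_layer_filter, toy_recurrent_state and toy_allocate are hypothetical, and n_embd_r and n_embd_s are assumed to be the number of elements one memory cell needs in the R and S tensors respectively. The sketch also assumes that filtered-out layers keep a null entry so the vectors stay indexable by layer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical stand-in for a ggml tensor: a flat buffer of floats.
struct toy_tensor {
    std::vector<float> data;
};

// Assumed shape of a layer filter: returns true if layer il should hold state.
using toy_layer_filter = std::function<bool(int32_t il)>;

struct toy_recurrent_state {
    std::vector<std::unique_ptr<toy_tensor>> r_l;  // one entry per layer
    std::vector<std::unique_ptr<toy_tensor>> s_l;  // one entry per layer
};

toy_recurrent_state toy_allocate(int32_t n_layer, uint32_t mem_size,
                                 uint32_t n_embd_r, uint32_t n_embd_s,
                                 const toy_layer_filter & filter) {
    assert(n_layer >= 0);

    toy_recurrent_state st;
    st.r_l.resize(n_layer);  // entries start out null
    st.s_l.resize(n_layer);

    for (int32_t il = 0; il < n_layer; ++il) {
        // Skip layers rejected by the filter; their entries stay null.
        if (filter && !filter(il)) {
            continue;
        }
        // Each tensor holds one fixed-size state slice per memory cell.
        st.r_l[il] = std::make_unique<toy_tensor>();
        st.r_l[il]->data.assign(static_cast<std::size_t>(n_embd_r) * mem_size, 0.0f);
        st.s_l[il] = std::make_unique<toy_tensor>();
        st.s_l[il]->data.assign(static_cast<std::size_t>(n_embd_s) * mem_size, 0.0f);
    }
    return st;
}
```

Sizing each tensor as elements-per-cell times mem_size means a cell index maps directly to an offset in every layer's buffer.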

Usage Examples

// Created internally for Mamba/RWKV models
auto mem = std::make_unique<llama_memory_recurrent>(
    model, type_r, type_s, offload, mem_size, n_seq_max, filter);

// Prepare ubatches; the result is false if they cannot be placed
bool ok = mem->prepare(ubatches);

// Access the state tensors of layer il.
// ctx is assumed to be the llama_memory_recurrent_context for the current batch.
ggml_tensor * r = ctx->get_r_l(il);
ggml_tensor * s = ctx->get_s_l(il);

Related Pages

Principle:Ollama_Ollama_LLM_Memory_Architecture
