
Implementation:Ggml org Llama cpp Llama Memory Seq Pos Max

From Leeroopedia
Aspect               Detail
Implementation Name  Llama Memory Seq Pos Max
Doc Type             API Doc
Category             Memory Management
Workflow             Interactive_Chat
Applies To           llama.cpp
Status               Active

Overview

Description

The llama_memory_seq_pos_max function returns the largest token position currently stored in the KV cache for a given sequence. The llama_n_ctx function returns the total context window size of a context. Together, these two functions enable applications to monitor KV cache utilization and detect context overflow before it occurs. They form the foundation of context window management in llama.cpp chat applications.

Usage

These functions are called before every llama_decode call during the generation loop. By comparing llama_memory_seq_pos_max(mem, 0) + 1 + batch.n_tokens against llama_n_ctx(ctx), the application determines whether there is sufficient space to process the next batch. They are also used to detect whether the context is empty (first turn), which affects BOS token handling during tokenization.

Code Reference

Attribute                                   Value
Source Location (llama_memory_seq_pos_max)  include/llama.h:747-749
Source Location (llama_n_ctx)               include/llama.h:517
Related Functions                           llama_get_memory(), llama_memory_seq_pos_min(), llama_memory_can_shift()
Import                                      #include "llama.h"

Signatures:

// Returns the largest position present in the memory for the specified sequence
// Note that all positions in the range [pos_min, pos_max] are guaranteed to be present
// Returns -1 if the sequence is empty
llama_pos llama_memory_seq_pos_max(llama_memory_t mem, llama_seq_id seq_id);

// Returns the total context window size
uint32_t llama_n_ctx(const struct llama_context * ctx);

// Get the memory handle from a context
llama_memory_t llama_get_memory(const struct llama_context * ctx);

// Related memory query functions
llama_pos llama_memory_seq_pos_min(llama_memory_t mem, llama_seq_id seq_id);
bool llama_memory_can_shift(llama_memory_t mem);

Type definitions:

typedef int32_t llama_pos;
typedef int32_t llama_seq_id;
typedef struct llama_memory_i * llama_memory_t;

I/O Contract

llama_memory_seq_pos_max:

Direction  Name    Type            Description
Input      mem     llama_memory_t  Memory handle obtained from llama_get_memory(ctx)
Input      seq_id  llama_seq_id    Sequence identifier (typically 0 for single-sequence chat)
Output     return  llama_pos       Largest position in cache, or -1 if the sequence is empty

llama_n_ctx:

Direction  Name    Type                          Description
Input      ctx     const struct llama_context *  The inference context
Output     return  uint32_t                      Total context window size in tokens

Preconditions:

  • The context must be initialized (non-NULL)
  • The memory handle must be obtained from a valid context via llama_get_memory(ctx)

Postconditions:

  • Return value of -1 from llama_memory_seq_pos_max indicates an empty sequence (no tokens processed yet)
  • Return value >= 0 indicates the highest occupied position; positions 0 through this value are all present
  • llama_n_ctx always returns the value set during context creation (or the model default if 0 was specified)

Invariants:

  • All positions in [llama_memory_seq_pos_min(mem, seq_id), llama_memory_seq_pos_max(mem, seq_id)] are guaranteed to be present in memory
  • The number of occupied positions is pos_max - pos_min + 1 (or 0 if the sequence is empty)

Usage Examples

Context overflow detection (from simple-chat):

// Before each llama_decode call in the generation loop
int n_ctx = llama_n_ctx(ctx);
int n_ctx_used = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) + 1;
if (n_ctx_used + batch.n_tokens > n_ctx) {
    printf("\033[0m\n"); // reset terminal color before exiting
    fprintf(stderr, "context size exceeded\n");
    exit(0);
}

First-turn detection (from simple-chat):

// Returns true if no tokens have been processed yet for sequence 0
const bool is_first = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) == -1;

// Used when tokenizing to control BOS token insertion
llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens, n_tokens, is_first, true);

Computing remaining context capacity:

int n_ctx = llama_n_ctx(ctx);
llama_memory_t mem = llama_get_memory(ctx);
int pos_max = llama_memory_seq_pos_max(mem, 0);
int n_used = (pos_max == -1) ? 0 : pos_max + 1;
int n_remaining = n_ctx - n_used;

if (n_remaining < min_required_tokens) {
    // Implement context shifting or conversation pruning
}

Context shifting pattern (advanced):

// Check whether the memory backend supports position shifting
llama_memory_t mem = llama_get_memory(ctx);
if (llama_memory_can_shift(mem)) {
    const int n_discard = llama_n_ctx(ctx) / 4;  // discard the oldest quarter
    // Remove positions [0, n_discard) from sequence 0
    llama_memory_seq_rm(mem, 0, 0, n_discard);
    // Shift the remaining positions back by n_discard (p1 = -1 means "to the end")
    llama_memory_seq_add(mem, 0, n_discard, -1, -n_discard);
}
