
Implementation:Ggml org Llama cpp Llama Memory Seq Pos Max

From Leeroopedia
Aspect               Detail
Implementation Name  Llama Memory Seq Pos Max
Doc Type             API Doc
Category             Memory Management
Workflow             Interactive_Chat
Applies To           llama.cpp
Status               Active

Overview

Description

The llama_memory_seq_pos_max function returns the largest token position currently stored in the KV cache for a given sequence. The llama_n_ctx function returns the total context window size of a context. Together, these two functions enable applications to monitor KV cache utilization and detect context overflow before it occurs. They form the foundation of context window management in llama.cpp chat applications.

Usage

These functions are called before every llama_decode call during the generation loop. By comparing llama_memory_seq_pos_max(mem, 0) + 1 + batch.n_tokens against llama_n_ctx(ctx), the application determines whether there is sufficient space to process the next batch. They are also used to detect whether the context is empty (first turn), which affects BOS token handling during tokenization.

Code Reference

Attribute                                   Value
Source Location (llama_memory_seq_pos_max)  include/llama.h:747-749
Source Location (llama_n_ctx)               include/llama.h:517
Related Functions                           llama_get_memory(), llama_memory_seq_pos_min(), llama_memory_can_shift()
Import                                      #include "llama.h"

Signatures:

// Returns the largest position present in the memory for the specified sequence
// Note that all positions in the range [pos_min, pos_max] are guaranteed to be present
// Returns -1 if the sequence is empty
llama_pos llama_memory_seq_pos_max(llama_memory_t mem, llama_seq_id seq_id);

// Returns the total context window size
uint32_t llama_n_ctx(const struct llama_context * ctx);

// Get the memory handle from a context
llama_memory_t llama_get_memory(const struct llama_context * ctx);

// Related memory query functions
llama_pos llama_memory_seq_pos_min(llama_memory_t mem, llama_seq_id seq_id);
bool llama_memory_can_shift(llama_memory_t mem);

Type definitions:

typedef int32_t llama_pos;
typedef int32_t llama_seq_id;
typedef struct llama_memory_i * llama_memory_t;

I/O Contract

llama_memory_seq_pos_max:

Direction  Name    Type            Description
Input      mem     llama_memory_t  Memory handle obtained from llama_get_memory(ctx)
Input      seq_id  llama_seq_id    Sequence identifier (typically 0 for single-sequence chat)
Output     return  llama_pos       Largest position in cache, or -1 if the sequence is empty

llama_n_ctx:

Direction  Name    Type                          Description
Input      ctx     const struct llama_context *  The inference context
Output     return  uint32_t                      Total context window size in tokens

Preconditions:

  • The context must be initialized (non-NULL)
  • The memory handle must be obtained from a valid context via llama_get_memory(ctx)

Postconditions:

  • Return value of -1 from llama_memory_seq_pos_max indicates an empty sequence (no tokens processed yet)
  • Return value >= 0 indicates the highest occupied position; positions 0 through this value are all present
  • llama_n_ctx always returns the value set during context creation (or the model default if 0 was specified)

Invariants:

  • All positions in [llama_memory_seq_pos_min(mem, seq_id), llama_memory_seq_pos_max(mem, seq_id)] are guaranteed to be present in memory
  • The number of occupied positions is pos_max - pos_min + 1 (or 0 if the sequence is empty)

Usage Examples

Context overflow detection (from simple-chat):

// Before each llama_decode call in the generation loop
int n_ctx = llama_n_ctx(ctx);
int n_ctx_used = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) + 1;
if (n_ctx_used + batch.n_tokens > n_ctx) {
    printf("\033[0m\n"); // reset terminal color before exiting
    fprintf(stderr, "context size exceeded\n");
    exit(0);
}

First-turn detection (from simple-chat):

// Returns true if no tokens have been processed yet for sequence 0
const bool is_first = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) == -1;

// Used when tokenizing to control BOS token insertion
llama_tokenize(vocab, prompt.c_str(), prompt.size(), tokens, n_tokens, is_first, true);

Computing remaining context capacity:

int n_ctx = llama_n_ctx(ctx);
llama_memory_t mem = llama_get_memory(ctx);
int pos_max = llama_memory_seq_pos_max(mem, 0);
int n_used = (pos_max == -1) ? 0 : pos_max + 1;
int n_remaining = n_ctx - n_used;

if (n_remaining < min_required_tokens) {
    // Implement context shifting or conversation pruning
}

Context shifting pattern (advanced):

// Check whether the memory backend supports position shifting
llama_memory_t mem = llama_get_memory(ctx);
if (llama_memory_can_shift(mem)) {
    const int n_discard = llama_n_ctx(ctx) / 4;  // discard the oldest quarter
    // Remove positions [0, n_discard) from sequence 0
    llama_memory_seq_rm(mem, 0, 0, n_discard);
    // Shift the remaining positions back by n_discard (p1 = -1 means "to the end")
    llama_memory_seq_add(mem, 0, n_discard, -1, -n_discard);
}
