
Principle:Ggml org Llama cpp Context Window Management

From Leeroopedia
  • Principle Name: Context Window Management
  • Category: Memory Management
  • Workflow: Interactive_Chat
  • Applies To: llama.cpp
  • Status: Active

Overview

Description

Context Window Management is the principle of managing the finite key-value (KV) cache in long-running conversations. The KV cache stores the computed attention keys and values for all processed tokens and has a fixed size determined by the n_ctx parameter set during context initialization. As a multi-turn conversation progresses, each user message and assistant response consumes positions in the KV cache. When the cache is full, no further tokens can be processed, and the application must either terminate generation, implement a context shifting strategy, or discard earlier conversation turns.

Usage

Context window management is an ongoing concern throughout the lifetime of a chat session. Before every call to llama_decode, the application must check whether there is sufficient remaining space in the context window to accommodate the incoming batch of tokens. If the available space is insufficient, the application must take corrective action.

The two key query functions are:

  • llama_n_ctx(ctx): Returns the total context window size
  • llama_memory_seq_pos_max(mem, seq_id): Returns the highest token position currently stored in memory for a given sequence (or -1 if the sequence is empty), which indicates how much of the context has been consumed

The available space is computed as: n_ctx - (llama_memory_seq_pos_max(mem, 0) + 1).

Theoretical Basis

KV cache fundamentals: In a transformer model, the self-attention mechanism requires access to the keys and values of all previous tokens at every generation step. Storing these precomputed keys and values (the "KV cache") avoids redundant computation but creates a fixed-size memory resource that must be managed. The cache size is proportional to n_ctx * n_layers * n_heads * d_head, and for large models at large context sizes, it can consume several gigabytes of memory.

Position tracking: Each token in the KV cache is associated with a position index. Positions are assigned sequentially as tokens are processed. The function llama_memory_seq_pos_max returns the maximum position index for a given sequence, effectively reporting how many token slots have been consumed. When this value plus the size of the next batch exceeds n_ctx, the context is full.

Overflow detection: The simplest approach (used in the simple-chat example) is to detect context overflow and terminate. This is implemented as a pre-decode check:

int n_ctx_used = llama_memory_seq_pos_max(llama_get_memory(ctx), 0) + 1;
if (n_ctx_used + batch.n_tokens > n_ctx) {
    // context is full
}

Advanced strategies: More sophisticated applications can handle context overflow without terminating:

  • Context shifting (KV cache shifting): Discard the oldest N tokens from the KV cache using llama_memory_seq_rm and shift remaining positions using llama_memory_seq_add with a negative delta. This preserves recent conversation context at the cost of losing older history. Only supported when llama_memory_can_shift(mem) returns true.
  • Conversation pruning: Remove earlier turns from the conversation history and re-process a summarized or truncated version.
  • Sequence management: For multi-user or branching conversation scenarios, different sequences can be managed independently using different seq_id values, each with their own position tracking via llama_memory_seq_pos_min and llama_memory_seq_pos_max.

Sequence position range guarantees: The API guarantees that all positions in the range [pos_min, pos_max] are present in memory for a given sequence. This invariant simplifies reasoning about cache state: the occupied portion is always a contiguous range.

Training context vs. runtime context: The model has a training context length (accessible via llama_model_n_ctx_train) that represents the maximum context the model was trained to handle. The runtime n_ctx can be set to any value, but attention quality may degrade for positions beyond the training context length. RoPE scaling techniques (linear, YaRN, LongRoPE) can extend effective context beyond the training length.
