Heuristic: ggml-org/llama.cpp Context Size Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
Context size (n_ctx) is automatically padded to 256-token boundaries for performance, and should match the model's training context to avoid quality issues.
Description
The context size parameter controls how many tokens the model can process in a single session. llama.cpp automatically pads this value to 256-token boundaries for backend performance optimization. When the context is smaller than the model's training context, the full model capacity is underutilized. When it exceeds the training context, outputs may degrade unless RoPE scaling (e.g., YARN) is configured. Understanding these alignment rules prevents unexpected memory allocation and context warnings.
Usage
Use this heuristic when configuring context size (-c) for inference, especially when encountering context-related warnings, when memory budgets are tight, or when needing to extend context beyond the model's training length.
The Insight (Rule of Thumb)
- Action 1: Set `n_ctx = 0` (default) to use the model's training context size automatically.
- Action 2: When setting a custom context, be aware it will be padded up to the next 256-token boundary.
- Action 3: If `n_ctx < n_ctx_train`, you will see a warning and the model's full capacity is not used.
- Action 4: If `n_ctx > n_ctx_train`, configure RoPE scaling parameters to avoid quality degradation.
- Value: Minimum context when using `--fit` is 4096 tokens.
- Trade-off: Larger context uses more KV cache memory but allows processing longer conversations/documents.
Reasoning
The 256-token padding is applied at context initialization from src/llama-context.cpp:172:
```cpp
// ref: https://github.com/ggml-org/llama.cpp/pull/17046#discussion_r2503085732
cparams.n_ctx = GGML_PAD(cparams.n_ctx, 256);
```
Context divisibility handling from src/llama-context.cpp:174-188:
```cpp
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
    cparams.n_ctx_seq = GGML_PAD(cparams.n_ctx_seq, 256);

    if (cparams.n_ctx != cparams.n_ctx_seq * cparams.n_seq_max) {
        cparams.n_ctx = cparams.n_ctx_seq * cparams.n_seq_max;
        LLAMA_LOG_WARN("%s: n_ctx is not divisible by n_seq_max"
                       " - rounding down to %u\n", __func__, cparams.n_ctx);
    }
}
```
Context vs. training context warnings from src/llama-context.cpp:
```cpp
// When context is smaller than training context:
LLAMA_LOG_WARN("n_ctx_seq (%u) < n_ctx_train (%u)"
        " -- the full capacity of the model will not be utilized\n",
        n_ctx_seq, hparams.n_ctx_train);

// When context exceeds training context:
LLAMA_LOG_WARN("n_ctx_seq (%u) > n_ctx_train (%u)"
        " -- possible training context overflow\n",
        n_ctx_seq, hparams.n_ctx_train);
```
The minimum context for memory fitting from common/common.h:386:
```cpp
int32_t fit_params_min_ctx = 4096; // minimum context size when trying to reduce memory use
```