Heuristic: ggml-org/llama.cpp Context Size Alignment
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Management |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
Context size (n_ctx) is automatically padded to 256-token boundaries for performance, and should match the model's training context to avoid quality issues.
Description
The context size parameter controls how many tokens the model can process in a single session. llama.cpp automatically pads this value to 256-token boundaries for backend performance optimization. When the context is smaller than the model's training context, the full model capacity is underutilized. When it exceeds the training context, outputs may degrade unless RoPE scaling (e.g., YARN) is configured. Understanding these alignment rules prevents unexpected memory allocation and context warnings.
Usage
Use this heuristic when configuring context size (-c) for inference, especially when encountering context-related warnings, when memory budgets are tight, or when needing to extend context beyond the model's training length.
The Insight (Rule of Thumb)
- Action 1: Set `n_ctx = 0` (default) to use the model's training context size automatically.
- Action 2: When setting a custom context, be aware it will be padded up to the next 256-token boundary.
- Action 3: If `n_ctx < n_ctx_train`, you will see a warning and the model's full capacity is not used.
- Action 4: If `n_ctx > n_ctx_train`, configure RoPE scaling parameters to avoid quality degradation.
- Value: Minimum context when using `--fit` is 4096 tokens.
- Trade-off: Larger context uses more KV cache memory but allows processing longer conversations/documents.
Reasoning
The 256-token padding is applied at context initialization from src/llama-context.cpp:172:
```cpp
// ref: https://github.com/ggml-org/llama.cpp/pull/17046#discussion_r2503085732
cparams.n_ctx = GGML_PAD(cparams.n_ctx, 256);
```
Context divisibility handling from src/llama-context.cpp:174-188:
```cpp
if (cparams.kv_unified) {
    cparams.n_ctx_seq = cparams.n_ctx;
} else {
    cparams.n_ctx_seq = cparams.n_ctx / cparams.n_seq_max;
    cparams.n_ctx_seq = GGML_PAD(cparams.n_ctx_seq, 256);

    if (cparams.n_ctx != cparams.n_ctx_seq * cparams.n_seq_max) {
        cparams.n_ctx = cparams.n_ctx_seq * cparams.n_seq_max;
        LLAMA_LOG_WARN("%s: n_ctx is not divisible by n_seq_max"
                       " - rounding down to %u\n", __func__, cparams.n_ctx);
    }
}
```
Context vs. training context warnings from src/llama-context.cpp:
```cpp
// When context is smaller than training context:
LLAMA_LOG_WARN("n_ctx_seq (%u) < n_ctx_train (%u)"
        " -- the full capacity of the model will not be utilized\n",
        n_ctx_seq, hparams.n_ctx_train);

// When context exceeds training context:
LLAMA_LOG_WARN("n_ctx_seq (%u) > n_ctx_train (%u)"
        " -- possible training context overflow\n",
        n_ctx_seq, hparams.n_ctx_train);
```
The minimum context for memory fitting from common/common.h:386:
```cpp
int32_t fit_params_min_ctx = 4096; // minimum context size when trying to reduce memory use
```