Principle:Ggml org Llama cpp Inference Context Creation
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | KV Cache Allocation, Batch Buffers, Thread Pools, Inference State Management | 2026-02-14 |
Overview
Description
Inference Context Creation is the step in the llama.cpp text generation pipeline that allocates and configures all runtime state required to perform inference with a loaded model. While the model object holds the static weights, the context holds the dynamic state: the key-value (KV) cache that stores attention history, the batch processing buffers, thread pool configuration, compute graph scheduling state, and performance counters.
A single model can be shared across multiple inference contexts, enabling concurrent or batched generation from the same weights without duplicating multi-gigabyte weight tensors in memory.
Usage
Context creation is performed after model loading and before any tokenization, decoding, or sampling. The caller configures the context through a parameters struct that controls the context window size, batch dimensions, threading, KV cache data types, and Flash Attention settings.
```c
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx     = 2048; // context window size (tokens)
ctx_params.n_batch   = 512;  // max tokens per llama_decode call
ctx_params.n_threads = 8;    // generation threads

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    // creation fails if, for example, the KV cache cannot be allocated
}
```
Theoretical Basis
The KV Cache
The key-value cache is the single largest runtime memory allocation in transformer inference. During autoregressive text generation, each new token must attend to all previous tokens in the sequence. Without caching, this would require recomputing the key and value projections for every previous token at every generation step, resulting in O(n^2) total work for generating n tokens.
The KV cache solves this by storing the key and value tensors from each layer's attention computation. When generating the next token, only that token's query, key, and value projections need to be computed; the keys and values for all previous positions are retrieved from the cache. This reduces per-step computation from O(n) to O(1) for the projection step (though the attention dot-product itself remains O(n)).
Cache sizing: The KV cache size is determined by:
- n_ctx -- the maximum sequence length (number of token positions)
- n_embd_head_k, n_embd_head_v -- the per-head key and value dimensions
- n_head_kv -- the number of key-value heads (may be fewer than query heads in GQA models)
- n_layer -- the number of transformer layers
- type_k, type_v -- the data types used for cache storage
Cache quantization: The type_k and type_v parameters allow the KV cache to use quantized data types (e.g., FP16, Q8_0, Q4_0) instead of FP32. This can dramatically reduce memory usage -- a Q8_0 KV cache uses roughly half the memory of FP16, and Q4_0 uses roughly a quarter. However, quantized V caches require Flash Attention to be enabled.
Batch Buffers
The context allocates internal buffers for processing token batches. Two size parameters control this:
- n_batch -- the logical maximum batch size. This is the maximum number of tokens that can be submitted in a single llama_decode call.
- n_ubatch -- the physical maximum micro-batch size. Internally, a large batch may be split into smaller micro-batches (ubatches) for processing. This controls the maximum size of each micro-batch and thus the peak memory usage during computation.
The distinction between logical and physical batch sizes allows users to submit large batches for prompt processing while keeping peak GPU memory usage bounded by the ubatch size.
Thread Pools
The context manages thread parallelism for CPU-side computation through two settings:
- n_threads -- the number of threads used during generation (processing a single token). For autoregressive generation, the computation per step is relatively small, and excessive threading can introduce overhead.
- n_threads_batch -- the number of threads used during batch processing (processing multiple tokens, e.g., prompt encoding). Batch operations have more parallelizable work and benefit from higher thread counts.
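A typical configuration sets the two independently, sketched here against llama_context_params (the specific thread counts are illustrative, not recommendations):

```c
// Sketch: fewer threads for per-token generation, more for prompt processing.
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_threads       = 8;  // single-token steps: small workload, low overhead
ctx_params.n_threads_batch = 16; // batch/prompt processing: more parallel work
```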
Flash Attention
The flash_attn_type parameter controls whether Flash Attention is used for the self-attention computation. Flash Attention is a memory-efficient attention algorithm that:
- Computes attention in tiles, avoiding the materialization of the full n x n attention matrix
- Reduces memory usage from O(n^2) to O(n) for the attention computation
- Is required when using quantized V cache types
- May provide speed improvements on supported hardware
The setting LLAMA_FLASH_ATTN_TYPE_AUTO lets the system decide based on model architecture and cache type compatibility.
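A sketch of pairing a quantized V cache with Flash Attention follows. The field and enum names are taken from recent llama.h, but they are assumptions here and may differ across llama.cpp versions:

```c
// Sketch: quantized KV cache requires Flash Attention for the V tensor.
llama_context_params ctx_params = llama_context_default_params();
ctx_params.type_k          = GGML_TYPE_Q8_0; // quantized K cache
ctx_params.type_v          = GGML_TYPE_Q8_0; // quantized V cache
ctx_params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO; // let llama.cpp verify compatibility
```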
Sequence Management
The n_seq_max parameter controls the maximum number of independent sequences that can be processed concurrently within a single context. This is used for:
- Parallel generation -- generating multiple independent completions simultaneously
- Beam search -- maintaining multiple candidate sequences during decoding
- Recurrent models -- maintaining distinct hidden states for different sequences
Each sequence has its own position tracking and can be independently managed (shifted, copied, removed) within the KV cache.
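As an illustration, two independent sequences can be interleaved in one batch by tagging each token with a sequence ID. This sketch assumes the llama_batch layout from llama.h (token/pos/seq_id arrays); tok_a and tok_b are placeholder token IDs, and the context must have been created with n_seq_max >= 2:

```c
// Sketch: one llama_batch carrying tokens for two independent sequences.
llama_batch batch = llama_batch_init(/*n_tokens_alloc=*/512,
                                     /*embd=*/0, /*n_seq_max=*/2);

// Each token names the sequence it belongs to; each sequence keeps
// its own position numbering within the KV cache.
batch.token[0]    = tok_a; batch.pos[0] = 0;
batch.n_seq_id[0] = 1;     batch.seq_id[0][0] = 0; // sequence 0, position 0

batch.token[1]    = tok_b; batch.pos[1] = 0;
batch.n_seq_id[1] = 1;     batch.seq_id[1][0] = 1; // sequence 1, position 0

batch.n_tokens = 2;
```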
Compute Graph Scheduling
The context internally creates a backend scheduler (ggml_backend_sched) that manages the distribution of compute operations across available backends. When the model has layers on both CPU and GPU, the scheduler handles:
- Partitioning the compute graph so that each operation runs on the backend where its input tensors reside
- Inserting data copy operations where tensors must be transferred between backends
- Managing scratch buffers for intermediate computation results
Related Pages
- Implementation:Ggml_org_Llama_cpp_Llama_Init_From_Model
- Principle:Ggml_org_Llama_cpp_GGUF_Model_Loading -- model must be loaded before creating a context
- Principle:Ggml_org_Llama_cpp_Batch_Decoding -- the context is used for batch decoding
- Heuristic:Ggml_org_Llama_cpp_Context_Size_Alignment