Principle:Ggml org Llama cpp Inference Context Creation
| Knowledge Sources | Domains | Last Updated |
|---|---|---|
| ggml-org/llama.cpp | KV Cache Allocation, Batch Buffers, Thread Pools, Inference State Management | 2026-02-14 |
Overview
Description
Inference Context Creation is the step in the llama.cpp text generation pipeline that allocates and configures all runtime state required to perform inference with a loaded model. While the model object holds the static weights, the context holds the dynamic state: the key-value (KV) cache that stores attention history, the batch processing buffers, thread pool configuration, compute graph scheduling state, and performance counters.
A single model can be shared across multiple inference contexts, enabling concurrent or batched generation from the same weights without duplicating multi-gigabyte weight tensors in memory.
Usage
Context creation is performed after model loading and before any tokenization, decoding, or sampling. The caller configures the context through a parameters struct that controls the context window size, batch dimensions, threading, KV cache data types, and Flash Attention settings.
```c
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_ctx     = 2048; // context window size (tokens)
ctx_params.n_batch   = 512;  // max tokens per llama_decode call
ctx_params.n_threads = 8;    // generation threads

llama_context * ctx = llama_init_from_model(model, ctx_params);
if (ctx == NULL) {
    // creation fails if, for example, the KV cache cannot be allocated
}
```
Theoretical Basis
The KV Cache
The key-value cache is the single largest runtime memory allocation in transformer inference. During autoregressive text generation, each new token must attend to all previous tokens in the sequence. Without caching, this would require recomputing the key and value projections for every previous token at every generation step, resulting in O(n^2) total work for generating n tokens.
The KV cache solves this by storing the key and value tensors from each layer's attention computation. When generating the next token, only that token's query, key, and value projections need to be computed; the keys and values for all previous positions are retrieved from the cache. This reduces per-step computation from O(n) to O(1) for the projection step (though the attention dot-product itself remains O(n)).
Cache sizing: The KV cache size is determined by:
- n_ctx -- the maximum sequence length (number of token positions)
- n_embd_head_k, n_embd_head_v -- the per-head key and value dimensions
- n_head_kv -- the number of key-value heads (may be fewer than query heads in GQA models)
- n_layer -- the number of transformer layers
- type_k, type_v -- the data types used for cache storage
Cache quantization: The type_k and type_v parameters allow the KV cache to use quantized data types (e.g., FP16, Q8_0, Q4_0) instead of FP32. This can dramatically reduce memory usage -- a Q8_0 KV cache uses roughly half the memory of FP16, and Q4_0 uses roughly a quarter. However, quantized V caches require Flash Attention to be enabled.
Batch Buffers
The context allocates internal buffers for processing token batches. Two size parameters control this:
- n_batch -- the logical maximum batch size. This is the maximum number of tokens that can be submitted in a single llama_decode call.
- n_ubatch -- the physical maximum micro-batch size. Internally, a large batch may be split into smaller micro-batches (ubatches) for processing. This controls the maximum size of each micro-batch and thus the peak memory usage during computation.
The distinction between logical and physical batch sizes allows users to submit large batches for prompt processing while keeping peak GPU memory usage bounded by the ubatch size.
Thread Pools
The context manages thread parallelism for CPU-side computation through two settings:
- n_threads -- the number of threads used during generation (processing a single token). For autoregressive generation, the computation per step is relatively small, and excessive threading can introduce overhead.
- n_threads_batch -- the number of threads used during batch processing (processing multiple tokens, e.g., prompt encoding). Batch operations have more parallelizable work and benefit from higher thread counts.
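A typical configuration sets the two independently, sketched here against llama_context_params (the specific thread counts are illustrative, not recommendations):

```c
// Sketch: fewer threads for per-token generation, more for prompt processing.
llama_context_params ctx_params = llama_context_default_params();
ctx_params.n_threads       = 8;  // single-token steps: small workload, low overhead
ctx_params.n_threads_batch = 16; // batch/prompt processing: more parallel work
```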
Flash Attention
The flash_attn_type parameter controls whether Flash Attention is used for the self-attention computation. Flash Attention is a memory-efficient attention algorithm that:
- Computes attention in tiles, avoiding the materialization of the full n x n attention matrix
- Reduces memory usage from O(n^2) to O(n) for the attention computation
- Is required when using quantized V cache types
- May provide speed improvements on supported hardware
The setting LLAMA_FLASH_ATTN_TYPE_AUTO lets the system decide based on model architecture and cache type compatibility.
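A sketch of pairing a quantized V cache with Flash Attention follows. The field and enum names are taken from recent llama.h, but they are assumptions here and may differ across llama.cpp versions:

```c
// Sketch: quantized KV cache requires Flash Attention for the V tensor.
llama_context_params ctx_params = llama_context_default_params();
ctx_params.type_k          = GGML_TYPE_Q8_0; // quantized K cache
ctx_params.type_v          = GGML_TYPE_Q8_0; // quantized V cache
ctx_params.flash_attn_type = LLAMA_FLASH_ATTN_TYPE_AUTO; // let llama.cpp verify compatibility
```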
Sequence Management
The n_seq_max parameter controls the maximum number of independent sequences that can be processed concurrently within a single context. This is used for:
- Parallel generation -- generating multiple independent completions simultaneously
- Beam search -- maintaining multiple candidate sequences during decoding
- Recurrent models -- maintaining distinct hidden states for different sequences
Each sequence has its own position tracking and can be independently managed (shifted, copied, removed) within the KV cache.
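As an illustration, two independent sequences can be interleaved in one batch by tagging each token with a sequence ID. This sketch assumes the llama_batch layout from llama.h (token/pos/seq_id arrays); tok_a and tok_b are placeholder token IDs, and the context must have been created with n_seq_max >= 2:

```c
// Sketch: one llama_batch carrying tokens for two independent sequences.
llama_batch batch = llama_batch_init(/*n_tokens_alloc=*/512,
                                     /*embd=*/0, /*n_seq_max=*/2);

// Each token names the sequence it belongs to; each sequence keeps
// its own position numbering within the KV cache.
batch.token[0]    = tok_a; batch.pos[0] = 0;
batch.n_seq_id[0] = 1;     batch.seq_id[0][0] = 0; // sequence 0, position 0

batch.token[1]    = tok_b; batch.pos[1] = 0;
batch.n_seq_id[1] = 1;     batch.seq_id[1][0] = 1; // sequence 1, position 0

batch.n_tokens = 2;
```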
Compute Graph Scheduling
The context internally creates a backend scheduler (ggml_backend_sched) that manages the distribution of compute operations across available backends. When the model has layers on both CPU and GPU, the scheduler handles:
- Partitioning the compute graph so that each operation runs on the backend where its input tensors reside
- Inserting data copy operations where tensors must be transferred between backends
- Managing scratch buffers for intermediate computation results
Related Pages
- Implementation:Ggml_org_Llama_cpp_Llama_Init_From_Model
- Principle:Ggml_org_Llama_cpp_GGUF_Model_Loading -- model must be loaded before creating a context
- Principle:Ggml_org_Llama_cpp_Batch_Decoding -- the context is used for batch decoding
- Heuristic:Ggml_org_Llama_cpp_Context_Size_Alignment