
Principle: ggml-org/llama.cpp Batch Decoding

From Leeroopedia
Knowledge Sources: ggml-org/llama.cpp
Domains: Transformer Forward Pass, Self-Attention, Feed-Forward Networks, Logit Generation, KV Cache
Last Updated: 2026-02-14

Overview

Description

Batch Decoding is the computational core of the llama.cpp text generation pipeline. It performs a forward pass through the transformer model for a batch of input tokens, computing self-attention across all layers, applying feed-forward transformations, and producing output logits (unnormalized probability scores) over the vocabulary for each specified output position. These logits are then consumed by the sampling stage to select the next token.

The "batch" in batch decoding refers to the ability to process multiple tokens in a single forward pass. During prompt processing (prefill), the entire prompt can be submitted as one batch, enabling parallelized computation across all prompt tokens. During autoregressive generation, each step processes a single new token but can update the KV cache and produce logits efficiently by reusing cached keys and values from previous steps.

Usage

Batch decoding is called in a loop during text generation. The first iteration processes the full prompt; each subsequent iteration processes the single most recently sampled token.

// Process the full prompt in a single batch (prefill)
llama_batch batch = llama_batch_get_one(prompt_tokens, n_prompt_tokens);
if (llama_decode(ctx, batch) != 0) {
    fprintf(stderr, "llama_decode() failed on the prompt\n");
    return 1;
}

// Autoregressive generation loop: sample a token, then decode it
for (int i = 0; i < n_predict; i++) {
    llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
    if (llama_vocab_is_eog(vocab, new_token)) break;

    batch = llama_batch_get_one(&new_token, 1);
    if (llama_decode(ctx, batch) != 0) {
        fprintf(stderr, "llama_decode() failed at step %d\n", i);
        break;
    }
}

Theoretical Basis

The Transformer Forward Pass

A transformer model processes tokens through a stack of identical layers. Each layer consists of two main sub-layers:

1. Multi-Head Self-Attention

Self-attention allows each token to gather information from all other tokens in the sequence. For each token position, the mechanism computes:

  • Query (Q) = input x W_Q -- what this token is looking for
  • Key (K) = input x W_K -- what this token offers to other tokens
  • Value (V) = input x W_V -- the information this token contributes

The scaled dot-product attention over queries Q, keys K, and values V is:

Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V

where d_k is the dimension of the key vectors. The softmax ensures attention weights sum to 1, and the sqrt(d_k) scaling prevents the dot products from growing too large in magnitude.
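The formula above can be made concrete with a minimal single-head sketch. This is illustrative only and unrelated to llama.cpp's actual kernels, which are batched, masked, and heavily vectorized; the `attend` function and `Vec` alias are hypothetical names for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<float>;

// Scaled dot-product attention for one query against all keys/values.
Vec attend(const Vec& q, const std::vector<Vec>& K, const std::vector<Vec>& V) {
    const size_t n = K.size(), d = q.size();
    std::vector<float> w(n);
    float max_s = -1e30f;
    for (size_t j = 0; j < n; ++j) {
        float s = 0.0f;
        for (size_t t = 0; t < d; ++t) s += q[t] * K[j][t];   // q . k_j
        w[j] = s / std::sqrt((float)d);                       // scale by sqrt(d_k)
        max_s = std::max(max_s, w[j]);
    }
    float denom = 0.0f;
    for (float& x : w) { x = std::exp(x - max_s); denom += x; } // stable softmax
    Vec out(V[0].size(), 0.0f);
    for (size_t j = 0; j < n; ++j)
        for (size_t t = 0; t < out.size(); ++t)
            out[t] += (w[j] / denom) * V[j][t];               // weighted sum of values
    return out;
}
```

When all keys are identical the softmax weights are uniform, so the output is just the average of the values, which makes the weighted-sum structure easy to check by hand.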

Multi-head attention runs this computation in parallel across multiple "heads", each with its own learned projection matrices. The outputs are concatenated and projected back to the model dimension.

Grouped-Query Attention (GQA): Modern models like LLaMA use fewer key-value heads than query heads (e.g., 8 KV heads for 32 query heads). Multiple query heads share the same key-value head, reducing KV cache memory by the grouping factor while maintaining model quality.
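The sharing pattern can be sketched as a simple index mapping: with n_head query heads and n_head_kv key-value heads, each run of n_head / n_head_kv consecutive query heads reads the same KV head. The `kv_head_for` helper is a hypothetical name for this illustration, not a llama.cpp function.

```cpp
#include <cassert>

// Map a query head to the KV head it shares under GQA.
int kv_head_for(int query_head, int n_head, int n_head_kv) {
    int group = n_head / n_head_kv;   // e.g. 32 / 8 = 4 query heads per KV head
    return query_head / group;
}
```

With 32 query heads and 8 KV heads, query heads 0-3 all read KV head 0, heads 4-7 read KV head 1, and so on, which is exactly where the 4x KV cache memory saving comes from.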

2. Feed-Forward Network (FFN)

After attention, each token's representation passes through a position-wise feed-forward network, typically using the SwiGLU activation:

FFN(x) = (SiLU(x W_gate) ⊙ (x W_up)) W_down

where ⊙ denotes elementwise multiplication. This nonlinear transformation is applied independently to each token position: the gate and up projections expand the representation to a wider intermediate dimension, and the down projection maps it back to the model dimension.
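A minimal single-token SwiGLU sketch, assuming tiny hypothetical weight matrices (the `ffn` and `silu` names and the row-major `Mat` layout are inventions for this example, not llama.cpp code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;   // row-major: Mat[i][j]

float silu(float v) { return v / (1.0f + std::exp(-v)); }

// SwiGLU FFN for one token: y = (SiLU(x W_gate) * (x W_up)) W_down,
// where the middle product is elementwise.
Vec ffn(const Vec& x, const Mat& W_gate, const Mat& W_up, const Mat& W_down) {
    const size_t d = x.size(), d_ff = W_gate[0].size();
    Vec h(d_ff, 0.0f);
    for (size_t j = 0; j < d_ff; ++j) {
        float g = 0.0f, u = 0.0f;
        for (size_t i = 0; i < d; ++i) { g += x[i] * W_gate[i][j]; u += x[i] * W_up[i][j]; }
        h[j] = silu(g) * u;                 // gated expansion to the wider dim
    }
    Vec y(x.size(), 0.0f);
    for (size_t i = 0; i < y.size(); ++i)
        for (size_t j = 0; j < d_ff; ++j)
            y[i] += h[j] * W_down[j][i];    // project back to the model dim
    return y;
}
```

Note the gating: when the gate pre-activation is strongly negative, SiLU drives that intermediate channel toward zero regardless of the up projection.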

Layer Normalization: Each sub-layer is preceded (pre-norm) or followed (post-norm) by RMSNorm, which normalizes the hidden states to stabilize training and inference.

KV Cache and Incremental Decoding

During autoregressive generation, the model produces one token at a time. Naive recomputation would require running the full sequence through every layer at each step, giving O(n^2 * L) total work for generating n tokens across L layers.

The KV cache optimization works as follows:

  1. Prompt processing (prefill): Compute Q, K, V for all prompt tokens. Store K and V in the cache. Use Q, K, V to compute attention and produce logits for the last position.
  2. Token generation (decode): For each new token, compute only that token's Q, K, V. Append the new K, V to the cache. Compute attention using the new Q against all cached K, V values.

This reduces per-step computation from O(n * d^2) (for recomputing all projections) to O(d^2 + n * d) (one projection plus one attention pass). The KV cache trades memory for computation.
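The prefill/decode contract can be sketched with a toy per-layer cache. This is a conceptual sketch only; llama.cpp's real KV cache additionally manages slots, sequence IDs, and defragmentation, and the `KVCache` struct here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Toy per-layer KV cache: prefill appends K/V for every prompt token,
// decode appends one entry per generated token. Attention for a new query
// then runs against everything stored, so past projections are never redone.
struct KVCache {
    std::vector<Vec> K, V;                 // one row per cached position
    void append(const Vec& k, const Vec& v) { K.push_back(k); V.push_back(v); }
    std::size_t n_past() const { return K.size(); }
};
```

A two-token prompt followed by one decode step leaves three cached positions; the next decode step computes one new Q/K/V triple and attends over all three.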

Batch Processing and Micro-Batches

The llama_batch structure supports flexible batch configurations:

  • Token batch: A list of token IDs to process (most common)
  • Embedding batch: Pre-computed embedding vectors (for specialized use cases)
  • Position tracking: Each token can have an explicit position, or positions can be tracked automatically
  • Sequence IDs: Each token can belong to one or more sequences, enabling parallel generation within a single decode call
  • Output selection: The logits mask controls which positions produce output logits, avoiding unnecessary computation

Internally, a large batch may be split into micro-batches (ubatches) bounded by the context's n_ubatch parameter. Each micro-batch is processed independently through the compute graph, keeping peak memory usage bounded regardless of the total batch size.
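The bounded-memory splitting idea can be sketched as follows. This is a simplification: llama.cpp's actual splitter also accounts for sequence membership and token-vs-embedding batches, and `split_ubatches` is a hypothetical helper, not part of the API.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Split a logical batch of n_tokens into micro-batches of at most n_ubatch
// tokens, returned as (offset, length) spans processed one at a time.
std::vector<std::pair<int, int>> split_ubatches(int n_tokens, int n_ubatch) {
    std::vector<std::pair<int, int>> spans;
    for (int off = 0; off < n_tokens; off += n_ubatch)
        spans.push_back({off, std::min(n_ubatch, n_tokens - off)});
    return spans;
}
```

For example, a 10-token batch with n_ubatch = 4 yields three micro-batches of 4, 4, and 2 tokens; peak activation memory depends on n_ubatch, not on the total batch size.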

Return Codes and Error Handling

The decode function uses a structured return code system:

  • 0 -- success
  • 1 -- could not find a KV slot for the batch (try reducing batch size or increasing context size)
  • 2 -- aborted (by the abort callback; processed micro-batches remain in memory)
  • -1 -- invalid input batch
  • < -1 -- fatal error (processed micro-batches remain in memory)

On success or a KV slot warning (code 1), the memory state is consistent. On abort or fatal error, partially processed micro-batches remain in the KV cache and must be accounted for using llama_memory_seq_pos_min() and llama_memory_seq_pos_max().
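The return-code contract above suggests a dispatch like the following. `classify_decode_result` is a hypothetical helper written for this page, not a llama.cpp function; only the code values themselves come from the table.

```cpp
#include <cassert>
#include <string>

// Map a decode return code to an action category, following the table above.
std::string classify_decode_result(int ret) {
    if (ret == 0)  return "success";
    if (ret == 1)  return "no_kv_slot";    // retry: smaller batch or larger context
    if (ret == 2)  return "aborted";       // processed micro-batches remain in memory
    if (ret == -1) return "invalid_batch";
    if (ret < -1)  return "fatal";         // processed micro-batches remain in memory
    return "unknown";                      // positive codes above 2 are not defined
}
```

In a real caller, the "aborted" and "fatal" branches are where the sequence position bounds would be re-queried before continuing.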

Positional Encoding (RoPE)

Transformer models need position information because self-attention is permutation-invariant. llama.cpp uses Rotary Position Embedding (RoPE), which encodes absolute position by rotating the query and key vectors in pairs of dimensions:

RoPE(x, pos) = x * cos(pos * theta) + rotate(x) * sin(pos * theta)

where theta varies by dimension according to a geometric sequence. RoPE has the property that the dot product between two rotated vectors depends only on their relative position, not their absolute positions. This makes it compatible with KV caching and allows context extension techniques like YaRN.
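The relative-position property can be verified numerically on a single dimension pair. The `rope_pair` and `dot` helpers are illustrative names, and a fixed theta is assumed rather than the per-dimension geometric sequence.

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Rotate one (even, odd) dimension pair by angle pos * theta, as RoPE does.
std::pair<float, float> rope_pair(float x0, float x1, int pos, float theta) {
    float c = std::cos(pos * theta), s = std::sin(pos * theta);
    return { x0 * c - x1 * s, x0 * s + x1 * c };
}

float dot(std::pair<float, float> a, std::pair<float, float> b) {
    return a.first * b.first + a.second * b.second;
}
```

Shifting both positions by the same offset leaves the dot product unchanged, because the rotations compose into a single rotation by the position difference; this is why cached keys stay valid as generation advances.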
