Principle: ggml-org/llama.cpp Response Generation
| Aspect | Detail |
|---|---|
| Principle Name | Response Generation |
| Category | Inference |
| Workflow | Interactive_Chat |
| Applies To | llama.cpp |
| Status | Active |
Overview
Description
Response Generation is the principle of autoregressive token generation: the iterative decode-sample-output loop that produces the model's response one token at a time. In this process, the model first processes the input prompt (all tokens at once), then enters a loop where it decodes the current state, samples the next token from the output logits, checks whether the token signals end-of-generation, and if not, converts it to text, outputs it to the user, and feeds it back as input for the next iteration.
Usage
Response generation is the core inner loop of any chat application. It is invoked once per assistant turn, after the user's message has been formatted through the chat template and tokenized. The loop runs until the model produces an end-of-generation (EOG) token or the context window is exhausted. The accumulated output tokens form the assistant's response, which is then added to the conversation history.
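The per-turn flow described above can be sketched with a toy conversation history. All names here (toy_message, toy_assistant_turn, the placeholder "model") are hypothetical stand-ins, not llama.cpp API; the point is only that the accumulated response is appended to the history before the next turn.

```cpp
#include <string>
#include <vector>

struct toy_message { std::string role, content; };

// Hypothetical stand-in for the full decode/sample loop: in a real
// application this would run the generation loop described below.
static std::string toy_generate_response(const std::vector<toy_message> &history) {
    return "echo: " + history.back().content;  // placeholder "model"
}

// One assistant turn: add the user's message, generate a response conditioned
// on the whole history, and append the response so the next turn sees it.
void toy_assistant_turn(std::vector<toy_message> &history, const std::string &user_msg) {
    history.push_back({"user", user_msg});
    history.push_back({"assistant", toy_generate_response(history)});
}
```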
Theoretical Basis
Autoregressive generation: Transformer-based language models generate text autoregressively, meaning each new token is conditioned on all previously generated tokens plus the original prompt. The model does not "plan" its response in advance; instead, it makes a local decision at each step about which token to produce next, based on the probability distribution over the vocabulary.
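A minimal sketch of the autoregressive property, using a trivial toy "model" (all names hypothetical): each new token is a function of the entire preceding sequence, and each output is fed back before the next step.

```cpp
#include <vector>

// Toy stand-in for a language model: the "next token" is a pure function of
// everything generated so far (prompt + previous outputs). Not llama.cpp API.
static int toy_next_token(const std::vector<int> &context) {
    // Trivial rule: next token = (sum of context) % 5. The rule itself is
    // meaningless; what matters is that it reads the full context.
    int sum = 0;
    for (int t : context) sum += t;
    return sum % 5;
}

// Autoregressive loop: each sampled token is appended to the context before
// the next step, so later tokens are conditioned on earlier ones.
std::vector<int> toy_generate(std::vector<int> context, int n_steps) {
    for (int i = 0; i < n_steps; ++i) {
        context.push_back(toy_next_token(context));
    }
    return context;
}
```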
The generation process consists of two distinct phases:
Phase 1 -- Prompt processing (prefill): All tokens of the input prompt are processed in a single batch via llama_decode. This populates the KV cache with the hidden states for all prompt positions. The batch is created using llama_batch_get_one, which packages a token array into a batch structure with automatic position tracking.
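The essence of prefill can be sketched with a toy KV cache (a hypothetical stand-in, not the real data structure): the whole prompt is submitted as one batch, and positions 0..n-1 are assigned automatically, mirroring the position tracking that llama_batch_get_one provides.

```cpp
#include <utility>
#include <vector>

// Toy KV cache: one (position, token) entry per processed slot. Prefill
// populates ALL prompt positions in a single batch rather than looping
// token by token.
std::vector<std::pair<int, int>> toy_prefill(const std::vector<int> &prompt) {
    std::vector<std::pair<int, int>> kv_cache;
    for (int pos = 0; pos < (int) prompt.size(); ++pos) {
        kv_cache.push_back({pos, prompt[pos]});
    }
    return kv_cache;  // positions 0..n-1 now filled
}
```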
Phase 2 -- Token-by-token generation (decode): After the prompt is processed, the loop enters the incremental generation phase:
1. Decode: Call llama_decode with a single-token batch containing the most recently sampled token. This extends the KV cache by one position and produces logits for the next token.
2. Sample: Call llama_sampler_sample with the sampler chain and the context. The sampler chain applies its configured filtering and selection strategies to the logits and returns a token ID.
3. Check EOG: Call llama_vocab_is_eog to determine if the sampled token is an end-of-generation marker (such as EOS, EOT, or template-specific end tokens). If so, exit the loop.
4. Convert and output: Call llama_token_to_piece to convert the token ID to its UTF-8 text representation. Print it immediately (with fflush) for streaming output, and append it to the response buffer.
5. Prepare next batch: Create a new single-token batch with the sampled token and return to step 1.
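The steps above can be sketched as a self-contained loop. The toy_* functions here are simplified stand-ins for the llama.cpp calls named in the steps (whose real signatures take context, vocab, and sampler handles); only the control flow is meant to be faithful.

```cpp
#include <cstdio>
#include <string>
#include <vector>

using llama_token = int;
struct toy_batch { std::vector<llama_token> tokens; };

// Toy stand-ins for llama_batch_get_one / llama_decode / llama_sampler_sample
// / llama_vocab_is_eog / llama_token_to_piece, so the sketch compiles alone.
static toy_batch toy_batch_get_one(llama_token tok) { return { { tok } }; }
static std::vector<float> toy_decode(const toy_batch &batch) {
    // Pretend-logits: favor "last token + 1"; token 4 acts as end-of-generation.
    std::vector<float> logits(5, 0.0f);
    logits[(batch.tokens.back() + 1) % 5] = 1.0f;
    return logits;
}
static llama_token toy_sample(const std::vector<float> &logits) {
    int best = 0;  // greedy: pick the highest logit
    for (size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best]) best = (int) i;
    return best;
}
static bool toy_is_eog(llama_token tok) { return tok == 4; }
static std::string toy_token_to_piece(llama_token tok) { return std::to_string(tok); }

// The decode -> sample -> check EOG -> convert/output -> next-batch loop.
std::string toy_chat_turn(llama_token first_tok, int max_tokens) {
    std::string response;
    toy_batch batch = toy_batch_get_one(first_tok);
    for (int i = 0; i < max_tokens; ++i) {
        std::vector<float> logits = toy_decode(batch);  // 1. decode
        llama_token tok = toy_sample(logits);           // 2. sample
        if (toy_is_eog(tok)) break;                     // 3. check EOG
        std::string piece = toy_token_to_piece(tok);    // 4. convert
        fputs(piece.c_str(), stdout);                   //    stream it
        fflush(stdout);
        response += piece;
        batch = toy_batch_get_one(tok);                 // 5. next batch
    }
    return response;
}
```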
Streaming output: Chat applications display tokens as they are generated rather than waiting for the complete response. This is achieved by calling printf and fflush(stdout) after each token conversion. This provides the user with immediate feedback and a sense of the model "thinking in real time."
First-token detection: The generation function must determine whether the current prompt is the first in the conversation (for proper BOS token handling). This is done by checking llama_memory_seq_pos_max(memory, 0) == -1, which returns -1 when no tokens have been stored in the KV cache for sequence 0.
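A minimal sketch of this check, with a toy stand-in (hypothetical names) for llama_memory_seq_pos_max: the highest stored position for an empty sequence is reported as -1.

```cpp
#include <vector>

// Toy stand-in: returns the highest stored position for a sequence, or -1
// when nothing has been stored yet (as the text describes for sequence 0).
static int toy_seq_pos_max(const std::vector<int> &kv_positions) {
    return kv_positions.empty() ? -1 : kv_positions.back();
}

// First prompt of the conversation <=> no tokens stored yet for the sequence,
// which is when BOS handling applies.
bool toy_is_first_prompt(const std::vector<int> &kv_positions) {
    return toy_seq_pos_max(kv_positions) == -1;
}
```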
Context overflow: Before each decode call, the loop checks whether there is sufficient remaining space in the context window. If n_ctx_used + batch.n_tokens > n_ctx, the context is full and generation must stop. More sophisticated applications may implement context shifting or pruning strategies.
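The overflow guard reduces to a single comparison, sketched here with hypothetical names matching the variables in the text:

```cpp
// Pre-decode guard: true when the pending batch would not fit in the
// remaining context window, i.e. n_ctx_used + batch.n_tokens > n_ctx.
bool toy_would_overflow(int n_ctx_used, int n_batch_tokens, int n_ctx) {
    return n_ctx_used + n_batch_tokens > n_ctx;
}
```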