Principle: Autoregressive Generation (ggml)
Summary
Autoregressive text generation is the process of generating text one token at a time by conditioning each new token on all previously generated tokens. In a language model, this corresponds to factoring the joint distribution as a product of conditionals:
P(x_1, x_2, ..., x_T) = P(x_1) * P(x_2 | x_1) * ... * P(x_T | x_1, ..., x_{T-1})
At each step t, the model predicts P(x_t | x_1, ..., x_{t-1}), a probability distribution over the entire vocabulary, and a sampling strategy selects the next token from that distribution.
The Generation Loop
The core autoregressive loop follows these stages:
- Encode the prompt -- Tokenize the input text into a sequence of token IDs and feed them through the model to populate internal state.
- Build the computation graph -- For each generation step, construct the forward-pass graph that maps the current token(s) and cached state to output logits.
- Compute -- Execute the graph on the chosen backend (CPU, GPU, etc.).
- Extract logits -- Read the output logits (unnormalized log-probabilities) for the last position, which form a vector of size n_vocab.
- Sample a token -- Apply a sampling strategy (greedy, top-k, top-p, temperature, etc.) to select the next token ID.
- Append to context -- Add the newly sampled token to the running context and advance the position counter.
- Repeat from the graph-building stage until a stopping condition is met.
KV Cache
Key-Value (KV) caching is the primary optimization that makes autoregressive generation practical. During the attention computation, the key and value projections for all previously processed positions are stored in a cache. On each subsequent step, only the new position's projections need to be computed; the cached keys and values are reused. This reduces each step from a full forward pass over all T positions to an incremental computation over the single new token, while the attention operation itself still reads from the full cached sequence.
Stopping Conditions
Generation terminates when any of the following conditions is satisfied:
- Maximum token count -- A pre-configured limit on the number of tokens to generate has been reached.
- End-of-text token -- The model emits a special end-of-sequence token (e.g., <|endoftext|> in GPT-2), signalling that the model considers the sequence complete.