
Principle:Ggml org Ggml Autoregressive Generation

From Leeroopedia



Summary

Autoregressive text generation is the process of generating text one token at a time by conditioning each new token on all previously generated tokens. In a language model, this corresponds to factoring the joint distribution as a product of conditionals:

P(x_1, x_2, ..., x_T) = P(x_1) * P(x_2 | x_1) * ... * P(x_T | x_1, ..., x_{T-1})

At each step t, the model predicts P(x_t | x_1, ..., x_{t-1}), a probability distribution over the entire vocabulary, and a sampling strategy selects the next token from that distribution.
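The chain-rule factorization above can be checked numerically. A minimal sketch, with purely illustrative per-step conditional probabilities (not from any real model):

```python
import math

# Hypothetical conditionals for a 3-token sequence:
# P(x1), P(x2|x1), P(x3|x1,x2). Values are illustrative only.
conditionals = [0.5, 0.4, 0.25]

# The joint probability is the product of the conditionals (chain rule).
joint = math.prod(conditionals)

# In practice, log-probabilities are summed instead of multiplying
# raw probabilities, to avoid numerical underflow on long sequences.
log_joint = sum(math.log(p) for p in conditionals)

print(joint)                # 0.05
print(math.exp(log_joint))  # same value, via log-space accumulation
```

Summing in log space gives the same joint probability while staying numerically stable for sequences of thousands of tokens.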

The Generation Loop

The core autoregressive loop follows these stages:

  1. Encode the prompt -- Tokenize the input text into a sequence of token IDs and feed them through the model to populate internal state.
  2. Build the computation graph -- For each generation step, construct the forward-pass graph that maps the current token(s) and cached state to output logits.
  3. Compute -- Execute the graph on the chosen backend (CPU, GPU, etc.).
  4. Extract logits -- Read the output logits (unnormalized log-probabilities) for the last position, which form a vector of size n_vocab.
  5. Sample a token -- Apply a sampling strategy (greedy, top-k, top-p, temperature, etc.) to select the next token ID.
  6. Append to context -- Add the newly sampled token to the running context and advance the position counter.
  7. Repeat from step 2 until a stopping condition is met.
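The loop above can be sketched as plain Python, with a toy model standing in for the real graph build and compute steps; `toy_model`, `greedy_sample`, the vocabulary, and the token IDs are all illustrative, not ggml's API:

```python
VOCAB = ["the", "cat", "sat", "<eos>"]

def toy_model(context):
    # Stand-in for steps 2-4: build the graph, compute, extract logits.
    # A real model would condition on the full context (or a KV cache);
    # these fake logits just make the example deterministic.
    return [float(len(context) + i) for i in range(len(VOCAB))]

def greedy_sample(logits):
    # Step 5 with the simplest strategy: pick the highest-logit token.
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt_ids, max_new_tokens=8, eos_id=3):
    context = list(prompt_ids)          # step 1: encoded prompt
    for _ in range(max_new_tokens):     # step 7: repeat until stopping
        logits = toy_model(context)     # steps 2-4: forward pass -> logits
        next_id = greedy_sample(logits) # step 5: sample a token
        context.append(next_id)         # step 6: append to context
        if next_id == eos_id:           # stopping condition
            break
    return context

print(generate([0, 1]))  # [0, 1, 3] -- the toy logits favor <eos>
```

The structure is the same in a real engine; only `toy_model` is replaced by an actual forward pass over the computation graph.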

KV Cache

Key-Value (KV) caching is the primary optimization that makes autoregressive generation practical. During the attention computation, the key and value projections for all previously processed positions are stored in a cache. Without caching, every step would re-run the forward pass over the entire sequence, recomputing keys and values for all prior positions. With the cache, each step computes projections only for the single new token and reuses the stored keys and values; the attention operation itself still reads from the full cached sequence.
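A minimal sketch of the caching idea, using single-head attention over plain Python lists; the names, shapes, and `KVCache` class are illustrative, not ggml's actual cache layout:

```python
import math

def attend(q, cached_keys, cached_values):
    # Attention for ONE new query vector against all cached positions:
    # scaled dot-product scores, softmax, then a weighted sum of values.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
              for k in cached_keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(cached_values[0])
    return [sum(w * v[j] for w, v in zip(weights, cached_values))
            for j in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, k, v, q):
        # Compute and store K/V for the NEW position only, then attend
        # over the full cached sequence -- no recomputation of old positions.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

cache = KVCache()
out1 = cache.step([1.0, 0.0], [1.0, 1.0], [1.0, 0.0])  # position 0
out2 = cache.step([0.0, 1.0], [2.0, 2.0], [0.0, 1.0])  # position 1 reuses cached K/V
print(len(cache.keys))  # 2 positions cached
```

Each call to `step` does O(1) projection work for the new token, while `attend` still scans all cached positions, matching the cost profile described above.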

Stopping Conditions

Generation terminates when any of the following conditions is satisfied:

  • Maximum token count -- A pre-configured limit on the number of tokens to generate has been reached.
  • End-of-text token -- The model emits a special end-of-sequence token (e.g., <|endoftext|> in GPT-2), signalling that the model considers the sequence complete.
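The two conditions can be combined into one check, as in this sketch; the token IDs are made up, though 50256 is GPT-2's actual `<|endoftext|>` ID:

```python
EOS_ID = 50256  # GPT-2's <|endoftext|> token ID

def should_stop(generated_ids, max_new_tokens):
    # Maximum token count: the pre-configured limit has been reached.
    if len(generated_ids) >= max_new_tokens:
        return True
    # End-of-text token: the model just emitted the EOS token.
    if generated_ids and generated_ids[-1] == EOS_ID:
        return True
    return False

print(should_stop([11, 22, 33], max_new_tokens=3))  # True (limit reached)
print(should_stop([11, EOS_ID], max_new_tokens=8))  # True (EOS emitted)
print(should_stop([11, 22], max_new_tokens=8))      # False (keep generating)
```

A real engine typically also supports user-supplied stop strings, but the two conditions above are the core ones.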

