
Principle: ggml-org/llama.cpp Token Sampling

From Leeroopedia
Knowledge Sources: ggml-org/llama.cpp
Domains: Sampling Strategies, Probability Theory, Temperature Scaling, Top-k, Top-p, Min-p, Mirostat
Last Updated: 2026-02-14

Overview

Description

Token Sampling is the final step in the llama.cpp text generation pipeline that selects the next token from the probability distribution produced by the transformer's forward pass. After llama_decode computes raw logits (unnormalized log-probabilities) over the entire vocabulary for a given position, a sampling strategy must decide which single token to output. The choice of sampling strategy profoundly affects the quality, creativity, coherence, and determinism of generated text.

llama.cpp implements a sampler chain architecture where multiple sampling operations are composed sequentially. Each sampler in the chain transforms the candidate token distribution (filtering, reweighting, or selecting) before passing it to the next. This modular design allows users to combine strategies freely -- for example, applying temperature scaling, then top-k filtering, then top-p filtering, then making a final probabilistic selection.

Usage

Sampling is performed after each llama_decode call to select the next token. The sampler chain is initialized once and reused across all generation steps.

// Initialize sampler chain
auto sparams = llama_sampler_chain_default_params();
llama_sampler * smpl = llama_sampler_chain_init(sparams);

// Add samplers to the chain (order matters)
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(42));

// Sample a token after each decode call
llama_token token = llama_sampler_sample(smpl, ctx, -1);

// ... generation loop ...

// Free the chain (and all samplers in it) when done
llama_sampler_free(smpl);

Theoretical Basis

From Logits to Probabilities

The transformer outputs a vector of raw logits z = [z_1, z_2, ..., z_V] where V is the vocabulary size. These logits are converted to a probability distribution using the softmax function:

P(token_i) = exp(z_i) / sum_j(exp(z_j))

The sampling strategies below operate either on the logits (before softmax) or on the probabilities (after softmax), filtering and reshaping the distribution before a final token is drawn.
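The softmax conversion above can be sketched in a few lines of standalone C++. This is an illustrative sketch, not llama.cpp's implementation; it uses the standard numerical-stability trick of subtracting the maximum logit before exponentiating, which leaves the result unchanged but avoids overflow.

```cpp
#include <vector>
#include <cmath>
#include <algorithm>

// Numerically stable softmax: subtract the max logit before exp()
// so the largest exponent is 0 and nothing overflows.
std::vector<float> softmax(const std::vector<float> & logits) {
    float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> probs(logits.size());
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp(logits[i] - max_logit);
        sum += probs[i];
    }
    for (float & p : probs) {
        p /= sum; // normalize so the probabilities sum to 1
    }
    return probs;
}
```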

Greedy Sampling

Definition: Always select the token with the highest probability.

token = argmax_i(P(token_i))

Properties:

  • Fully deterministic -- the same input always produces the same output
  • Produces the locally most probable sequence at each step (but not necessarily the globally most probable sequence)
  • Tends to produce repetitive, "safe" text that lacks variety
  • Useful for factual tasks where accuracy matters more than creativity

In llama.cpp: llama_sampler_init_greedy()
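As a sketch (not the llama.cpp implementation), greedy selection is just an argmax. Because softmax is monotonic, taking the argmax over raw logits gives the same token as taking it over probabilities, so no normalization is needed:

```cpp
#include <vector>
#include <algorithm>

// Greedy selection: index of the highest logit. Softmax is monotonic,
// so argmax over logits equals argmax over probabilities.
int greedy_sample(const std::vector<float> & logits) {
    return (int)(std::max_element(logits.begin(), logits.end()) - logits.begin());
}
```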

Temperature Sampling

Definition: Scale the logits by a temperature parameter T before applying softmax.

P(token_i | T) = exp(z_i / T) / sum_j(exp(z_j / T))

Properties:

  • T = 1.0 -- no change; use the model's original distribution
  • T < 1.0 -- "sharpens" the distribution, making high-probability tokens more dominant (approaches greedy as T approaches 0)
  • T > 1.0 -- "flattens" the distribution, giving lower-probability tokens more chance (approaches uniform as T approaches infinity)
  • T <= 0.0 -- special case in llama.cpp: the maximum logit is preserved, all others are set to negative infinity (equivalent to greedy)

In llama.cpp: llama_sampler_init_temp(float t)
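A minimal sketch of temperature scaling, including the T <= 0 special case described above (keep the maximum logit, suppress everything else). This is illustrative code, not llama.cpp's implementation:

```cpp
#include <vector>
#include <cmath>

// Scale logits in place by temperature T before softmax.
// T <= 0 degenerates to greedy: only the max logit survives.
void apply_temperature(std::vector<float> & logits, float T) {
    if (T <= 0.0f) {
        size_t best = 0;
        for (size_t i = 1; i < logits.size(); ++i) {
            if (logits[i] > logits[best]) best = i;
        }
        for (size_t i = 0; i < logits.size(); ++i) {
            if (i != best) logits[i] = -INFINITY; // zero probability after softmax
        }
        return;
    }
    for (float & z : logits) {
        z /= T; // T < 1 sharpens, T > 1 flattens the distribution
    }
}
```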

Dynamic temperature (entropy-based) adjusts T based on the entropy of the current distribution: a peaked (confident) distribution gets a lower temperature and a flat (uncertain) one a higher temperature, with T varying within the range [t - delta, t + delta]:

In llama.cpp: llama_sampler_init_temp_ext(float t, float delta, float exponent)

Top-k Sampling

Definition: Keep only the k tokens with the highest probabilities; set all others to zero probability.

Algorithm:

  1. Sort tokens by probability in descending order
  2. Keep the top k tokens
  3. Renormalize the remaining probabilities to sum to 1
  4. Sample from the truncated distribution

Properties:

  • Prevents sampling from the long tail of low-probability tokens that are often nonsensical
  • The fixed cutoff k is independent of the distribution shape, which can be too restrictive (cutting off valid tokens when the distribution is flat) or too permissive (keeping many irrelevant tokens when the distribution is peaked)
  • Setting k <= 0 disables this sampler (noop)

In llama.cpp: llama_sampler_init_top_k(int32_t k)

Reference: "Hierarchical Neural Story Generation" (Fan et al., 2018)

Top-p (Nucleus) Sampling

Definition: Keep the smallest set of tokens whose cumulative probability exceeds a threshold p.

Algorithm:

  1. Sort tokens by probability in descending order
  2. Compute the cumulative sum of probabilities
  3. Find the smallest set of tokens whose cumulative probability >= p
  4. Renormalize and sample from this set

Properties:

  • Adapts to the distribution shape: when the model is confident (peaked distribution), fewer tokens are kept; when uncertain (flat distribution), more tokens are included
  • p = 1.0 -- no filtering (all tokens kept)
  • p = 0.0 -- only the top token (equivalent to greedy)
  • Typically more robust than fixed top-k because the effective vocabulary size varies per step

In llama.cpp: llama_sampler_init_top_p(float p, size_t min_keep)

Reference: "The Curious Case of Neural Text Degeneration" (Holtzman et al., 2019)

Min-p Sampling

Definition: Keep all tokens whose probability is at least a fraction p of the maximum token's probability.

Algorithm:

  1. Find the maximum probability p_max
  2. Compute threshold = p * p_max
  3. Keep all tokens with probability >= threshold
  4. Renormalize and sample

Properties:

  • Adapts to distribution shape like top-p, but with a simpler and more intuitive threshold
  • When the model is very confident (one dominant token), most tokens are filtered out
  • When the model is uncertain (flat distribution), more tokens survive
  • Often produces higher quality outputs than top-p for the same level of diversity

In llama.cpp: llama_sampler_init_min_p(float p, size_t min_keep)

Mirostat Sampling

Definition: An adaptive sampling algorithm that maintains a target surprise (cross-entropy) level across the generated sequence.

Core idea: Instead of fixing a static probability threshold, Mirostat dynamically adjusts how many tokens to consider at each step to maintain a consistent level of "surprise" (information content) per token. This produces text with more consistent perplexity.

Algorithm (Mirostat 2.0):

  1. Set target surprise tau (e.g., 5.0) and learning rate eta (e.g., 0.1)
  2. Initialize mu = 2 * tau
  3. At each step:
    1. Sort tokens by probability
    2. Find the top-k tokens where k is derived from mu
    3. Sample a token from these top-k tokens
    4. Compute the surprise of the sampled token: s = -log2(P(token))
    5. Update mu: mu = mu - eta * (s - tau)

Properties:

  • Automatically adapts the effective vocabulary size to maintain consistent text quality
  • Higher tau produces more creative/surprising text; lower tau produces more predictable text
  • The adaptive mechanism prevents the "repetition collapse" and "incoherence spike" failure modes common with static sampling

In llama.cpp:

  • Mirostat 1.0: llama_sampler_init_mirostat(int32_t n_vocab, uint32_t seed, float tau, float eta, int32_t m)
  • Mirostat 2.0: llama_sampler_init_mirostat_v2(uint32_t seed, float tau, float eta)

Reference: "Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity" (Basu et al., 2020)
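The heart of Mirostat is the feedback update in steps 3.4 and 3.5. As a sketch (not the llama.cpp implementation), one update step looks like:

```cpp
#include <cmath>

// One Mirostat 2.0 feedback step: given the probability of the token
// just sampled, nudge mu toward the target surprise tau.
float mirostat_update(float mu, float sampled_prob, float tau, float eta) {
    float surprise = -std::log2(sampled_prob); // observed surprise in bits
    return mu - eta * (surprise - tau);        // mu = mu - eta * (s - tau)
}
```

If the sampled token was less surprising than the target (s < tau), mu grows and the next step admits more candidates; if it was more surprising (s > tau), mu shrinks and the candidate pool tightens.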

Repetition Penalties

Penalty samplers modify logits to discourage repeating recently generated tokens:

  • Repeat penalty -- scales the logit of any token that appeared in the last N tokens by a penalty factor (positive logits are divided by the factor and negative logits multiplied by it, so values > 1.0 always reduce the probability of repetition)
  • Frequency penalty -- subtracts a value proportional to how many times a token appeared in the last N tokens
  • Presence penalty -- subtracts a fixed value for any token that appeared at least once in the last N tokens

In llama.cpp: llama_sampler_init_penalties(int32_t penalty_last_n, float penalty_repeat, float penalty_freq, float penalty_present)

The Sampler Chain Architecture

llama.cpp's sampler chain processes candidates through an ordered sequence of samplers. Each sampler either:

  • Filters -- removes candidates (top-k, top-p, min-p)
  • Transforms -- modifies logits/probabilities (temperature, penalties)
  • Selects -- picks the final token (greedy, dist, mirostat)

The chain must end with a selecting sampler. A typical chain ordering is:

  1. Repetition penalties (operate on full logit set)
  2. Top-k filtering (reduce candidate count)
  3. Top-p or min-p filtering (further refine candidates)
  4. Temperature scaling (adjust distribution shape)
  5. Final selection (greedy or probabilistic)

Additional Sampling Strategies

llama.cpp also provides:

  • Locally Typical Sampling -- selects tokens near the expected information content: llama_sampler_init_typical(float p, size_t min_keep)
  • XTC (Exclude Top Choices) -- probabilistically removes the most probable tokens above a threshold to boost diversity, as described in text-generation-webui: llama_sampler_init_xtc(float p, float t, size_t min_keep, uint32_t seed)
  • Top-n-sigma -- keeps tokens within n standard deviations of the mean logit: llama_sampler_init_top_n_sigma(float n)
  • DRY (Don't Repeat Yourself) -- advanced anti-repetition sampler: llama_sampler_init_dry(...)
  • Grammar-constrained sampling -- restricts output to tokens valid under a GBNF grammar: llama_sampler_init_grammar(...)
  • Logit bias -- manually adjust logits for specific tokens: llama_sampler_init_logit_bias(...)
  • Adaptive-p -- selects tokens near a configurable target probability with EMA adaptation: llama_sampler_init_adaptive_p(float target, float decay, uint32_t seed)
