Principle:Ggml org Ggml Token Sampling

Summary

Token sampling selects the next token from the model's probability distribution over the vocabulary. Top-k filtering, top-p (nucleus) filtering, and temperature scaling each control the trade-off between diversity and coherence in generated text.

Theory

Top-k Sampling

Restrict the candidate set to the k highest-probability tokens, then renormalize the distribution and sample. This prevents extremely low-probability tokens from being selected.

Keep only the k tokens with the highest logits
Set all other logits to negative infinity, which gives those tokens zero probability after softmax
Renormalize the remaining probabilities and sample
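
The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the ggml implementation; the names `top_k_filter` and `sample_from` are hypothetical helpers introduced here.

```python
import math
import random


def top_k_filter(logits, k):
    """Return {token_index: probability} over the k highest-logit tokens."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest logits; the stable sort breaks ties by lower index.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the kept logits only. Dropped tokens never enter the sum,
    # which is equivalent to setting their logits to negative infinity.
    peak = max(logits[i] for i in kept)
    weights = [math.exp(logits[i] - peak) for i in kept]
    total = sum(weights)
    return {i: w / total for i, w in zip(kept, weights)}


def sample_from(probs, rng=random):
    """Draw one token index from a {token_index: probability} mapping."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    filtered = top_k_filter([2.0, 1.0, 0.5, -1.0], k=2)
    print(filtered)               # only indices 0 and 1 remain
    print(sample_from(filtered))  # always 0 or 1
```

With the logits shown, only tokens 0 and 1 survive, with renormalized probabilities of roughly 0.731 and 0.269.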

Top-p / Nucleus Sampling

Restrict the candidate set to the smallest set of tokens whose cumulative probability is greater than or equal to p, then renormalize and sample.

Sort tokens by probability in descending order
Keep tokens until the cumulative probability >= p
Renormalize the remaining probabilities and sample
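
A minimal sketch of the nucleus filter described above, assuming the input is already a normalized probability list. The function name `top_p_filter` is hypothetical, and the code is illustrative rather than the ggml implementation.

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized probability} for the nucleus."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    # Visit tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # smallest prefix whose mass reaches p
    # Renormalize so the kept probabilities sum to 1.
    return {i: probs[i] / cumulative for i in kept}


if __name__ == "__main__":
    print(top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9))  # keeps indices 0, 1, 2
```

Note that the token that crosses the threshold is kept, so the retained mass is at least p and usually slightly more.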

Temperature Scaling

Sharpen or flatten the probability distribution by dividing logits by a temperature parameter T before applying softmax.

T < 1.0 sharpens the distribution (more deterministic)
T = 1.0 leaves the distribution unchanged
T > 1.0 flattens the distribution (more random / creative)
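
The effect of T can be seen with a small sketch. The function name `softmax_with_temperature` is hypothetical, and the max-subtraction step is a standard numerical-stability trick added here, not something the text above prescribes.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum avoids overflow and does not change the result.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]
    for t in (0.5, 1.0, 2.0):
        print(t, softmax_with_temperature(logits, t))
```

Running this shows the probability of the top token shrinking as T grows. As T approaches 0 the distribution approaches greedy (argmax) selection, which is why T = 0 itself is rejected by this sketch.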

Math

Softmax with Temperature

Given logits $z_{i}$ and temperature $T$, the probability of token $x_{i}$ is:

$P (x_{i}) = \exp (z_{i} / T) / \sum_{j} \exp (z_{j} / T)$

Top-k Filter

Given the set of all tokens $V$ and parameter $k$:

$V_{k} = \{ x_{i} \in V : \mathrm{rank} (x_{i}) \leq k \}$, where $\mathrm{rank} (x_{i})$ is the position of $x_{i}$ when tokens are sorted by descending logit

Renormalize: $P^{'} (x_{i}) = P (x_{i}) / \sum_{x_{j} \in V_{k}} P (x_{j})$
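
As a worked illustration with made-up numbers, take four tokens with probabilities $P (x_{1}) = 0.5$, $P (x_{2}) = 0.3$, $P (x_{3}) = 0.15$, $P (x_{4}) = 0.05$ and $k = 2$. Then $V_{k} = \{ x_{1}, x_{2} \}$ and the retained mass is $0.5 + 0.3 = 0.8$, so:

$P^{'} (x_{1}) = 0.5 / 0.8 = 0.625$, $P^{'} (x_{2}) = 0.3 / 0.8 = 0.375$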

Top-p Filter

Given sorted probabilities $p_{1} \geq p_{2} \geq \dots$ and threshold $p$:

$V_{p} = \{ x_{1}, x_{2}, \dots, x_{m} \}$, where $m$ is the smallest index such that $\sum_{i=1}^{m} p_{i} \geq p$

Renormalize over $V_{p}$: $P^{'} (x_{i}) = p_{i} / \sum_{j=1}^{m} p_{j}$ for $i \leq m$.
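
As a worked illustration with made-up numbers, take sorted probabilities $p_{1} = 0.5$, $p_{2} = 0.3$, $p_{3} = 0.15$, $p_{4} = 0.05$ and threshold $p = 0.9$. The cumulative sums are $0.5$, $0.8$, $0.95$, so the threshold is first reached at $m = 3$ and $V_{p} = \{ x_{1}, x_{2}, x_{3} \}$. Renormalizing by the retained mass $0.95$ gives:

$P^{'} (x_{1}) \approx 0.526$, $P^{'} (x_{2}) \approx 0.316$, $P^{'} (x_{3}) \approx 0.158$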

Trade-offs

Temperature controls creativity vs. coherence: low temperature yields safe, repetitive text; high temperature yields diverse but potentially incoherent text
Top-k prevents degenerate outputs by excluding the long tail, but uses a fixed cutoff regardless of distribution shape
Top-p adapts the cutoff to the distribution shape, keeping more tokens when the distribution is flat and fewer when it is peaked
Combining top-k and top-p applies both limits: top-k caps the number of candidates when the distribution is flat, while top-p trims the set further when it is peaked
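
One way to combine the three controls is sketched below. The order used here (temperature, then top-k, then top-p) and the default parameter values are assumptions of this sketch, not a statement about how any ggml-based program orders or configures its samplers; `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=0.8, k=40, p=0.95, rng=random):
    """Temperature-scale, apply top-k then top-p, and draw one token index."""
    if temperature <= 0 or k < 1 or not 0.0 < p <= 1.0:
        raise ValueError("need temperature > 0, k >= 1 and 0 < p <= 1")
    # 1. Temperature-scaled softmax (maximum subtracted for stability).
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 2. Top-k: keep the k most probable tokens, in descending order.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass_k = sum(probs[i] for i in order)
    # 3. Top-p on the renormalized top-k survivors.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass_k
        if cumulative >= p:
            break
    # 4. Sample; random.choices accepts relative (unnormalized) weights.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_token([5.0, 4.0, 0.0, -3.0], k=2, rng=rng) for _ in range(10)])
```

With `k=2` only the two highest-logit tokens (indices 0 and 1) can ever be returned, regardless of the temperature.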
