Principle:Ggml org Ggml Token Sampling
Summary
Token Sampling is the process of selecting the next token from a probability distribution over the vocabulary. Various strategies exist to control the trade-off between creativity and coherence in text generation.
Theory
Top-k Sampling
Restrict the candidate set to the k highest-probability tokens, then renormalize the distribution and sample. This prevents extremely low-probability tokens from being selected.
- Keep only the k tokens with the highest logits
- Set all other logits to negative infinity, which gives those tokens zero probability after softmax
- Renormalize the remaining probabilities and sample
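The three steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the ggml implementation; the function name `top_k_sample` and its `rng` parameter are hypothetical names chosen for this example.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-logit candidates.

    Hypothetical helper for illustration only.
    Returns (sampled_index, {kept_index: renormalized_probability}).
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Indices of the k largest logits (stable sort: ties keep the lower index first).
    kept = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    # Softmax over the kept logits only. Dropped tokens would contribute
    # exp(-inf) = 0, so this is the renormalization step.
    m = max(logits[i] for i in kept)  # subtract the max for numerical stability
    weights = [math.exp(logits[i] - m) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]
    token = rng.choices(kept, weights=probs, k=1)[0]
    return token, dict(zip(kept, probs))
```

For example, calling `top_k_sample([2.0, 1.0, 0.5, -1.0], k=2)` can only ever return index 0 or 1, because the two lowest logits are removed before sampling.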
Top-p / Nucleus Sampling
Restrict the candidate set to the smallest set of tokens whose cumulative probability is greater than or equal to p, then renormalize and sample.
- Sort tokens by probability in descending order
- Keep tokens until the cumulative probability >= p
- Renormalize the remaining probabilities and sample
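A minimal Python sketch of the nucleus filter follows. It assumes the input is already a probability distribution (non-negative values summing to about 1); the names `top_p_filter` and `top_p_sample` are hypothetical and not taken from ggml.

```python
import random

def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the smallest prefix of
    the descending-sorted distribution whose cumulative mass is >= p.

    Hypothetical helper for illustration only.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break  # the token that crosses the threshold is included
    # Divide by the kept mass so the surviving probabilities sum to 1.
    return {i: probs[i] / cumulative for i in kept}

def top_p_sample(probs, p, rng=random):
    """Draw one token index from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]
```

With probabilities `[0.5, 0.3, 0.15, 0.05]` and `p = 0.9`, the first three tokens are kept (cumulative mass about 0.95) and each is divided by that mass; the last token can never be sampled.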
Temperature Scaling
Sharpen or flatten the probability distribution by dividing logits by a temperature parameter T before applying softmax.
- T < 1.0 sharpens the distribution (more deterministic); as T approaches 0, sampling approaches always picking the highest-logit token
- T = 1.0 leaves the distribution unchanged
- T > 1.0 flattens the distribution (more random / creative)
Math
Softmax with Temperature
Given logits logit_1, ..., logit_n and temperature T > 0, the probability of token x_i is:
P(x_i) = exp(logit_i / T) / sum(exp(logit_j / T))
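The formula can be evaluated directly, as in the sketch below. Subtracting the largest scaled logit before exponentiating is an addition made here for numerical stability; it does not change the result because the common factor cancels between numerator and denominator. The function name `softmax_with_temperature` is hypothetical.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Compute P(x_i) = exp(logit_i / T) / sum_j exp(logit_j / T).

    Hypothetical helper for illustration only.
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # shift so the largest exponent is 0 (avoids overflow)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Evaluating the same logits at T = 0.5, 1.0, and 2.0 shows the effect described above: the probability of the highest-logit token is largest at T = 0.5 and smallest at T = 2.0.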
Top-k Filter
Given the set of all tokens V and parameter k:
V_k = { x_i in V : rank(x_i) <= k } (sorted by descending logit)
Renormalize:
P'(x_i) = P(x_i) / sum(P(x_j) for x_j in V_k) if x_i in V_k, and P'(x_i) = 0 otherwise
Top-p Filter
Given probabilities sorted in descending order, p_1 >= p_2 >= ... >= p_n, and threshold p:
V_p = { x_1, x_2, ..., x_m } where m is the smallest index such that sum(p_1..p_m) >= p
Renormalize over V_p.
Trade-offs
- Temperature controls creativity vs. coherence: low temperature yields safe, repetitive text; high temperature yields diverse but potentially incoherent text
- Top-k prevents degenerate outputs by excluding the long tail, but uses a fixed cutoff regardless of distribution shape
- Top-p adapts the cutoff to the distribution shape, keeping more tokens when the distribution is flat and fewer when it is peaked
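- Combining top-k and top-p applies both limits at once: top-k caps the number of candidates when the distribution is flat, and top-p trims the set further when the distribution is peaked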
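As an illustration of how the three controls can be combined, the sketch below applies temperature scaling, then top-k, then top-p, and finally samples. The order of the stages, the function name `sample_token`, and the default values (0.8, 40, 0.95) are assumptions made for this example; they are not taken from ggml, and real implementations may order or configure the stages differently.

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=random):
    """Temperature -> top-k -> softmax -> top-p -> sample.

    Hypothetical pipeline for illustration only; stage order and defaults
    are assumptions, not a specific library's behavior.
    """
    if not logits:
        raise ValueError("logits must not be empty")
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    # 1. Temperature scaling of the raw logits.
    scaled = [z / temperature for z in logits]
    # 2. Top-k: keep the k largest scaled logits, in descending order.
    k = max(1, min(top_k, len(scaled)))
    order = sorted(range(len(scaled)), key=lambda i: -scaled[i])[:k]
    # 3. Softmax over the survivors (max subtracted for stability).
    m = scaled[order[0]]
    exps = [math.exp(scaled[i] - m) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-p: shortest descending prefix with cumulative mass >= top_p.
    tokens, weights, cumulative = [], [], 0.0
    for i, pr in zip(order, probs):
        tokens.append(i)
        weights.append(pr)
        cumulative += pr
        if cumulative >= top_p:
            break
    # 5. random.choices normalizes the weights, which performs the
    #    final renormalization over the kept tokens.
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Setting `top_k` to the vocabulary size or `top_p` to 1.0 effectively disables the corresponding filter, which makes it easy to compare each stage's effect in isolation.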