Principle: turboderp-org ExLlamaV2 Sampling Configuration
| Knowledge Sources | |
|---|---|
| Domains | Text_Generation, Sampling, NLP |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Token sampling controls the randomness and quality of generated text by transforming a model's raw logit output into a probability distribution from which the next token is selected.
Description
After a language model produces logits (unnormalized scores) for each token in the vocabulary, sampling strategies determine which token is actually selected. Different strategies offer different trade-offs between creativity (diversity) and coherence (quality):
- Temperature: Scales logits before softmax. Higher temperature (>1.0) flattens the distribution, increasing randomness. Lower temperature (<1.0) sharpens it, favoring high-probability tokens. A temperature of 0 is conventionally treated as greedy (argmax) decoding, the limiting case as T approaches 0.
- Top-k: Restricts sampling to the k highest-probability tokens, zeroing out all others. Prevents sampling from the long tail of unlikely tokens.
- Top-p (nucleus sampling): Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. Dynamically adjusts the candidate set size based on the distribution's shape, unlike the fixed cutoff of top-k.
- Min-p: Filters out tokens whose probability is less than min_p times the probability of the most likely token. Provides a relative threshold that adapts to the distribution.
- Typical sampling: Selects tokens whose information content (negative log probability) is close to the expected information content of the distribution, filtering out both very common and very rare tokens.
- Tail-free sampling (TFS): Uses the second derivative of the sorted probability distribution to identify and remove the "tail" of low-probability tokens.
- Mirostat: An adaptive sampling algorithm that targets a specific surprise value (tau, the base-2 log of the target perplexity) and adjusts the effective top-k dynamically using a learning rate (eta) to maintain consistent text quality.
- Repetition penalties: Reduce the probability of tokens that have already appeared in the context, discouraging repetitive text. Variants include token repetition penalty, frequency penalty, and presence penalty.
- DRY (Don't Repeat Yourself): Penalizes tokens that would continue a sequence that has appeared before in the output, targeting phrase-level repetition.
- XTC (Exclude Top Choices): With some probability, removes the highest-probability tokens from consideration, forcing more creative choices.
- Smoothing factor: Applies quadratic smoothing to the logit distribution, reducing the gap between high and low probability tokens.
These strategies can be combined and applied in configurable order, allowing fine-grained control over the generation behavior.
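As a toy illustration of such a chain, the sketch below applies temperature scaling, then a min-p filter, then a multinomial draw. Function names and the ordering are illustrative only, not ExLlamaV2's actual API:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over raw logits."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Drop tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def sample(probs, rng):
    """Multinomial draw from a probability list."""
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if cum >= r:
            return i
    return len(probs) - 1

logits = [3.0, 2.0, 0.5, -1.0]
probs = min_p_filter(softmax(logits, temperature=0.8), min_p=0.1)
token = sample(probs, random.Random(42))
```

Each stage consumes the previous stage's output, which is why the configured order of samplers changes the result.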
Usage
Sampling configuration is used whenever generating text:
- Greedy decoding (top_k=1): For deterministic, most-likely output
- Creative writing: Higher temperature (0.8-1.2), moderate top-p (0.9-0.95)
- Code generation: Lower temperature (0.2-0.5), lower top-p (0.8-0.9)
- Chat: Balanced settings with repetition penalty
- Constrained generation: Combined with grammar filters for structured output
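The guideline ranges above could be captured as plain preset dictionaries. Key names here are generic placeholders for illustration, not a specific library's settings object:

```python
# Illustrative presets following the guideline ranges above;
# key names are generic, not tied to any particular library's API.
PRESETS = {
    "greedy":   {"temperature": 1.0,  "top_k": 1,  "top_p": 1.0},
    "creative": {"temperature": 1.0,  "top_k": 0,  "top_p": 0.92},  # top_k=0: disabled
    "code":     {"temperature": 0.35, "top_k": 50, "top_p": 0.85},
    "chat":     {"temperature": 0.7,  "top_k": 50, "top_p": 0.9,
                 "repetition_penalty": 1.1},
}

def preset(name):
    """Return a copy of a named preset so callers can tweak it safely."""
    return dict(PRESETS[name])
```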
Theoretical Basis
Temperature Scaling
# Given raw logits z_i for each token i in vocabulary V:
# Temperature-scaled probability:
p(x_i) = exp(z_i / T) / sum_j(exp(z_j / T))
# T = 1.0: standard softmax
# T > 1.0: more uniform (creative)
# T < 1.0: more peaked (deterministic)
# T -> 0: argmax (greedy)
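The scaling above can be sketched in a few lines of Python (a minimal illustration, not library code):

```python
import math

def temperature_softmax(logits, t):
    """Convert raw logits to probabilities with temperature scaling."""
    m = max(z / t for z in logits)  # subtract max for numerical stability
    exps = [math.exp(z / t - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Lower temperature concentrates mass on the top token; higher flattens it
probs_cold = temperature_softmax([2.0, 1.0, 0.1], t=0.5)
probs_hot = temperature_softmax([2.0, 1.0, 0.1], t=2.0)
```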
Top-k Sampling
# Sort tokens by probability: p_1 >= p_2 >= ... >= p_|V|
# Keep only top k tokens:
candidates = {x_i : i <= k}
# Re-normalize probabilities over candidates
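A minimal Python sketch of this filter (illustrative only):

```python
def top_k_filter(probs, k):
    """Zero out all but the k highest-probability tokens, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

filtered = top_k_filter([0.5, 0.3, 0.15, 0.05], k=2)  # ~ [0.625, 0.375, 0.0, 0.0]
```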
Nucleus (Top-p) Sampling
# Sort tokens by probability: p_1 >= p_2 >= ... >= p_|V|
# Find smallest k such that: sum(p_1, ..., p_k) >= p
# candidates = {x_1, ..., x_k}
# Re-normalize and sample from candidates
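The same steps in sketch form (illustrative only); note that unlike top-k, the number of surviving tokens depends on the shape of the distribution:

```python
def top_p_filter(probs, p):
    """Keep the smallest prefix of descending-sorted tokens whose cumulative
    probability reaches p, zero out the rest, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], p=0.9)  # drops only the 0.05 token
```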
Mirostat
# Target perplexity: tau
# Learning rate: eta
# Maintain running estimate: mu (initialized to 2*tau)
# At each step:
# 1. Compute surprise: s = -log2(p(selected_token))
# 2. Update mu: mu = mu - eta * (s - tau)
# 3. For next step, use top-k where k = floor(2^mu)
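A single step of the loop above can be sketched as follows. This is a simplified illustration of the feedback rule as stated, not the full Mirostat algorithm:

```python
import math
import random

def mirostat_step(sorted_probs, mu, tau, eta, rng):
    """One step of the simplified Mirostat loop described above.

    sorted_probs must be sorted descending. Returns (sampled index, updated mu).
    """
    k = max(1, min(len(sorted_probs), int(2 ** mu)))  # adaptive top-k
    total = sum(sorted_probs[:k])
    r, cum = rng.random() * total, 0.0
    for i in range(k):
        cum += sorted_probs[i]
        if cum >= r:
            break
    surprise = -math.log2(sorted_probs[i])  # observed surprise s
    mu -= eta * (surprise - tau)            # feedback toward target tau
    return i, mu

idx, mu = mirostat_step([0.5, 0.25, 0.125, 0.125], mu=2.0, tau=1.0,
                        eta=0.1, rng=random.Random(0))
```

Sampling a surprising (low-probability) token lowers mu, shrinking the candidate set on the next step, and vice versa.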
Repetition Penalty
# For each token i that appeared in context:
# If logit z_i > 0: z_i = z_i / penalty
# If logit z_i < 0: z_i = z_i * penalty
# penalty > 1.0 reduces probability of repeated tokens
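The asymmetric scaling above (dividing positive logits, multiplying negative ones, so both move toward lower probability) can be sketched as:

```python
def apply_repetition_penalty(logits, seen_tokens, penalty):
    """Scale logits of previously seen tokens (CTRL-style penalty)."""
    out = list(logits)
    for i in set(seen_tokens):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

# Token 0 (positive logit) is divided, token 1 (negative) is multiplied,
# token 2 (unseen) is left untouched.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], seen_tokens=[0, 1],
                                     penalty=1.3)
```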