
Principle:Turboderp org Exllamav2 Sampling Configuration

From Leeroopedia
Knowledge Sources
Domains Text_Generation, Sampling, NLP
Last Updated 2026-02-15 00:00 GMT

Overview

Token sampling controls the randomness and quality of generated text by transforming a model's raw logit output into a probability distribution from which the next token is selected.

Description

After a language model produces logits (unnormalized scores) for each token in the vocabulary, sampling strategies determine which token is actually selected. Different strategies offer different trade-offs between creativity (diversity) and coherence (quality):

  • Temperature: Scales logits before softmax. Higher temperature (>1.0) flattens the distribution, increasing randomness. Lower temperature (<1.0) sharpens it, favoring high-probability tokens. Temperature of 0 is equivalent to greedy decoding.
  • Top-k: Restricts sampling to the k highest-probability tokens, zeroing out all others. Prevents sampling from the long tail of unlikely tokens.
  • Top-p (nucleus sampling): Restricts sampling to the smallest set of tokens whose cumulative probability exceeds p. Dynamically adjusts the candidate set size based on the distribution's shape, unlike the fixed cutoff of top-k.
  • Min-p: Filters out tokens whose probability is less than min_p times the probability of the most likely token. Provides a relative threshold that adapts to the distribution.
  • Typical sampling: Selects tokens whose information content (negative log probability) is close to the expected information content of the distribution, filtering out both very common and very rare tokens.
  • Tail-free sampling (TFS): Uses the second derivative of the sorted probability distribution to identify and remove the "tail" of low-probability tokens.
  • Mirostat: An adaptive sampling algorithm that targets a specific surprise value (tau, which corresponds to a target perplexity) and adjusts the effective top-k dynamically using a learning rate (eta) to maintain consistent text quality.
  • Repetition penalties: Reduce the probability of tokens that have already appeared in the context, discouraging repetitive text. Variants include token repetition penalty, frequency penalty, and presence penalty.
  • DRY (Don't Repeat Yourself): Penalizes tokens that would continue a sequence that has appeared before in the output, targeting phrase-level repetition.
  • XTC (Exclude Top Choices): With some probability, removes the highest-probability tokens from consideration, forcing more creative choices.
  • Smoothing factor: Applies quadratic smoothing to the logit distribution, reducing the gap between high and low probability tokens.

These strategies can be combined and applied in configurable order, allowing fine-grained control over the generation behavior.
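Of the filters above, min-p is the simplest to illustrate: it keeps only tokens whose probability clears a threshold set relative to the most likely token. A minimal sketch (the function name and list-based representation are illustrative, not exllamav2's internal API):

```python
def min_p_filter(probs, min_p):
    """Zero out tokens whose probability is below min_p times the
    top token's probability, then renormalize the survivors."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

Because the threshold scales with the peak probability, a confident distribution prunes aggressively while a flat one keeps many candidates.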

Usage

Sampling configuration is used whenever generating text:

  • Greedy decoding (top_k=1): For deterministic, most-likely output
  • Creative writing: Higher temperature (0.8-1.2), moderate top-p (0.9-0.95)
  • Code generation: Lower temperature (0.2-0.5), lower top-p (0.8-0.9)
  • Chat: Balanced settings with repetition penalty
  • Constrained generation: Combined with grammar filters for structured output
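As a concrete example, a chat-style configuration in exllamav2 might look like the sketch below. The attribute names follow the `ExLlamaV2Sampler.Settings` object; treat the exact field names and values as assumptions to verify against the installed version:

```python
from exllamav2.generator import ExLlamaV2Sampler

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8                # balanced randomness for chat
settings.top_k = 50                       # cap the candidate set
settings.top_p = 0.9                      # nucleus cutoff
settings.token_repetition_penalty = 1.05  # mild discouragement of repeats
```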

Theoretical Basis

Temperature Scaling

# Given raw logits z_i for each token i in vocabulary V:
# Temperature-scaled probability:
p(x_i) = exp(z_i / T) / sum_j(exp(z_j / T))

# T = 1.0: standard softmax
# T > 1.0: more uniform (creative)
# T < 1.0: more peaked (deterministic)
# T -> 0: argmax (greedy)
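The scaling above can be written as a short function (a standalone sketch using Python lists; subtracting the max logit before exponentiating is a standard numerical-stability step, not part of the formula):

```python
import math

def temperature_softmax(logits, T):
    """Convert raw logits to probabilities with temperature scaling (T > 0)."""
    scaled = [z / T for z in logits]
    m = max(scaled)                        # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower T concentrates mass on the top token; higher T spreads it out, matching the comments above.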

Top-k Sampling

# Sort tokens by probability: p_1 >= p_2 >= ... >= p_|V|
# Keep only top k tokens:
candidates = {x_i : i <= k}
# Re-normalize probabilities over candidates

Nucleus (Top-p) Sampling

# Sort tokens by probability: p_1 >= p_2 >= ... >= p_|V|
# Find smallest k such that: sum(p_1, ..., p_k) >= p
# candidates = {x_1, ..., x_k}
# Re-normalize and sample from candidates
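The nucleus cutoff translates to a short loop over the sorted tokens (again an illustrative sketch on Python lists):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= p:
            break
    kept = [probs[i] if i in keep else 0.0 for i in range(len(probs))]
    total = sum(kept)
    return [q / total for q in kept]
```

Unlike top-k, the number of survivors varies with the shape of the distribution: a peaked distribution may keep one token, a flat one may keep dozens.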

Mirostat

# Target perplexity: tau
# Learning rate: eta
# Maintain running estimate: mu (initialized to 2*tau)

# At each step:
# 1. Compute surprise: s = -log2(p(selected_token))
# 2. Update mu: mu = mu - eta * (s - tau)
# 3. For next step, use top-k where k = floor(2^mu)
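The three steps above amount to one feedback update per generated token. A minimal sketch of that update (function name and flooring of k to at least 1 are illustrative choices):

```python
import math

def mirostat_update(mu, selected_prob, tau, eta):
    """One Mirostat step: measure the surprise of the sampled token,
    nudge mu toward the target tau, and derive the next top-k."""
    surprise = -math.log2(selected_prob)   # observed information content
    mu = mu - eta * (surprise - tau)       # feedback correction
    k = max(1, math.floor(2 ** mu))        # effective top-k for the next step
    return mu, k
```

When the sampled token is exactly as surprising as the target (surprise == tau), mu and therefore k stay unchanged; too-surprising tokens shrink k, too-predictable ones grow it.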

Repetition Penalty

# For each token i that appeared in context:
# If logit z_i > 0: z_i = z_i / penalty
# If logit z_i < 0: z_i = z_i * penalty
# penalty > 1.0 reduces probability of repeated tokens
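The sign-dependent rule above can be sketched as follows (a standalone illustration; exllamav2 applies this on logit tensors rather than Python lists):

```python
def apply_repetition_penalty(logits, seen_ids, penalty):
    """Shrink the logits of tokens already seen in context:
    positive logits are divided by the penalty, negative ones multiplied,
    so both move toward lower probability when penalty > 1."""
    out = list(logits)
    for i in seen_ids:
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out
```

Dividing positives and multiplying negatives ensures the adjustment always pushes a repeated token's logit downward, regardless of its sign.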

Related Pages

Implemented By

Uses Heuristic
