Principle: vLLM Sampling Configuration
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Text Generation |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Sampling configuration is the specification of hyperparameters that control the stochastic token-selection process during autoregressive text generation.
Description
When a language model produces a probability distribution over the vocabulary at each decoding step, the sampling configuration determines how the next token is chosen from that distribution. Different parameter settings yield dramatically different output characteristics, ranging from deterministic greedy decoding to highly creative stochastic generation.
The core sampling parameters include:
- Temperature: Scales the logits before applying softmax. Values below 1.0 sharpen the distribution (more deterministic), while values above 1.0 flatten it (more random). A temperature of 0 is treated as a special case that yields greedy (argmax) decoding.
- Top-p (nucleus sampling): Restricts sampling to the smallest set of tokens whose cumulative probability exceeds the threshold p. This dynamically adjusts the candidate pool size based on the shape of the distribution.
- Top-k: Restricts sampling to the k most probable tokens. Unlike top-p, the candidate pool size is fixed regardless of the distribution shape.
- Min-p: Filters out tokens whose probability is below a fraction of the most likely token's probability. This provides an alternative to top-k/top-p that adapts to the absolute confidence of the model.
- Max tokens: Hard limit on the number of tokens to generate per output sequence.
- Stop sequences: Strings or token IDs that terminate generation when produced.
- Seed: Fixes the random number generator state for reproducible outputs.
- Penalty parameters: Frequency penalty, presence penalty, and repetition penalty discourage or encourage token reuse.
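To show how these parameters fit together, here is an illustrative configuration expressed as a plain dictionary. The key names follow common OpenAI-compatible request conventions and every value is an example, not a recommended default:

```python
# Illustrative sampling configuration; names follow common
# OpenAI-compatible request conventions, values are examples only.
sampling_config = {
    "temperature": 0.7,         # logit scaling; 0 would mean greedy decoding
    "top_p": 0.9,               # nucleus sampling threshold
    "top_k": 50,                # fixed candidate-pool size
    "min_p": 0.05,              # fraction of the top token's probability
    "max_tokens": 256,          # hard cap on generated tokens per output
    "stop": ["\n\n", "###"],    # sequences that terminate generation
    "seed": 42,                 # fixes RNG state for reproducibility
    "frequency_penalty": 0.2,   # scales with a token's occurrence count
    "presence_penalty": 0.1,    # flat penalty once a token has appeared
    "repetition_penalty": 1.1,  # multiplicative logit penalty
}
```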
Usage
Configure sampling parameters whenever you need to control the quality, diversity, or length of generated text. Use low temperature, or temperature 0 for fully greedy decoding, for factual tasks (summarization, extraction). Use higher temperature with top-p or top-k for creative tasks (story writing, brainstorming). Set a seed for reproducible experiments.
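The guidance above can be captured as two hypothetical presets. The preset names and the specific values are ours, chosen for illustration rather than taken from any vLLM default:

```python
# Hypothetical presets following the usage guidance; values are illustrative.
FACTUAL_PRESET = {
    "temperature": 0.0,  # greedy (argmax) decoding for extraction/summarization
    "max_tokens": 512,
}
CREATIVE_PRESET = {
    "temperature": 0.9,  # flatter distribution for more diverse output
    "top_p": 0.95,       # nucleus sampling to trim the low-probability tail
    "seed": 1234,        # creative but still reproducible across runs
    "max_tokens": 512,
}
```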
Theoretical Basis
The standard autoregressive generation process at step t computes:
P(x_t | x_{<t}) = softmax(z_t / T)
where z_t is the logit vector and T is the temperature.
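The temperature-scaled softmax can be sketched in a few lines of plain Python. This is a minimal, numerically stabilized version for illustration; it assumes T > 0 and leaves the T = 0 greedy special case to real implementations:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Compute P = softmax(z / T). Assumes temperature > 0; real
    implementations special-case T = 0 as greedy argmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Lower temperature concentrates mass on the top logit;
# higher temperature flattens the distribution.
probs_sharp = softmax_with_temperature([2.0, 1.0, 0.5], temperature=0.5)
probs_flat = softmax_with_temperature([2.0, 1.0, 0.5], temperature=2.0)
```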
Nucleus (top-p) sampling (Holtzman et al., 2019) selects the smallest token set V_p such that:
sum_{x in V_p} P(x | x_{<t}) >= p
The distribution is then renormalized over V_p. This adapts the candidate pool dynamically: when the model is confident, few tokens are considered; when uncertain, more tokens are included.
Top-k sampling truncates to a fixed set of the k most probable tokens and renormalizes.
Repetition penalties modify the logits before sampling:
- Frequency penalty: Subtracts a value proportional to how many times a token has appeared.
- Presence penalty: Subtracts a fixed value for any token that has appeared at all.
- Repetition penalty: Divides the logit by the penalty factor if the token has appeared (for logits > 0) or multiplies by it (for logits < 0).
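The three penalty styles can be sketched as a single pass over the logit vector. The ordering of presence and repetition adjustments within a token is our simplification; implementations may differ in details:

```python
def apply_penalties(logits, counts, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Apply the three penalty styles to a logit vector.
    counts[i] is how many times token i has already been generated."""
    out = []
    for logit, count in zip(logits, counts):
        logit -= frequency_penalty * count  # grows with occurrence count
        if count > 0:
            logit -= presence_penalty       # flat, once per seen token
            # Repetition penalty: divide positive logits, multiply negative
            # ones, pushing seen tokens toward lower probability either way.
            logit = logit / repetition_penalty if logit > 0 \
                else logit * repetition_penalty
        out.append(logit)
    return out

# Token 0 (seen once, positive logit) is divided; token 1 (seen once,
# negative logit) is multiplied; token 2 (unseen) is untouched.
penalized = apply_penalties([2.0, -1.0, 0.5], [1, 1, 0],
                            repetition_penalty=2.0)  # [1.0, -2.0, 0.5]
```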
When multiple filtering methods are active, they are typically applied in sequence: repetition penalties first, then temperature scaling, then top-k, then top-p, then min-p.
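The full ordering can be sketched end to end. This is a simplified illustration of the sequence described above (it applies only the repetition penalty, not the frequency/presence penalties), not vLLM's actual sampler:

```python
import math
import random

def sample_next_token(logits, counts, *, temperature=1.0, top_k=0,
                      top_p=1.0, min_p=0.0, repetition_penalty=1.0,
                      rng=random):
    """Illustrative pipeline: repetition penalty -> temperature ->
    top-k -> top-p -> min-p -> renormalize and sample."""
    # 1. Repetition penalty on the raw logits for already-seen tokens.
    logits = [(z / repetition_penalty if z > 0 else z * repetition_penalty)
              if counts[i] > 0 else z for i, z in enumerate(logits)]
    # 2. Temperature scaling and softmax (assumes temperature > 0).
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # 3. Top-k: keep the k most probable tokens (0 disables the filter).
    keep = order[:top_k] if top_k > 0 else order
    # 4. Top-p: smallest prefix whose cumulative probability reaches p.
    nucleus, cumulative = [], 0.0
    for i in keep:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # 5. Min-p: drop tokens below min_p times the top probability.
    top = probs[order[0]]
    nucleus = [i for i in nucleus if probs[i] >= min_p * top]
    # 6. Renormalize over the surviving candidates and sample.
    mass = sum(probs[i] for i in nucleus)
    weights = [probs[i] / mass for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]
```

With `top_k=1` the pipeline degenerates to greedy decoding, which is a convenient sanity check.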