Principle: ggml-org llama.cpp Sampler Chain Configuration
| Aspect | Detail |
|---|---|
| Principle Name | Sampler Chain Configuration |
| Category | Sampling |
| Workflow | Interactive_Chat |
| Applies To | llama.cpp |
| Status | Active |
Overview
Description
Sampler Chain Configuration is the principle of composing multiple token sampling strategies into an ordered processing pipeline (a "chain") that collectively determines how the next token is selected during text generation. Rather than applying a single sampling method, llama.cpp allows developers to stack multiple samplers in sequence, where each sampler transforms the token probability distribution before passing it to the next sampler in the chain. The final sampler in the chain is responsible for actually selecting a token from the modified distribution.
Usage
Sampler chain configuration is performed once during application startup, after model and context initialization but before the generation loop begins. The chain is created, individual samplers are added in the desired order, and the resulting chain object is then used repeatedly during token generation. The chain persists for the entire chat session and is freed only during shutdown.
The order of samplers in the chain is significant: earlier samplers modify the probability distribution that later samplers see. Filtering samplers (such as min-p or top-k) should generally precede temperature scaling, and the final sampler should be a selection sampler (such as dist for random sampling or greedy for deterministic selection).
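As a sketch, the min-p → temperature → dist ordering described above can be configured with the llama.cpp sampler API (function names as exposed by `llama.h` in recent versions; check your own version's header, as this API has evolved):

```cpp
// Sketch: build a sampler chain at startup, reuse it each generation step,
// free it at shutdown. Parameter values (0.05f, 0.8f) are illustrative only.
#include "llama.h"

llama_sampler * make_chat_sampler() {
    llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());

    // 1. min-p filtering: drop tokens below 5% of the top token's
    //    probability, keeping at least 1 candidate
    llama_sampler_chain_add(chain, llama_sampler_init_min_p(0.05f, 1));
    // 2. temperature scaling of the surviving distribution
    llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
    // 3. final selection sampler: draw randomly from the modified distribution
    llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

    return chain;
}

// In the generation loop:
//     llama_token id = llama_sampler_sample(chain, ctx, -1);
// During shutdown:
//     llama_sampler_free(chain);
```

Because the chain is built once and reused, changing sampling behavior at runtime means freeing the old chain and constructing a new one.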
Theoretical Basis
Token sampling in autoregressive language models operates on the logit vector produced by the model's final layer. The raw logits are real-valued scores for each token in the vocabulary, which can be converted to probabilities via softmax. Different sampling strategies modify these logits or probabilities in different ways:
- Temperature scales the logits by a factor of 1/t, controlling the sharpness of the distribution. Lower temperatures make the distribution more peaked (more deterministic), while higher temperatures flatten it (more random).
- Top-k retains only the k tokens with the highest probabilities, discarding the rest.
- Top-p (nucleus sampling) retains the smallest set of tokens whose cumulative probability exceeds p.
- Min-p retains tokens whose probability is at least p times the probability of the most likely token.
- Repetition penalties reduce the probability of tokens that have recently appeared, discouraging repetition.
The chain pattern implements the Strategy and Chain of Responsibility design patterns. Each sampler is an independent strategy that can be tested and configured in isolation. When composed into a chain, they form a pipeline where each step refines the distribution for the next.
The final sampler in the chain must be a selection sampler that chooses a concrete token:
- greedy always picks the highest-probability token
- dist samples from the (modified) distribution using a random seed
- mirostat uses an adaptive algorithm to maintain a target entropy
For chat applications, a common configuration is min-p filtering followed by temperature scaling followed by distribution sampling, as this provides a good balance between coherence and variety in responses.