Principle:Ggml org Llama cpp Sampling System
| Knowledge Sources | |
|---|---|
| Domains | Sampling |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The Sampling System principle covers the type interfaces and data structures used for token sampling and for speculative decoding samplers.
Description
This principle covers the header-level type definitions and interfaces that govern how token sampling operates in llama.cpp. They comprise the sampler chain architecture, the individual sampler type interfaces (temperature, top-k, top-p, min-p, repetition penalties, grammar constraints), and the speculative decoding sampler interface that coordinates draft and target model sampling. These headers define the contracts that concrete sampler implementations must follow.
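To make the idea of a sampler contract and a sampler chain concrete, here is a minimal sketch. It is written with C++ virtual classes for brevity and does not reproduce the llama.cpp headers; `Candidate`, `Sampler`, `TemperatureSampler`, `TopKSampler`, and `SamplerChain` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical types for illustration; not the llama.cpp declarations.
struct Candidate {
    int32_t id;     // token id
    float   logit;  // unnormalized score
};

// The contract: every sampler transforms the candidate list in place.
struct Sampler {
    virtual ~Sampler() = default;
    virtual void apply(std::vector<Candidate> & cands) = 0;
};

// Divides every logit by the temperature (values below 1 sharpen the distribution).
struct TemperatureSampler : Sampler {
    explicit TemperatureSampler(float t) : temp(t) { assert(t > 0.0f); }
    void apply(std::vector<Candidate> & cands) override {
        for (auto & c : cands) c.logit /= temp;
    }
    float temp;
};

// Keeps only the k highest-scoring candidates, sorted by descending logit.
struct TopKSampler : Sampler {
    explicit TopKSampler(std::size_t k) : k_(k) { assert(k > 0); }
    void apply(std::vector<Candidate> & cands) override {
        if (cands.size() <= k_) return;
        const auto mid = cands.begin() + static_cast<std::ptrdiff_t>(k_);
        std::partial_sort(cands.begin(), mid, cands.end(),
                          [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
        cands.erase(mid, cands.end());
    }
    std::size_t k_;
};

// A chain is itself a sampler that runs its stages in insertion order.
struct SamplerChain : Sampler {
    void add(std::unique_ptr<Sampler> s) { stages.push_back(std::move(s)); }
    void apply(std::vector<Candidate> & cands) override {
        for (auto & s : stages) s->apply(cands);
    }
    std::vector<std::unique_ptr<Sampler>> stages;
};
```

Because the chain implements the same contract as its stages, a caller only needs to hold one `Sampler` regardless of how many stages are composed. A new strategy is added by writing one more subclass and calling `add`.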
Usage
Apply this principle when implementing new sampling strategies, extending the sampler chain with custom samplers, or integrating speculative decoding with the sampling pipeline.
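As a rough illustration of the draft and target coordination involved in speculative decoding, the sketch below shows a greedy verification step: draft tokens are accepted while they match the target model's own choice, and the first mismatch is replaced by the target's token. This is a simplified greedy variant, not the interface defined by the headers, and `accept_draft` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, assuming greedy (argmax) verification.
// draft:  tokens proposed by the draft model.
// target: the token the target model picks at each draft position, plus one
//         extra entry for the position after the last draft token, so
//         target.size() == draft.size() + 1.
// Returns the tokens to commit: the matching draft prefix followed by one
// token taken from the target model.
std::vector<int> accept_draft(const std::vector<int> & draft,
                              const std::vector<int> & target) {
    assert(target.size() == draft.size() + 1);
    std::vector<int> accepted;
    std::size_t i = 0;
    for (; i < draft.size(); ++i) {
        if (draft[i] != target[i]) break;  // first disagreement ends acceptance
        accepted.push_back(draft[i]);
    }
    // target[i] is valid here because every earlier draft token was accepted,
    // so the target's prediction at position i saw the same prefix.
    accepted.push_back(target[i]);
    return accepted;
}
```

Every call commits at least one token, so generation always makes progress even when the draft model is wrong at the first position.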
Theoretical Basis
Token sampling transforms raw logits (unnormalized log-probabilities) from the model into a selected token. The sampling system uses a chain-of-responsibility pattern in which multiple samplers are composed in sequence, each modifying the token probability distribution. Common samplers include temperature scaling (dividing the logits by a temperature to control randomness), top-k filtering (keeping only the k most likely tokens), top-p (nucleus) sampling (keeping the smallest set of most likely tokens whose cumulative probability reaches p), min-p sampling (keeping tokens with probability at least min_p times the top token's probability), and repetition penalties (reducing the probability of recently generated tokens). The speculative decoding interface extends sampling to coordinate between a draft model that proposes tokens and a target model that verifies them.
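The transformations described above can be sketched as small free functions. This is an illustration under stated assumptions, not the llama.cpp implementation: all names (`TokenProb`, `softmax_sorted`, `top_k`, `top_p`, `min_p`, `penalize_repeats`) are hypothetical, and the repetition penalty uses one common convention (divide positive logits by the penalty, multiply non-positive ones).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct TokenProb { int id; float p; };  // hypothetical candidate type

// Temperature-scaled softmax; the result is sorted by descending probability.
std::vector<TokenProb> softmax_sorted(const std::vector<float> & logits, float temperature) {
    assert(!logits.empty() && temperature > 0.0f);
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<TokenProb> out;
    out.reserve(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        // Subtracting the maximum keeps exp() from overflowing.
        const float e = std::exp((logits[i] - max_logit) / temperature);
        out.push_back({static_cast<int>(i), e});
        sum += e;
    }
    for (auto & t : out) t.p /= sum;
    std::stable_sort(out.begin(), out.end(),
                     [](const TokenProb & a, const TokenProb & b) { return a.p > b.p; });
    return out;
}

// The filters below expect input sorted by descending probability.

// Top-k: keep the k most likely tokens.
void top_k(std::vector<TokenProb> & c, std::size_t k) {
    if (k < c.size()) c.erase(c.begin() + static_cast<std::ptrdiff_t>(k), c.end());
}

// Top-p: keep the smallest prefix whose cumulative probability reaches p.
void top_p(std::vector<TokenProb> & c, float p) {
    float cum = 0.0f;
    std::size_t keep = c.size();
    for (std::size_t i = 0; i < c.size(); ++i) {
        cum += c[i].p;
        if (cum >= p) { keep = i + 1; break; }
    }
    c.erase(c.begin() + static_cast<std::ptrdiff_t>(keep), c.end());
}

// Min-p: keep tokens with probability at least ratio * (top probability).
void min_p(std::vector<TokenProb> & c, float ratio) {
    assert(!c.empty());
    const float threshold = ratio * c.front().p;
    std::size_t keep = 0;
    while (keep < c.size() && c[keep].p >= threshold) ++keep;
    c.erase(c.begin() + static_cast<std::ptrdiff_t>(keep), c.end());
}

// Repetition penalty on raw logits; each recent token is penalized once.
void penalize_repeats(std::vector<float> & logits, const std::vector<int> & recent, float penalty) {
    std::vector<bool> seen(logits.size(), false);
    for (int id : recent) {
        if (id < 0) continue;
        const std::size_t idx = static_cast<std::size_t>(id);
        if (idx >= logits.size() || seen[idx]) continue;
        seen[idx] = true;
        logits[idx] = logits[idx] > 0.0f ? logits[idx] / penalty : logits[idx] * penalty;
    }
}
```

The filters only truncate the candidate list. Before a token is actually drawn, the surviving probabilities would need to be renormalized so they sum to 1, a step omitted here for brevity.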