Principle:Ggml org Llama cpp Sampling System

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Sampling
Last Updated	2026-02-15 00:00 GMT

Overview

The Sampling System principle covers the type interfaces and data structures that define token sampling and speculative decoding samplers in llama.cpp.

Description

This principle covers the header-level type definitions and interfaces that specify how token sampling operates in llama.cpp: the sampler chain architecture, the individual sampler interfaces (temperature, top-k, top-p, min-p, repetition penalties, grammar constraints), and the speculative decoding sampler interface that coordinates draft and target model sampling. These headers define the contracts that concrete sampler implementations must follow.
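The idea of a shared sampler contract composed into a chain can be illustrated with a minimal sketch. It is written in Python for brevity and is not the llama.cpp API: `Candidate`, `Sampler`, `Temperature`, `TopK`, and `SamplerChain` are hypothetical names chosen for this example, and the final softmax draw is an assumption about how a chain might select its token.

```python
from __future__ import annotations

import math
import random
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Candidate:
    """One entry of the candidate list (hypothetical type for this sketch)."""
    token_id: int
    logit: float


class Sampler(Protocol):
    """Hypothetical contract: take the candidate list, return the modified list."""

    def apply(self, candidates: list[Candidate]) -> list[Candidate]: ...


class Temperature:
    """Divides every logit by a temperature (assumed to be > 0)."""

    def __init__(self, temp: float) -> None:
        self.temp = temp

    def apply(self, candidates: list[Candidate]) -> list[Candidate]:
        return [Candidate(c.token_id, c.logit / self.temp) for c in candidates]


class TopK:
    """Keeps the k candidates with the highest logits, best first."""

    def __init__(self, k: int) -> None:
        self.k = k

    def apply(self, candidates: list[Candidate]) -> list[Candidate]:
        ranked = sorted(candidates, key=lambda c: c.logit, reverse=True)
        return ranked[: self.k]


class SamplerChain:
    """Runs samplers in insertion order; it satisfies the Sampler contract too."""

    def __init__(self, samplers: list[Sampler], seed: int = 0) -> None:
        self.samplers = list(samplers)
        self.rng = random.Random(seed)

    def apply(self, candidates: list[Candidate]) -> list[Candidate]:
        for sampler in self.samplers:
            candidates = sampler.apply(candidates)
        return candidates

    def sample(self, logits: list[float]) -> int:
        """Filter the candidates, then draw one token id from the survivors."""
        survivors = self.apply([Candidate(i, x) for i, x in enumerate(logits)])
        if not survivors:
            raise ValueError("the chain removed every candidate")
        # Softmax weights, shifted by the max logit for numerical stability.
        top = max(c.logit for c in survivors)
        weights = [math.exp(c.logit - top) for c in survivors]
        ids = [c.token_id for c in survivors]
        return self.rng.choices(ids, weights=weights, k=1)[0]
```

Because the chain exposes the same `apply` method as its members, a chain can itself be nested inside another chain. With a single surviving candidate the draw is deterministic: `SamplerChain([TopK(1), Temperature(0.7)]).sample([0.1, 2.0, -1.0])` returns token 1.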

Usage

Apply this principle when implementing new sampling strategies, extending the sampler chain with custom samplers, or integrating speculative decoding with the sampling pipeline.
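For the speculative decoding case, the following sketch shows one simple way a draft-and-verify step can be combined with sampling. It assumes greedy (argmax) selection on the target side and is not taken from the llama.cpp headers; `argmax` and `accept_draft` are hypothetical helpers, and `target_logits` stands in for the logits the target model would produce when evaluated over the drafted positions in one batch.

```python
from __future__ import annotations


def argmax(logits: list[float]) -> int:
    """Index of the largest logit (first one wins on ties)."""
    return max(range(len(logits)), key=logits.__getitem__)


def accept_draft(draft_tokens: list[int], target_logits: list[list[float]]) -> list[int]:
    """Greedy verification of drafted tokens against the target model.

    target_logits[i] holds the target logits for position i, computed with
    the drafted tokens before i as context, so it needs one more row than
    there are drafted tokens (the extra row yields a bonus token).
    """
    if len(target_logits) != len(draft_tokens) + 1:
        raise ValueError("need exactly len(draft_tokens) + 1 rows of target logits")

    accepted: list[int] = []
    for i, drafted in enumerate(draft_tokens):
        chosen = argmax(target_logits[i])
        accepted.append(chosen)
        if chosen != drafted:
            # First disagreement: keep the target's token and stop. Later
            # rows were conditioned on a rejected token, so they are discarded.
            return accepted
    # Every drafted token matched: the last row gives one extra token for free.
    accepted.append(argmax(target_logits[-1]))
    return accepted
```

Every returned token is one the target model would have picked on its own under greedy selection, so the draft model only affects speed, not output. A stochastic target sampler would need a different acceptance rule than the equality check used here.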

Theoretical Basis

Token sampling transforms raw logits (unnormalized log-probabilities) from the model into a single selected token. The sampling system composes multiple samplers in sequence, in the manner of a chain-of-responsibility pattern: each sampler modifies the token probability distribution and hands the result to the next, and the final step draws a token from what remains. Common samplers include temperature scaling (dividing the logits by a temperature to control randomness), top-k filtering (keeping only the k most likely tokens), top-p (nucleus) sampling (keeping the smallest set of most likely tokens whose cumulative probability reaches p), min-p sampling (keeping tokens with probability at least min_p times the top token's probability), and repetition penalties (reducing the probability of recently generated tokens). The speculative decoding interface extends sampling to coordinate between a draft model that proposes tokens and a target model that verifies them.
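The individual transformations can be written out as small functions over a token-to-logit mapping. This is an illustrative Python sketch, not the llama.cpp implementation: all function names are hypothetical, and the repetition penalty uses one common convention (divide positive logits, multiply negative ones), which is an assumption rather than something the headers are known to prescribe.

```python
from __future__ import annotations

import math


def softmax(logits: dict[int, float]) -> dict[int, float]:
    """Convert logits to probabilities (max-shifted for numerical stability)."""
    top = max(logits.values())
    exps = {t: math.exp(x - top) for t, x in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def apply_temperature(logits: dict[int, float], temp: float) -> dict[int, float]:
    """temp < 1 sharpens the distribution, temp > 1 flattens it (temp > 0)."""
    return {t: x / temp for t, x in logits.items()}


def top_k(logits: dict[int, float], k: int) -> dict[int, float]:
    """Keep the k tokens with the highest logits."""
    keep = sorted(logits, key=logits.get, reverse=True)[:k]
    return {t: logits[t] for t in keep}


def top_p(logits: dict[int, float], p: float) -> dict[int, float]:
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = softmax(logits)
    kept: dict[int, float] = {}
    cumulative = 0.0
    for t in sorted(probs, key=probs.get, reverse=True):
        kept[t] = logits[t]
        cumulative += probs[t]
        if cumulative >= p:
            break
    return kept


def min_p(logits: dict[int, float], ratio: float) -> dict[int, float]:
    """Keep tokens whose probability is at least ratio times the top probability."""
    probs = softmax(logits)
    threshold = ratio * max(probs.values())
    return {t: x for t, x in logits.items() if probs[t] >= threshold}


def repetition_penalty(
    logits: dict[int, float], recent: list[int], penalty: float
) -> dict[int, float]:
    """Lower the logits of recently generated tokens (assumed convention)."""
    out = dict(logits)
    for t in set(recent):
        if t in out:
            out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```

As a worked case, take four tokens with probabilities 0.5, 0.3, 0.15, and 0.05. `top_p` with p = 0.7 keeps the first two, since their cumulative probability of 0.8 is the first to reach 0.7. `min_p` with a ratio of 0.2 sets the threshold at 0.1 and keeps the first three. Both filters return logits rather than probabilities so that they can be composed in any order and renormalized once at the end.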
