

Principle:Ollama Token Sampling

From Leeroopedia
Domains: NLP, Probability, Inference
Last Updated: 2026-02-14 00:00 GMT

Overview

A configurable token selection mechanism that transforms raw model logits into a probability distribution and samples the next token using temperature scaling, top-k filtering, top-p (nucleus) sampling, and min-p thresholding.

Description

Token Sampling is the core decoding step in autoregressive language model inference. After the model produces a logit vector over the entire vocabulary, the sampler applies a pipeline of transforms to select the next token. This pipeline controls the tradeoff between coherence (low temperature, greedy) and creativity (high temperature, diverse sampling).

The sampling pipeline supports:

  • Temperature scaling: Divides logits by temperature before softmax, controlling distribution sharpness.
  • Top-k filtering: Retains only the k highest-probability tokens.
  • Top-p (nucleus) sampling: Retains the smallest set of tokens whose cumulative probability exceeds p.
  • Min-p thresholding: Removes tokens with probability below min_p times the maximum probability.
  • Grammar-constrained sampling: Optionally applies a BNF grammar to mask tokens that would produce invalid output (e.g., for JSON generation).
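The transforms above can be composed into a single sampling function. The sketch below is a minimal, self-contained illustration of one plausible pipeline; the function name, default values, and the exact ordering of the filters are assumptions for illustration, as real implementations differ in which transform runs first.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.9,
                      min_p=0.05, seed=None):
    """Sketch of a temperature/top-k/top-p/min-p sampling pipeline.

    `logits` is a list of raw scores, one per vocabulary token.
    The parameter names mirror common inference-server options; the
    order of the filters varies between implementations.
    """
    rng = random.Random(seed)

    # Greedy decoding: temperature 0 short-circuits to argmax.
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]

    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda ip: ip[1], reverse=True)
    probs = probs[:top_k]

    # Min-p: drop tokens below min_p times the maximum probability.
    cutoff = min_p * probs[0][1]
    probs = [ip for ip in probs if ip[1] >= cutoff]

    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, cum = [], 0.0
    for ip in probs:
        kept.append(ip)
        cum += ip[1]
        if cum >= top_p:
            break

    # Renormalize over the surviving tokens and sample.
    total = sum(p for _, p in kept)
    r = rng.random() * total
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With a sharply peaked logit vector, the min-p and top-p filters typically leave only the dominant token, so the sampler behaves almost greedily even at moderate temperatures.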

Usage

Use this principle in any autoregressive text generation system where controllable diversity is needed. The sampling parameters (temperature, top_k, top_p, min_p, seed) are typically exposed as user-facing API options.
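As one illustration of how these options surface in an API, the dictionary below sketches a request body in the style of an Ollama `/api/generate` call. The model name and prompt are placeholders; the option keys match the sampling parameters discussed above.

```python
# Hypothetical request body for an Ollama-style /api/generate endpoint.
# "llama3" and the prompt are illustrative placeholders.
payload = {
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "options": {
        "temperature": 0.8,  # distribution sharpness
        "top_k": 40,         # keep the 40 highest-probability tokens
        "top_p": 0.9,        # nucleus cumulative-mass threshold
        "min_p": 0.05,       # relative probability floor
        "seed": 42,          # reproducible sampling
    },
}
```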

Theoretical Basis

The sampling pipeline processes logits through sequential transforms:

$$\text{logits}_{\text{scaled}} = \frac{\text{logits}}{T}$$

$$P(x_i) = \frac{e^{\text{logits}_i}}{\sum_j e^{\text{logits}_j}} \quad \text{(softmax)}$$
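The effect of the temperature transform can be demonstrated numerically. The snippet below is a small sketch (function name assumed for illustration) showing that dividing logits by a lower temperature concentrates probability mass on the top token.

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T, then apply a numerically stable softmax.
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cool = softmax_with_temperature(logits, 0.5)  # sharper distribution
warm = softmax_with_temperature(logits, 2.0)  # flatter distribution
```

At T = 0.5 the leading token takes a much larger share of the probability mass than at T = 2.0, while both outputs still sum to 1.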

Top-k: Sort tokens by logit, keep only the top k.

Top-p: After softmax, accumulate probabilities from highest to lowest; keep tokens until cumulative probability exceeds p.
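This accumulation step can be written as a short filter over (token, probability) pairs; the function below is a sketch with an assumed name and input shape.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of (token, prob) pairs whose cumulative
    probability reaches p, scanning from most to least probable."""
    ranked = sorted(probs, key=lambda tp: tp[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    return kept
```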

Min-p: Remove any token with probability below $\text{min\_p} \times P_{\max}$.
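Because the threshold is relative to the maximum probability, min-p prunes aggressively when the model is confident and permissively when the distribution is flat. A minimal sketch (name and input shape assumed):

```python
def min_p_filter(probs, min_p):
    # Tokens must reach min_p times the maximum probability to survive.
    p_max = max(p for _, p in probs)
    return [(tok, p) for tok, p in probs if p >= min_p * p_max]
```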

Greedy (T=0): Return argmax(logits) directly, skipping all stochastic transforms.

Grammar Masking: Before sampling, set logits to -∞ for tokens that would violate the grammar state.
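The masking step above can be sketched as follows; here `allowed` is an assumed stand-in for the set of token ids the grammar state currently permits, which a real implementation would derive from a BNF grammar automaton.

```python
import math

def mask_invalid_tokens(logits, allowed):
    # Set the logits of grammar-forbidden tokens to -inf so that
    # softmax assigns them exactly zero probability.
    return [l if i in allowed else -math.inf
            for i, l in enumerate(logits)]
```

Applying the mask before softmax guarantees that sampling can never select a token that would violate the grammar, regardless of temperature or the other filters.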

Related Pages

Implemented By

Uses Heuristic
