

Principle:Ollama Token Sampling

From Leeroopedia
Domains: NLP, Probability, Inference
Last Updated: 2026-02-14 00:00 GMT

Overview

A configurable token selection mechanism that transforms raw model logits into a probability distribution and samples the next token using temperature scaling, top-k filtering, top-p (nucleus) sampling, and min-p thresholding.

Description

Token Sampling is the core decoding step in autoregressive language model inference. After the model produces a logit vector over the entire vocabulary, the sampler applies a pipeline of transforms to select the next token. This pipeline controls the tradeoff between coherence (low temperature, greedy) and creativity (high temperature, diverse sampling).

The sampling pipeline supports:

  • Temperature scaling: Divides logits by temperature before softmax, controlling distribution sharpness.
  • Top-k filtering: Retains only the k highest-probability tokens.
  • Top-p (nucleus) sampling: Retains the smallest set of tokens whose cumulative probability exceeds p.
  • Min-p thresholding: Removes tokens with probability below min_p times the maximum probability.
  • Grammar-constrained sampling: Optionally applies a BNF grammar to mask tokens that would produce invalid output (e.g., for JSON generation).
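The transforms above can be composed into a single sampling function. The sketch below is a minimal, self-contained illustration of one plausible pipeline; the function name, default values, and the exact ordering of the filters are assumptions for illustration, as real implementations differ in which transform runs first.

```python
import math
import random

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.9,
                      min_p=0.05, seed=None):
    """Sketch of a temperature/top-k/top-p/min-p sampling pipeline.

    `logits` is a list of raw scores, one per vocabulary token.
    The parameter names mirror common inference-server options; the
    order of the filters varies between implementations.
    """
    rng = random.Random(seed)

    # Greedy decoding: temperature 0 short-circuits to argmax.
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]

    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda ip: ip[1], reverse=True)
    probs = probs[:top_k]

    # Min-p: drop tokens below min_p times the maximum probability.
    cutoff = min_p * probs[0][1]
    probs = [ip for ip in probs if ip[1] >= cutoff]

    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, cum = [], 0.0
    for ip in probs:
        kept.append(ip)
        cum += ip[1]
        if cum >= top_p:
            break

    # Renormalize over the surviving tokens and sample.
    total = sum(p for _, p in kept)
    r = rng.random() * total
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With a sharply peaked logit vector, the min-p and top-p filters typically leave only the dominant token, so the sampler behaves almost greedily even at moderate temperatures.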

Usage

Use this principle in any autoregressive text generation system where controllable diversity is needed. The sampling parameters (temperature, top_k, top_p, min_p, seed) are typically exposed as user-facing API options.
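As one illustration of how these options surface in an API, the dictionary below sketches a request body in the style of an Ollama `/api/generate` call. The model name and prompt are placeholders; the option keys match the sampling parameters discussed above.

```python
# Hypothetical request body for an Ollama-style /api/generate endpoint.
# "llama3" and the prompt are illustrative placeholders.
payload = {
    "model": "llama3",
    "prompt": "Why is the sky blue?",
    "options": {
        "temperature": 0.8,  # distribution sharpness
        "top_k": 40,         # keep the 40 highest-probability tokens
        "top_p": 0.9,        # nucleus cumulative-mass threshold
        "min_p": 0.05,       # relative probability floor
        "seed": 42,          # reproducible sampling
    },
}
```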

Theoretical Basis

The sampling pipeline processes logits through sequential transforms:

$$\text{logits}_{\text{scaled}} = \frac{\text{logits}}{T}$$

$$P(x_i) = \frac{e^{\text{logits}_i}}{\sum_j e^{\text{logits}_j}} \quad \text{(softmax)}$$
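The effect of the temperature transform can be demonstrated numerically. The snippet below is a small sketch (function name assumed for illustration) showing that dividing logits by a lower temperature concentrates probability mass on the top token.

```python
import math

def softmax_with_temperature(logits, T):
    # Divide logits by T, then apply a numerically stable softmax.
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cool = softmax_with_temperature(logits, 0.5)  # sharper distribution
warm = softmax_with_temperature(logits, 2.0)  # flatter distribution
```

At T = 0.5 the leading token takes a much larger share of the probability mass than at T = 2.0, while both outputs still sum to 1.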

Top-k: Sort tokens by logit, keep only the top k.

Top-p: After softmax, accumulate probabilities from highest to lowest; keep tokens until cumulative probability exceeds p.
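This accumulation step can be written as a short filter over (token, probability) pairs; the function below is a sketch with an assumed name and input shape.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of (token, prob) pairs whose cumulative
    probability reaches p, scanning from most to least probable."""
    ranked = sorted(probs, key=lambda tp: tp[1], reverse=True)
    kept, cum = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cum += prob
        if cum >= p:
            break
    return kept
```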

Min-p: Remove any token with probability below $\text{min\_p} \times P_{\max}$.
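Because the threshold is relative to the maximum probability, min-p prunes aggressively when the model is confident and permissively when the distribution is flat. A minimal sketch (name and input shape assumed):

```python
def min_p_filter(probs, min_p):
    # Tokens must reach min_p times the maximum probability to survive.
    p_max = max(p for _, p in probs)
    return [(tok, p) for tok, p in probs if p >= min_p * p_max]
```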

Greedy (T=0): Return argmax(logits) directly, skipping all stochastic transforms.

Grammar Masking: Before sampling, set logits to -∞ for tokens that would violate the grammar state.
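The masking step above can be sketched as follows; here `allowed` is an assumed stand-in for the set of token ids the grammar state currently permits, which a real implementation would derive from a BNF grammar automaton.

```python
import math

def mask_invalid_tokens(logits, allowed):
    # Set the logits of grammar-forbidden tokens to -inf so that
    # softmax assigns them exactly zero probability.
    return [l if i in allowed else -math.inf
            for i, l in enumerate(logits)]
```

Applying the mask before softmax guarantees that sampling can never select a token that would violate the grammar, regardless of temperature or the other filters.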

Related Pages

Implemented By

Uses Heuristic
