Principle:Ollama Ollama Sampling Pipeline

Knowledge Sources	Ollama
Domains	Sampling, Inference
Last Updated	2025-02-15 00:00 GMT

Overview

The Sampling Pipeline defines the ordered sequence of transformations applied to raw model logits to select the next token during autoregressive generation, combining filtering, scaling, normalization, and stochastic selection into a composable chain.

Core Concepts

Pipeline Architecture

The sampling pipeline processes logits through a fixed sequence of stages: top-k filtering, temperature scaling, softmax normalization, top-p (nucleus) filtering, min-p thresholding, and finally stochastic token selection via cumulative distribution sampling. Each stage narrows or reshapes the probability distribution before the final selection. Because the order is fixed, every stage receives a known kind of input: top-k limits the candidate set before temperature scaling and softmax run, and top-p and min-p operate on normalized probabilities rather than raw logits.
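The stage order described above can be sketched in Go as follows. This is a minimal illustration, not the Ollama implementation: the token type, the function signatures, the pipeline helper, and the choice to pass the random draw r in explicitly are all assumptions made to keep the example small. It also assumes a non-empty logit slice and a temperature greater than zero.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// token pairs a vocabulary id with a score (a logit, later a probability).
type token struct {
	id    int32
	value float32
}

// topK sorts by descending value and keeps the k best tokens (k <= 0 keeps all).
func topK(ts []token, k int) []token {
	sort.SliceStable(ts, func(i, j int) bool { return ts[i].value > ts[j].value })
	if k > 0 && k < len(ts) {
		return ts[:k]
	}
	return ts
}

// temperature divides each logit by t; callers must pass t > 0.
func temperature(ts []token, t float32) {
	for i := range ts {
		ts[i].value /= t
	}
}

// softmax turns logits into probabilities in place; ts must be sorted descending.
func softmax(ts []token) {
	maxLogit := ts[0].value // subtracting the maximum avoids overflow in Exp
	var sum float32
	for i := range ts {
		ts[i].value = float32(math.Exp(float64(ts[i].value - maxLogit)))
		sum += ts[i].value
	}
	for i := range ts {
		ts[i].value /= sum
	}
}

// topP keeps the shortest sorted prefix whose cumulative probability reaches p.
func topP(ts []token, p float32) []token {
	var cum float32
	for i := range ts {
		cum += ts[i].value
		if cum >= p {
			return ts[:i+1]
		}
	}
	return ts
}

// minP keeps tokens whose probability is at least p times the top probability.
func minP(ts []token, p float32) []token {
	for i := range ts {
		if ts[i].value < ts[0].value*p {
			return ts[:i] // sorted input: later tokens are smaller still
		}
	}
	return ts
}

// pick walks the cumulative distribution; r is a uniform draw in [0, 1).
func pick(ts []token, r float32) token {
	var total, cum float32
	for _, t := range ts {
		total += t.value
	}
	for _, t := range ts {
		cum += t.value
		if r*total < cum {
			return t
		}
	}
	return ts[len(ts)-1]
}

// pipeline applies the stages in the order the text lists them.
func pipeline(logits []float32, k int, temp, p, mp float32) []token {
	ts := make([]token, len(logits))
	for i, v := range logits {
		ts[i] = token{id: int32(i), value: v}
	}
	ts = topK(ts, k)
	temperature(ts, temp)
	softmax(ts)
	return minP(topP(ts, p), mp)
}

func main() {
	ts := pipeline([]float32{2.0, 1.0, 0.5, -1.0}, 3, 0.8, 0.9, 0.05)
	fmt.Println(len(ts), pick(ts, 0.5).id)
}
```

Sorting once in topK lets the later stages work on prefixes of the slice, which is why topP and minP here can return sub-slices instead of building new ones. The final pick rescales r by the remaining mass because filtering after softmax leaves probabilities that no longer sum to one.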

Greedy vs. Stochastic Paths

When temperature is set to zero, the pipeline short-circuits to greedy decoding by returning the argmax token directly, bypassing all stochastic stages. This provides deterministic output for applications requiring reproducibility. For non-zero temperatures, the full stochastic pipeline runs, producing varied outputs that balance coherence and creativity.
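The short-circuit can be illustrated with a small Go sketch. The function names and the idea of passing the stochastic path in as a callback are assumptions for illustration only; the point is that a temperature of zero never reaches any random stage.

```go
package main

import "fmt"

// argmax returns the index of the largest logit; ties go to the lowest index.
func argmax(logits []float32) int32 {
	best := 0
	for i, v := range logits {
		if v > logits[best] {
			best = i
		}
	}
	return int32(best)
}

// sample is a hypothetical entry point: temperature 0 takes the greedy path,
// any other value is delegated to a stochastic sampler supplied by the caller.
func sample(logits []float32, temp float32, stochastic func([]float32) int32) int32 {
	if temp == 0 {
		return argmax(logits)
	}
	return stochastic(logits)
}

func main() {
	logits := []float32{0.1, 2.3, 1.7}
	called := false
	id := sample(logits, 0, func([]float32) int32 { called = true; return -1 })
	fmt.Println(id, called) // prints "1 false": the stochastic path never ran
}
```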

Grammar Integration

An optional grammar sampler can be inserted into the pipeline to constrain output to valid sequences according to a formal grammar (typically BNF-derived). The grammar sampler masks logits for tokens that would produce invalid output by setting them to negative infinity before the standard sampling stages. As an optimization, the pipeline first checks whether the top-ranked token is grammar-valid; if so, it skips the expensive full-vocabulary grammar application.

Sampler Configuration

The pipeline is parameterized by temperature, top-k, top-p, min-p, and an optional random seed. These parameters are exposed through the Ollama API and Modelfile PARAMETER directives, allowing per-request or per-model configuration. The NewSampler constructor validates and normalizes these parameters, clamping values to valid ranges.
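As an illustration of what validating and clamping can look like, the Go sketch below normalizes a parameter set. The config type, its field names, and the specific ranges (negative temperature and top-k raised to 0, top-p and min-p held to [0, 1]) are assumptions chosen for the example and are not taken from NewSampler itself.

```go
package main

import "fmt"

// config mirrors the parameters named in the text. Field names and the
// ranges enforced below are assumptions for illustration.
type config struct {
	temperature float32
	topK        int
	topP        float32
	minP        float32
}

// clamp limits v to the closed interval [lo, hi].
func clamp(v, lo, hi float32) float32 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// normalize returns a copy of c with out-of-range values pulled into range.
func normalize(c config) config {
	if c.temperature < 0 {
		c.temperature = 0
	}
	if c.topK < 0 {
		c.topK = 0
	}
	c.topP = clamp(c.topP, 0, 1)
	c.minP = clamp(c.minP, 0, 1)
	return c
}

func main() {
	in := config{temperature: -0.5, topK: -1, topP: 1.4, minP: 0.05}
	fmt.Printf("%+v\n", normalize(in))
}
```

Normalizing once at construction time means the per-token stages can assume sane inputs and skip their own range checks.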

Reproducibility

When a seed is provided, the pipeline uses a deterministic PCG random number generator initialized from the seed. This ensures that identical inputs with identical seeds produce identical outputs, enabling reproducible experiments and debugging.
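The effect of seeding can be demonstrated with the PCG source from Go's math/rand/v2 package (Go 1.22 or later). This is a sketch under stated assumptions: the helper names are hypothetical, and because PCG takes two 64-bit seed words, reusing the same seed for both is a simplification made here rather than a description of how the pipeline derives them.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// newRNG builds a PCG-backed generator. Passing the seed as both PCG seed
// words is a simplification for this sketch.
func newRNG(seed uint64) *rand.Rand {
	return rand.New(rand.NewPCG(seed, seed))
}

// draw returns n values in [0, 1), the kind of number a cumulative
// distribution sampler would consume at each decoding step.
func draw(rng *rand.Rand, n int) []float32 {
	out := make([]float32, n)
	for i := range out {
		out[i] = rng.Float32()
	}
	return out
}

func main() {
	a := draw(newRNG(42), 3)
	b := draw(newRNG(42), 3)
	fmt.Println(a[0] == b[0] && a[1] == b[1] && a[2] == b[2]) // prints "true"
}
```

Two generators built from the same seed emit the same sequence, so a sampler that consumes one draw per token will make the same selections as long as the logits and parameters are also identical.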

Implementation Notes

The sampling pipeline is implemented in sample/samplers.go. The Sampler struct holds the configuration and random state, and its Sample method is the entry point that orchestrates the full pipeline. Individual transform stages (topK, temperature, softmax, topP, minP) are implemented in sample/transforms.go as standalone functions that operate on token slices. The grammar sampler wraps llama.cpp's grammar implementation and is integrated as an optional constraint step, as described under Grammar Integration.

Related Pages

Implementation:Ollama_Ollama_Llama_Sampling
