Principle:Ollama Ollama Sampling Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Sampling, Inference |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
The Sampling Pipeline defines the ordered sequence of transformations applied to raw model logits to select the next token during autoregressive generation, combining filtering, scaling, normalization, and stochastic selection into a composable chain.
Core Concepts
Pipeline Architecture
The sampling pipeline processes logits through a fixed sequence of stages: top-k filtering, temperature scaling, softmax normalization, top-p (nucleus) filtering, min-p thresholding, and finally stochastic token selection via cumulative distribution sampling. Each stage narrows or reshapes the probability distribution before the final selection. Because the order is fixed, top-k always operates on raw logits, while top-p and min-p operate on the normalized probabilities that softmax produces, so filtering and scaling interact predictably.
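The stage order above can be sketched as a chain of small functions. This is a minimal illustration, not the code in the Ollama repository: the `token` type, the function signatures, and the choice to pass the uniform draw `r` in as an argument are all assumptions made to keep the example self-contained. It assumes a non-empty logit slice, a temperature above zero, and top-p and min-p values in [0, 1].

```go
package main

import (
	"fmt"
	"math"
	"math/rand/v2"
	"sort"
)

// token pairs a vocabulary id with a value: a logit before softmax,
// a probability after it. Hypothetical type for this sketch.
type token struct {
	id    int
	value float32
}

// topK sorts by descending value and keeps at most k tokens (k <= 0 keeps all).
func topK(ts []token, k int) []token {
	sort.Slice(ts, func(i, j int) bool { return ts[i].value > ts[j].value })
	if k > 0 && k < len(ts) {
		ts = ts[:k]
	}
	return ts
}

// temperature divides every logit by t; t must be greater than zero.
func temperature(ts []token, t float32) {
	for i := range ts {
		ts[i].value /= t
	}
}

// softmax turns logits into probabilities in place. It relies on ts being
// sorted descending, so ts[0] holds the maximum used for numerical stability.
func softmax(ts []token) {
	maxV := ts[0].value
	var sum float32
	for i := range ts {
		ts[i].value = float32(math.Exp(float64(ts[i].value - maxV)))
		sum += ts[i].value
	}
	for i := range ts {
		ts[i].value /= sum
	}
}

// topP keeps the shortest sorted prefix whose cumulative probability reaches p.
func topP(ts []token, p float32) []token {
	var cum float32
	for i := range ts {
		cum += ts[i].value
		if cum >= p {
			return ts[:i+1]
		}
	}
	return ts
}

// minP drops tokens whose probability is below p times the top probability.
func minP(ts []token, p float32) []token {
	threshold := ts[0].value * p
	for i, t := range ts {
		if t.value < threshold {
			return ts[:i]
		}
	}
	return ts
}

// sample runs the stages in order and picks a token by walking the
// cumulative distribution with r, a uniform draw in [0, 1).
func sample(logits []float32, k int, temp, p, mp, r float32) int {
	ts := make([]token, len(logits))
	for i, l := range logits {
		ts[i] = token{id: i, value: l}
	}
	ts = topK(ts, k)
	temperature(ts, temp)
	softmax(ts)
	ts = topP(ts, p)
	ts = minP(ts, mp)

	var total float32
	for _, t := range ts {
		total += t.value
	}
	target := r * total // rescale because filtering removed some mass
	var cum float32
	for _, t := range ts {
		cum += t.value
		if target < cum {
			return t.id
		}
	}
	return ts[len(ts)-1].id // guards against float rounding at the tail
}

func main() {
	logits := []float32{2.0, 1.0, 0.5, -1.0}
	fmt.Println(sample(logits, 3, 0.8, 0.9, 0.05, rand.Float32()))
}
```

Sorting once in `topK` lets the later stages work on prefixes of the slice, which is why `topP` and `minP` can simply truncate rather than scan and copy.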
Greedy vs. Stochastic Paths
When temperature is set to zero, the pipeline short-circuits to greedy decoding by returning the argmax token directly, bypassing all stochastic stages. This provides deterministic output for applications requiring reproducibility. For non-zero temperatures, the full stochastic pipeline runs, so repeated runs on the same input can produce different outputs unless a seed is fixed.
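A minimal sketch of the short-circuit, assuming the stochastic path is supplied as a callback. The names `greedy`, `nextToken`, and the `stochastic` parameter are hypothetical and chosen only for this illustration.

```go
package main

import "fmt"

// greedy returns the index of the largest logit; the first one wins on ties.
func greedy(logits []float32) int {
	best := 0
	for i, l := range logits {
		if l > logits[best] {
			best = i
		}
	}
	return best
}

// nextToken returns the argmax when temperature is zero and otherwise
// defers to the caller-supplied stochastic sampler (a hypothetical hook).
func nextToken(logits []float32, temperature float32, stochastic func([]float32) int) int {
	if temperature == 0 {
		return greedy(logits)
	}
	return stochastic(logits)
}

func main() {
	logits := []float32{0.3, 2.1, -0.5}
	// The stochastic callback is never invoked at temperature 0.
	fmt.Println(nextToken(logits, 0, nil)) // prints 1
}
```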
Grammar Integration
An optional grammar sampler can be inserted into the pipeline to constrain output to valid sequences according to a formal grammar (typically BNF-derived). The grammar sampler masks logits for tokens that would produce invalid output by setting them to negative infinity before the standard sampling stages. As an optimization, the pipeline first checks whether the top-ranked token is grammar-valid; if so, it skips the expensive full-vocabulary grammar application.
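The masking step and the fast-path check can be sketched as follows. The grammar is reduced here to a hypothetical predicate `validFn`; a real grammar sampler tracks parse state and is considerably more involved. The sketch uses argmax as its selection rule so that checking only the top-ranked token is sufficient; it is an illustration of the idea, not the repository's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// validFn reports whether the grammar accepts a token id in its current
// state. Hypothetical stand-in for a real grammar check.
type validFn func(id int) bool

// argmax returns the index of the largest logit.
func argmax(logits []float32) int {
	best := 0
	for i, l := range logits {
		if l > logits[best] {
			best = i
		}
	}
	return best
}

// maskInvalid sets the logit of every rejected token to negative infinity,
// so no later stage can select it. This touches the whole vocabulary.
func maskInvalid(logits []float32, valid validFn) {
	negInf := float32(math.Inf(-1))
	for id := range logits {
		if !valid(id) {
			logits[id] = negInf
		}
	}
}

// pickConstrained tries the cheap path first: if the top-ranked token is
// already grammar-valid, it is returned without masking. Otherwise the full
// mask is applied and the best remaining token is chosen.
func pickConstrained(logits []float32, valid validFn) int {
	if top := argmax(logits); valid(top) {
		return top
	}
	maskInvalid(logits, valid)
	return argmax(logits)
}

func main() {
	logits := []float32{1.0, 3.0, 2.0}
	notOne := func(id int) bool { return id != 1 } // toy grammar: token 1 is illegal
	fmt.Println(pickConstrained(logits, notOne))   // prints 2
}
```

If the grammar rejects every token, all logits become negative infinity and `argmax` falls back to index 0; a production sampler would need to treat that case as an error.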
Sampler Configuration
The pipeline is parameterized by temperature, top-k, top-p, min-p, and an optional random seed. These parameters are exposed through the Ollama API and Modelfile PARAMETER directives, allowing per-request or per-model configuration. The NewSampler constructor validates and normalizes these parameters, clamping values to valid ranges.
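The validation step might look roughly like the sketch below. The `config` type, the `newConfig` function, and the specific ranges (temperature at least 0, top-k at least 0, top-p and min-p within [0, 1]) are assumptions for illustration; the text does not state the ranges that NewSampler actually enforces.

```go
package main

import "fmt"

// config holds the parameters named in the text. Field names are hypothetical.
type config struct {
	temperature float32
	topK        int
	topP        float32
	minP        float32
	seed        int64 // a negative seed means "unseeded" in this sketch
}

// clamp restricts v to the closed interval [lo, hi].
func clamp(v, lo, hi float32) float32 {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// newConfig normalizes raw parameter values. The ranges used here are
// assumed for the example, not taken from the Ollama source.
func newConfig(temperature float32, topK int, topP, minP float32, seed int64) config {
	if temperature < 0 {
		temperature = 0 // zero selects the greedy path
	}
	if topK < 0 {
		topK = 0 // zero disables top-k filtering in this sketch
	}
	return config{
		temperature: temperature,
		topK:        topK,
		topP:        clamp(topP, 0, 1),
		minP:        clamp(minP, 0, 1),
		seed:        seed,
	}
}

func main() {
	c := newConfig(-1, -5, 1.7, -0.2, 42)
	fmt.Printf("%+v\n", c)
}
```

Normalizing once in the constructor means the individual stages can assume well-formed inputs and skip their own range checks.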
Reproducibility
When a seed is provided, the pipeline uses a deterministic PCG random number generator initialized from the seed. This ensures that identical inputs with identical seeds produce identical outputs, enabling reproducible experiments and debugging.
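A small sketch of seeded generation with the PCG source from Go's `math/rand/v2` package. The helper names `newRNG` and `draws` are hypothetical, and using the same seed value for both PCG state words is an assumption made here for brevity.

```go
package main

import (
	"fmt"
	"math/rand/v2"
)

// newRNG builds a PCG-backed generator from one seed. Reusing the seed for
// both PCG words is a simplification for this sketch.
func newRNG(seed uint64) *rand.Rand {
	return rand.New(rand.NewPCG(seed, seed))
}

// draws returns n uniform samples in [0, 1) from a freshly seeded generator.
func draws(seed uint64, n int) []float32 {
	rng := newRNG(seed)
	out := make([]float32, n)
	for i := range out {
		out[i] = rng.Float32()
	}
	return out
}

func main() {
	a, b := draws(42, 3), draws(42, 3)
	// Two generators with the same seed yield the same sequence.
	fmt.Println(a[0] == b[0] && a[1] == b[1] && a[2] == b[2]) // prints true
}
```

Because each uniform draw decides which token the cumulative distribution selects, a reproducible draw sequence is what makes the sampled token sequence reproducible for identical inputs.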
Implementation Notes
The sampling pipeline is implemented in sample/samplers.go. The Sampler struct holds the configuration and random state, and its Sample method is the entry point that orchestrates the full pipeline. Individual transform stages (topK, temperature, softmax, topP, minP) are implemented in sample/transforms.go as standalone functions that operate on token slices. The grammar sampler wraps llama.cpp's grammar implementation and is applied only when a grammar is configured.