# Heuristic: Ollama Sampling Numerical Stability
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Inference |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview
Numerical stability techniques for token sampling: temperature clamping at 1e-7, max-normalized softmax to prevent overflow, and grammar fast-path optimization that checks only the top logit before full grammar application.
## Description
The Ollama sampling pipeline applies three key numerical stability techniques to prevent NaN/Inf values and improve performance during token generation:
- Temperature clamping: The temperature divisor is clamped to a minimum of 1e-7 to prevent division by zero or extreme logit scaling.
- Max-normalized softmax: Before computing `exp(x)`, the maximum logit is subtracted from all values. This prevents float32 overflow while preserving relative probabilities.
- Grammar fast-path: When grammar-constrained generation is active, the system first checks if the highest-probability token is grammar-valid. If so, it skips the expensive full grammar application across all tokens.
## Usage
These techniques are always active during inference and apply in the `Sampler_Sample` implementation. They are particularly critical when users set extreme sampling parameters (temperature near 0, very low top-p) or when models produce logits with very large magnitude differences.
## The Insight (Rule of Thumb)
- Action: Clamp temperature to `max(temp, 1e-7)` before dividing logits.
- Action: Subtract max logit before computing exp() in softmax.
- Action: For grammar sampling, test only the top token first before applying grammar to all tokens.
- Value: The 1e-7 temperature floor prevents division by zero at temperature 0. Max-normalization prevents float32 overflow (float32 max is ~3.4e38, so `exp` overflows for arguments above ~88.7).
- Trade-off: The grammar fast-path saves significant computation when the most likely token is grammar-valid (common case), at the cost of one additional grammar check on the rare miss path.
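The fast path in the last bullet can be sketched as follows. `sampleWithGrammar` and `grammarValid` are illustrative names, not Ollama's actual API; the sketch assumes greedy selection and at least one grammar-valid token in the vocabulary:

```go
package main

import "fmt"

// token pairs a vocabulary id with its logit, mirroring the sampler's
// token struct elsewhere in this section.
type token struct {
	id    int32
	value float32
}

// sampleWithGrammar sketches the fast path: test only the argmax token
// first, and fall back to filtering the full vocabulary only on a miss.
func sampleWithGrammar(ts []token, grammarValid func(id int32) bool) int32 {
	// Find the highest-logit token.
	best := 0
	for i := range ts {
		if ts[i].value > ts[best].value {
			best = i
		}
	}
	// Fast path: one grammar check instead of len(ts) checks.
	if grammarValid(ts[best].id) {
		return ts[best].id
	}
	// Slow path (rare): apply the grammar to every token, then re-pick.
	var valid []token
	for _, t := range ts {
		if grammarValid(t.id) {
			valid = append(valid, t)
		}
	}
	best = 0
	for i := range valid {
		if valid[i].value > valid[best].value {
			best = i
		}
	}
	return valid[best].id
}

func main() {
	ts := []token{{0, 1.5}, {1, 3.0}, {2, 2.5}}
	// Toy grammar: only even token ids are valid, so the argmax (id 1)
	// misses and the slow path picks id 2.
	even := func(id int32) bool { return id%2 == 0 }
	fmt.Println(sampleWithGrammar(ts, even))
}
```

In the common case where the argmax token is already grammar-valid, the function returns after a single check, which is what makes the fast path pay off.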
## Reasoning
Without temperature clamping, a temperature of 0 would cause division by zero, producing NaN values that propagate through the rest of the sampling chain. The 1e-7 floor effectively makes temperature=0 behave as "greedy" sampling (pick the highest logit) without special-case code.
Max-normalized softmax is a standard technique in numerical computing. Raw logit values can exceed +/- 100, and `exp(100)` would overflow float32. By subtracting the max, the largest exponent becomes `exp(0) = 1`, keeping all values in a safe range while preserving the probability distribution.
Temperature clamping from `sample/transforms.go:29-35`:
```go
func temperature(ts []token, temp float32) {
	// Ensure temperature clipping near 0 to avoid numerical instability
	temp = max(temp, 1e-7)
	for i := range ts {
		ts[i].value = ts[i].value / temp
	}
}
```
Max-normalized softmax from `sample/transforms.go:38-58`:
```go
func softmax(ts []token) {
	// Find max logit for numerical stability
	maxLogit := float32(math.Inf(-1))
	for _, t := range ts {
		if t.value > maxLogit {
			maxLogit = t.value
		}
	}

	// Compute exp(x - max)
	var sum float32
	for i, v := range ts {
		ts[i].value = float32(math.Exp(float64(v.value - maxLogit)))
		sum += ts[i].value
	}

	// exp(x - max) / sum(exp(x - max))
	for i := range ts {
		ts[i].value /= sum
	}
}
```
Parameter clamping from `sample/samplers.go:139-155`:
```go
if temperature < 0.0 {
	temperature = 0.0
}
if topP < 0.0 {
	topP = 0.0
}
if topP >= 1.0 {
	topP = 1.0
}
if minP < 0.0 {
	minP = 0.0
}
if minP >= 1.0 {
	minP = 1.0
}
```