# Heuristic: Ollama Sampling Numerical Stability
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Inference |
| Last Updated | 2026-02-14 22:00 GMT |
## Overview
Numerical stability techniques for token sampling: temperature clamping at 1e-7, max-normalized softmax to prevent overflow, and grammar fast-path optimization that checks only the top logit before full grammar application.
## Description
The Ollama sampling pipeline applies three key numerical stability techniques to prevent NaN/Inf values and improve performance during token generation:
- Temperature clamping: The temperature divisor is clamped to a minimum of 1e-7 to prevent division by zero or extreme logit scaling.
- Max-normalized softmax: Before computing `exp(x)`, the maximum logit is subtracted from all values. This prevents float32 overflow while preserving relative probabilities.
- Grammar fast-path: When grammar-constrained generation is active, the system first checks if the highest-probability token is grammar-valid. If so, it skips the expensive full grammar application across all tokens.
## Usage
These techniques are always active during inference and apply in the `Sampler_Sample` implementation. They are particularly critical when users set extreme sampling parameters (temperature near 0, very low top-p) or when models produce logits with very large magnitude differences.
## The Insight (Rule of Thumb)
- Action: Clamp temperature to `max(temp, 1e-7)` before dividing logits.
- Action: Subtract max logit before computing exp() in softmax.
- Action: For grammar sampling, test only the top token first before applying grammar to all tokens.
- Value: The 1e-7 temperature floor prevents division by zero at temperature 0. Max-normalization prevents float32 overflow (float32 max is ~3.4e38, so `exp` overflows for arguments above ~88.7).
- Trade-off: The grammar fast-path saves significant computation when the most likely token is grammar-valid (common case), at the cost of one additional grammar check on the rare miss path.
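The fast path in the last bullet can be sketched as follows. `sampleWithGrammar` and `grammarValid` are illustrative names, not Ollama's actual API; the sketch assumes greedy selection and at least one grammar-valid token in the vocabulary:

```go
package main

import "fmt"

// token pairs a vocabulary id with its logit, mirroring the sampler's
// token struct elsewhere in this section.
type token struct {
	id    int32
	value float32
}

// sampleWithGrammar sketches the fast path: test only the argmax token
// first, and fall back to filtering the full vocabulary only on a miss.
func sampleWithGrammar(ts []token, grammarValid func(id int32) bool) int32 {
	// Find the highest-logit token.
	best := 0
	for i := range ts {
		if ts[i].value > ts[best].value {
			best = i
		}
	}
	// Fast path: one grammar check instead of len(ts) checks.
	if grammarValid(ts[best].id) {
		return ts[best].id
	}
	// Slow path (rare): apply the grammar to every token, then re-pick.
	var valid []token
	for _, t := range ts {
		if grammarValid(t.id) {
			valid = append(valid, t)
		}
	}
	best = 0
	for i := range valid {
		if valid[i].value > valid[best].value {
			best = i
		}
	}
	return valid[best].id
}

func main() {
	ts := []token{{0, 1.5}, {1, 3.0}, {2, 2.5}}
	// Toy grammar: only even token ids are valid, so the argmax (id 1)
	// misses and the slow path picks id 2.
	even := func(id int32) bool { return id%2 == 0 }
	fmt.Println(sampleWithGrammar(ts, even))
}
```

In the common case where the argmax token is already grammar-valid, the function returns after a single check, which is what makes the fast path pay off.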
## Reasoning
Without temperature clamping, a temperature of 0 would cause division by zero, producing NaN values that propagate through the rest of the sampling chain. The 1e-7 floor effectively makes temperature=0 behave as "greedy" sampling (pick the highest logit) without special-case code.
Max-normalized softmax is a standard technique in numerical computing. Raw logit values can exceed +/- 100, and `exp(100)` would overflow float32. By subtracting the max, the largest exponent becomes `exp(0) = 1`, keeping all values in a safe range while preserving the probability distribution.
Temperature clamping from `sample/transforms.go:29-35`:
```go
func temperature(ts []token, temp float32) {
	// Ensure temperature clipping near 0 to avoid numerical instability
	temp = max(temp, 1e-7)
	for i := range ts {
		ts[i].value = ts[i].value / temp
	}
}
```
Max-normalized softmax from `sample/transforms.go:38-58`:
```go
func softmax(ts []token) {
	// Find max logit for numerical stability
	maxLogit := float32(math.Inf(-1))
	for _, t := range ts {
		if t.value > maxLogit {
			maxLogit = t.value
		}
	}

	// Compute exp(x - max)
	var sum float32
	for i, v := range ts {
		ts[i].value = float32(math.Exp(float64(v.value - maxLogit)))
		sum += ts[i].value
	}

	// exp(x - max) / sum(exp(x - max))
	for i := range ts {
		ts[i].value /= sum
	}
}
```
Parameter clamping from `sample/samplers.go:139-155`:
```go
if temperature < 0.0 {
	temperature = 0.0
}
if topP < 0.0 {
	topP = 0.0
}
if topP >= 1.0 {
	topP = 1.0
}
if minP < 0.0 {
	minP = 0.0
}
if minP >= 1.0 {
	minP = 1.0
}
```