
Heuristic:Alibaba ROLL Numerical Stability Epsilon

From Leeroopedia



Knowledge Sources
Domains Optimization, Debugging, Numerical_Computing
Last Updated 2026-02-07 19:00 GMT

Overview

Systematic use of epsilon values (ranging from 1e-6 down to 1e-24) to prevent division by zero and NaN propagation across loss aggregation, reward normalization, and running statistics.

Description

ROLL uses a hierarchy of epsilon values for numerical stability throughout its computation pipeline. The values are not arbitrary; each level serves a specific purpose. `1e-8` is the standard epsilon for masked operations and loss aggregation, where a zero denominator is possible but unlikely. `1e-6` is used for reward normalization, where variance may legitimately be near zero. `1e-9` is used in divergence calculations, where higher precision is needed. `1e-24` is used as the initial count for running statistics to prevent division by zero at the very first update. The KL divergence `k3` penalty mode clamps to [-10, 10] to prevent extreme exponential values.
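The hierarchy can be summarized as named constants. This is a hedged sketch for reference only; the constant names are illustrative, and ROLL itself uses these literals inline rather than defining them centrally.

```python
# Illustrative names for the epsilon hierarchy described above (not ROLL source).
EPS_MASKED = 1e-8        # masked means, loss aggregation
EPS_REWARD_NORM = 1e-6   # reward/advantage std normalization
EPS_DIVERGENCE = 1e-9    # distribution divergence terms
INIT_COUNT = 1e-24       # initial count for running statistics
KL_K3_CLAMP = 10.0       # clamp the k3 log-ratio to [-10, 10] before exp()

# The ordering reflects how close each denominator can legitimately get to
# zero: reward std can genuinely be near-zero, so it gets the largest epsilon.
assert EPS_REWARD_NORM > EPS_MASKED > EPS_DIVERGENCE > INIT_COUNT
```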

Usage

Apply these epsilon patterns when implementing new loss functions, reward computations, or normalization routines in ROLL. The hierarchy ensures consistent numerical behavior: use `1e-8` for general masked operations, `1e-6` for reward/advantage normalization, and `1e-9` for distribution divergence calculations.
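A minimal sketch of the masked-mean pattern, written in plain Python for clarity (the actual ROLL implementation operates on torch tensors):

```python
# Illustrative masked mean following the 1e-8 convention (not ROLL source).
def masked_mean(values, mask, eps=1e-8):
    """Mean of `values` where mask == 1; eps guards an all-zero mask."""
    num = sum(v * m for v, m in zip(values, mask))
    den = sum(mask) + eps
    return num / den

# With a valid mask, the epsilon bias is negligible:
print(masked_mean([2.0, 4.0, 6.0], [1, 1, 0]))  # ~3.0
# With an all-zero mask, the result is 0.0 instead of a ZeroDivisionError:
print(masked_mean([2.0, 4.0], [0, 0]))  # 0.0
```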

The Insight (Rule of Thumb)

  • Action: Always add epsilon to denominators in masked mean, variance, and normalization operations.
  • Value: `1e-8` for loss/masked ops, `1e-6` for reward normalization, `1e-9` for divergence, `1e-24` for initial count.
  • Trade-off: Larger epsilon values (e.g., 1e-6) provide more safety but introduce tiny bias; smaller values (e.g., 1e-9) are more precise but closer to float precision limits.
  • Clipping: KL divergence k3 penalty must be clamped to [-10, 10] to prevent exp() overflow.
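The clipping rule can be illustrated with a small sketch of the k3 estimator (function and variable names here are illustrative, not from the ROLL source):

```python
import math

# Hedged sketch of the k3 KL penalty: exp(x) - x - 1 grows exponentially
# in the log-ratio x, so x is clamped to [-10, 10] before exponentiation.
def k3_kl(log_ratio, clip=10.0):
    x = max(-clip, min(clip, log_ratio))  # clamp to [-clip, clip]
    return math.exp(x) - x - 1.0

print(k3_kl(0.0))   # 0.0: identical distributions give zero penalty
print(k3_kl(50.0))  # same as k3_kl(10.0): ~2.2e4, not exp(50) ~ 5e21
```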

Reasoning

Division-by-zero errors and NaN propagation are the most common silent failures in RL training. A single NaN in a reward or advantage computation can corrupt an entire training run. The epsilon values are chosen to be small enough not to affect normal computation (rewards typically range from -10 to 10) while large enough to prevent numerical underflow in float32/float16 arithmetic. The `1e-24` initial count for running statistics ensures the very first batch update never divides by a zero count.
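The role of the `1e-24` initial count can be seen in a minimal running-statistics sketch (plain Python, not the ROLL class):

```python
# Illustrative running-mean tracker showing why count starts at 1e-24, not 0.
class RunningStats:
    def __init__(self):
        self.count = 1e-24  # keeps every denominator strictly positive
        self.mean = 0.0

    def update(self, batch):
        batch_count = len(batch)
        batch_mean = sum(batch) / batch_count
        total = self.count + batch_count
        # Welford-style merge; on the first call total is ~batch_count,
        # so the tiny initial count contributes negligible bias.
        self.mean += (batch_mean - self.mean) * batch_count / total
        self.count = total

rs = RunningStats()
rs.update([1.0, 2.0, 3.0])
print(rs.mean)  # ~2.0: the 1e-24 initial count is numerically invisible
```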

Code from `roll/utils/functionals.py:135,250,276,574`:

# Running statistics initialization
self.count = 1e-24

# Masked mean with epsilon
return (tensor * mask).sum() / (mask.sum() + 1e-8)

# Loss aggregation
loss = (seq_losses * weights * valid_samples).sum() / (global_valid_samples + 1e-8)

# Reward normalization
normalized_rewards = (rewards - reward_mean) / (reward_std + 1e-6)
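The reward-normalization pattern above, sketched in plain Python (the ROLL code operates on torch tensors): with the `1e-6` term, a zero-variance batch yields zeros instead of NaN.

```python
import math

# Illustrative reward normalization following the 1e-6 convention (not ROLL source).
def normalize_rewards(rewards, eps=1e-6):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

print(normalize_rewards([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0], no NaN
```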

KL divergence clamping from `roll/utils/functionals.py:190`:

log_ratio = torch.clamp(kld, min=-10, max=10)

Logarithm safety from `roll/pipeline/rlvr/actor_pg_worker.py:426`:

log_ratio = torch.log(ratio + 1e-8)
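The same pattern in plain Python: `math.log(0.0)` raises a domain error where `torch.log` returns `-inf`, but the `1e-8` floor avoids both failure modes.

```python
import math

# Sketch: the 1e-8 floor keeps log() finite when a ratio underflows to zero.
def safe_log(ratio, eps=1e-8):
    return math.log(ratio + eps)

print(safe_log(0.0))  # log(1e-8) ~ -18.42, instead of an error or -inf
```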
