
Heuristic:NVIDIA NeMo-Aligner Higher-Stability Log Probs

From Leeroopedia



Knowledge Sources
Domains Optimization, Numerical_Stability, DPO
Last Updated 2026-02-07 22:00 GMT

Overview

A numerical-stability technique that computes a distributed log-softmax directly instead of taking the log of a softmax, preventing `-inf` log probabilities in DPO and other alignment algorithms.

Description

When computing token-level log probabilities from vocabulary-parallel logits in distributed training, NeMo-Aligner provides two modes: a standard softmax path and a higher-stability log-softmax path. The standard path computes softmax first, then takes the log, which can produce `-inf` values when probabilities are extremely small. The higher-stability path computes log-softmax directly, which is mathematically equivalent but numerically more stable because it avoids the intermediate small-probability values. The trade-off is that the higher-stability path requires more VRAM due to an unavoidable `exp()` operation on the full logits tensor.
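The failure mode is easy to reproduce outside NeMo-Aligner. Below is a minimal NumPy sketch of the two paths in float16 (illustrative only; the real code operates on vocab-parallel tensors across ranks):

```python
import numpy as np

# Logits for two tokens. The gap is large enough that the low-probability
# token's softmax value underflows to zero in float16.
logits = np.array([0.0, -20.0], dtype=np.float16)

# Standard path: softmax first, then log. exp(-20) ~ 2e-9 is below the
# smallest float16 subnormal (~6e-8), so it rounds to 0 and log(0) = -inf.
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
with np.errstate(divide="ignore"):
    log_via_softmax = np.log(probs)

# Higher-stability path: log-softmax computed directly as
# x - max(x) - log(sum(exp(x - max(x)))); the tiny probability is never
# materialized on its own, only added into the denominator sum.
shifted = logits - logits.max()
log_softmax = shifted - np.log(np.exp(shifted).sum())

print(log_via_softmax)  # second entry is -inf
print(log_softmax)      # second entry is -20.0
```

The same arithmetic in float32 would survive `exp(-20)`, but production alignment runs use fp16/bf16 activations, where the underflow window is wide enough to hit real tokens.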

Usage

Use this heuristic when training DPO models, or any model where you observe `-inf` values in log probabilities. DPO is especially susceptible because it computes log-probability ratios between the policy and reference models, where small differences in logits can produce extreme log-probability values. NeMo-Aligner's PPO actor already hard-codes `higher_stability=True`.
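To see why DPO is fragile here, consider what a single `-inf` does to the log-ratio. A toy NumPy example (not NeMo-Aligner code):

```python
import numpy as np

# Per-token log-probs from the policy and the (frozen) reference model.
# Suppose the unstable softmax path produced -inf for the second token
# in both models.
policy_lp = np.array([-0.5, -np.inf])
ref_lp = np.array([-0.6, -np.inf])

# DPO works on the difference of log-probs. -inf minus a finite value is
# still -inf, and -inf minus -inf is nan: either way the loss and its
# gradients for the whole sequence are corrupted.
ratio = policy_lp - ref_lp
print(ratio)  # [0.1 nan] -- the nan propagates into the DPO loss
```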

The Insight (Rule of Thumb)

  • Action: Set `higher_stability=True` when calling `from_parallel_logits_to_logprobs()`.
  • Value: Boolean flag, no tuning needed.
  • Trade-off: Increases VRAM usage (full logits tensor must be exponentiated) in exchange for eliminating `-inf` log probabilities.
  • When Required: DPO training will produce `-inf` logprobs without this flag. PPO already uses it by default.

Reasoning

The root cause is the distributed softmax computation across tensor-parallel ranks. In the standard path, softmax values for tokens with very low probability become zero in float16/bfloat16, and `log(0) = -inf`. The log-softmax path computes `log(exp(x - max(x)) / sum(exp(x - max(x))))` directly, which simplifies to `x - max(x) - log(sum(exp(x - max(x))))`, avoiding the zero-probability intermediate. The extra VRAM cost comes from needing to store both the log-softmax result and its exponentiation for the backward pass.
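The distributed computation can be sketched by simulating tensor-parallel ranks with a list of vocabulary shards. This is a hypothetical stand-in for `_compute_distributed_log_softmax`, which in the real code performs NCCL all-reduces over the tensor-parallel group:

```python
import numpy as np

def distributed_log_softmax(shards):
    """Sketch of a vocab-parallel log-softmax. Each element of `shards`
    is one tensor-parallel rank's slice of the logits; the two
    all-reduces (MAX, then SUM) are modeled as plain max/sum here."""
    # 1) all-reduce(MAX): global max for the usual log-sum-exp shift
    global_max = max(s.max() for s in shards)
    # 2) all-reduce(SUM): each rank exponentiates its shifted shard and
    #    contributes a partial denominator
    denom = sum(np.exp(s - global_max).sum() for s in shards)
    # 3) each rank emits x - max(x) - log(sum(exp(x - max(x)))) locally
    return [s - global_max - np.log(denom) for s in shards]

# Vocabulary of 4 tokens split across two "ranks"
logits = np.array([1.0, 3.0, 0.5, 2.0])
out = np.concatenate(distributed_log_softmax([logits[:2], logits[2:]]))

# Matches the single-device log-softmax
ref = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
print(np.allclose(out, ref))  # True
```

Note that step 2 is the unavoidable `exp()` over the full logits tensor mentioned above: the stability comes from subtracting `log(denom)` rather than dividing by `denom` and logging the quotient.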

Code Evidence

Higher stability mode selection from `nemo_aligner/utils/distributed.py:303-313`:

# higher stability uses a more numerically stable distributed log_softmax instead of softmax
# however, it uses more VRAM because there is an unavoidable exp() OP on the entire logits tensor
# some models (like DPO) will get -inf in the resulting logprobs unless you set higher_stability=True
if higher_stability:
    log_softmax_output = _compute_distributed_log_softmax(vocab_parallel_logits)
    log_probs = log_softmax_output.clone()
    softmax_output = log_softmax_output.exp_()
else:
    softmax_output = _compute_distributed_softmax(vocab_parallel_logits)
    # if we only do inference, then do the log in place
    log_probs = softmax_output.log_() if inference_only else softmax_output.log()

PPO actor always uses higher stability from `nemo_aligner/models/nlp/gpt/megatron_gpt_ppo_actor.py:135`:

curr_log_probs = from_parallel_logits_to_logprobs(
    vocab_parallel_logits=parallel_logits, target=tokens, higher_stability=True
)
