Heuristic: NVIDIA NeMo Aligner PPO Critic Warmup Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, PPO, Training_Stability |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
Training stability technique that pre-trains the PPO critic for N steps before starting policy optimization, preventing critic loss from hockey-sticking due to poor initial value estimates.
Description
In PPO RLHF training, the critic (value function) is typically initialized from the reward model weights. However, the reward model is trained to predict absolute rewards, while the critic needs to predict expected cumulative returns under the current policy. Because of this mismatch, the critic's initial value estimates are poor, leading to inflated and noisy advantage estimates, overly large policy updates, and a runaway "hockey-stick" pattern in the critic loss. Critic warmup addresses this by training the critic for several additional steps on the first batch of rollout data before any policy updates begin, allowing the critic to calibrate its value predictions.
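The mechanism above can be sketched as a toy training loop. This is a minimal illustration, not NeMo Aligner's actual internals; `train_critic`, `train_policy`, and `ppo_iteration` are hypothetical names:

```python
def ppo_iteration(iteration, rollout_batch, critic_warmup_steps,
                  train_critic, train_policy):
    """Run one PPO iteration; warm up the critic on the first one.

    Illustrative sketch: `train_critic` / `train_policy` stand in for
    the real optimizer steps.
    """
    critic_updates = 0
    if iteration == 0:
        # Extra critic-only passes on the first rollout batch: the policy
        # stays frozen while the critic calibrates its value estimates
        # toward the actual returns of the initial policy.
        for _ in range(critic_warmup_steps):
            train_critic(rollout_batch)
            critic_updates += 1
    # Normal joint update: the critic trains once more here, which is why
    # a warmup of N yields N + 1 critic updates on the first iteration.
    train_critic(rollout_batch)
    critic_updates += 1
    train_policy(rollout_batch)
    return critic_updates
```

With `critic_warmup_steps=5`, the first iteration performs 6 critic updates but only 1 policy update; every later iteration performs 1 of each.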
Usage
Use this heuristic when starting PPO training and observing the critic loss diverging or spiking in early iterations. Set `trainer.ppo.critic_warmup_steps` to a small positive integer (e.g., 5-20). Note that setting this to N means the critic will be trained N+1 times on the first iteration.
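In config form, the setting sits under the `trainer.ppo.*` path named above (the exact surrounding keys in your config may differ; the nesting here is a minimal fragment):

```yaml
trainer:
  ppo:
    # 10 warmup steps: the critic trains 11 times on the first iteration.
    critic_warmup_steps: 10
```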
The Insight (Rule of Thumb)
- Action: Set `trainer.ppo.critic_warmup_steps` to 5-20 in the PPO config.
- Value: Default is `0` (no warmup). Common values: 5-20 steps.
- Trade-off: Slightly longer first iteration in exchange for more stable critic loss trajectory throughout training.
- Note: Setting to N means the critic trains N+1 times on the first iteration (N warmup + 1 normal).
Reasoning
The critic is initialized from the reward model, which predicts single-step rewards. The critic in PPO needs to predict the discounted sum of future rewards (returns), which is a fundamentally different target. Without warmup, the critic produces inaccurate value baselines, causing the advantage estimates to be noisy and large. This leads to overly aggressive policy updates that destabilize training. A few warmup steps allow the critic to adjust its value predictions to the actual return distribution before the policy starts changing.
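A small numeric example makes the target mismatch concrete. This is a toy illustration, not NeMo code: with a sparse reward at the end of a rollout, the per-step reward the reward model predicts at `t=0` is 0, while the discounted return the critic must predict is close to 1:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1}, right to left."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [0.0, 0.0, 0.0, 1.0]   # sparse reward at the end of a rollout
returns = discounted_returns(rewards)
# A reward-model-initialized critic predicts ~0 at t=0, but the true
# value target is gamma**3 ~= 0.97, so early advantages are large until
# the critic's predictions are recalibrated by warmup.
```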
Code Evidence
Critic warmup configuration from `examples/nlp/gpt/conf/gpt_ppo_actor.yaml:12-18`:
```yaml
ppo:
  # How many warmup steps we train the critic for (without training the policy).
  # This may help prevent the critic loss from hockey-sticking, since the critic
  # is initialized from the reward model and may not initially be good at
  # estimating the returns of the policy.
  # NOTE: setting this to N means the critic will be trained N + 1 times on the
  # first iteration.
  critic_warmup_steps: 0
```