Principle:NVIDIA NeMo Aligner PPO Training
| Principle: PPO Training | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Reinforcement_Learning, NLP |
| Related | Implementation:NVIDIA_NeMo_Aligner_PPOTrainer_Fit |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Proximal Policy Optimization training loop for aligning language models with human preferences using online reinforcement learning.
Description
PPO training alternates between two phases:
(1) Rollout Phase:
- The actor generates responses to prompts from the training dataset.
- The reward model scores each response with a scalar reward, and the critic produces value estimates (both are typically served from the critic/RM server).
- Advantages are computed using Generalized Advantage Estimation (GAE).
(2) Optimization Phase:
- The actor is updated using the PPO clipped surrogate objective.
- The critic is updated to better predict future rewards.
- Multiple PPO epochs are run on the same batch of collected rollout data.
The training loop manages:
- Distributed rollout generation across model-parallel and data-parallel ranks
- Multi-epoch PPO updates on collected data for improved sample efficiency
- KL penalty between current and reference policies to prevent reward hacking
- Coordinated training of both actor and critic across separate processes
Usage
Use for RLHF alignment with human preferences when you have a trained reward model and want online policy optimization.
- PPO provides better sample efficiency than REINFORCE through the critic's value estimates.
- Requires more infrastructure than REINFORCE: a running critic server and a reward model (often co-located in the critic process).
- Prefer DPO for simpler setups that do not need online generation.
Theoretical Basis
The full PPO objective (maximized) combines the clipped surrogate term, the value loss, and an entropy bonus:
L = E[ min( r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t ) ]
    - c1 * L^VF
    + c2 * S[pi_theta]
where:
r_t(theta) = pi_theta(a_t|s_t) / pi_old(a_t|s_t)
L^VF = value function loss (critic update)
S[pi_theta] = entropy bonus
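The clipped surrogate term can be sketched in a few lines of NumPy (a minimal illustration, not NeMo Aligner's actual implementation; the function name and signature are assumptions):

```python
import numpy as np

def ppo_clipped_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """PPO clipped surrogate objective, negated for minimization.

    log_probs / old_log_probs: per-token log pi(a_t|s_t) under the
    current and rollout-time policies; advantages: A_t estimates.
    """
    ratio = np.exp(log_probs - old_log_probs)      # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Pessimistic bound: take the element-wise minimum, then average.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the current policy equals the rollout policy, the ratio is 1 and the loss reduces to the negated mean advantage; large ratios are clipped so a single update cannot move the policy too far.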
Generalized Advantage Estimation (GAE):
A_t = sum over l >= 0 of (gamma * lambda)^l * delta_{t+l}
where:
delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
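The GAE sum is usually computed with a backward recursion, A_t = delta_t + gamma * lambda * A_{t+1}. A minimal NumPy sketch (illustrative only, assuming a terminal state with zero bootstrap value):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over a single trajectory.

    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t     = delta_t + gamma * lam * A_{t+1}   (A_T = 0)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0  # zero beyond terminal
        delta = rewards[t] + gamma * next_value - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting lam=1 recovers Monte Carlo returns minus the baseline; lam=0 reduces to one-step TD errors.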
KL penalty applied to rewards:
total_reward = reward - beta * KL(pi_theta || pi_ref)
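In practice the KL term is often estimated per token from the log-probability difference between the actor and the frozen reference policy. A hedged sketch (the sample-based estimator shown here is one common choice, not necessarily the one NeMo Aligner uses):

```python
import numpy as np

def kl_adjusted_rewards(rewards, actor_log_probs, ref_log_probs, beta=0.02):
    """Subtract a per-token KL estimate, scaled by beta, from the rewards.

    log pi_theta(a|s) - log pi_ref(a|s) is a sample-based estimate of
    KL(pi_theta || pi_ref) at the sampled tokens.
    """
    kl_estimate = actor_log_probs - ref_log_probs
    return rewards - beta * kl_estimate
```

Larger beta keeps the policy closer to the reference model, trading reward for stability and reduced reward hacking.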
Pseudo-code
FUNCTION ppo_training_loop(actor, critic_client, ref_policy, dataloader, config):
    FOR each training step:
        # --- Rollout Phase ---
        prompts = sample_batch(dataloader)
        responses = actor.generate(prompts)
        values, rewards = critic_client.infer(prompts, responses)
        actor_log_probs = compute_log_probs(actor, prompts, responses)
        ref_log_probs = compute_log_probs(ref_policy, prompts, responses)
        kl_penalty = compute_kl(actor_log_probs, ref_log_probs)
        adjusted_rewards = rewards - beta * kl_penalty
        advantages = compute_gae(adjusted_rewards, values, gamma, lambda)
        returns = advantages + values

        # --- Optimization Phase ---
        FOR each ppo_epoch in range(num_ppo_epochs):
            actor_loss = compute_ppo_clipped_loss(actor, responses, advantages)
            update_actor(actor, actor_loss)
        critic_client.train(returns, advantages)
    RETURN actor
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_PPOTrainer_Fit
- Heuristic:NVIDIA_NeMo_Aligner_Higher_Stability_Log_Probs
- Heuristic:NVIDIA_NeMo_Aligner_Adam_State_Offloading_Tip
- Heuristic:NVIDIA_NeMo_Aligner_PPO_NCCL_Algorithm_Setting
- Heuristic:NVIDIA_NeMo_Aligner_PPO_Critic_Warmup_Tip