
Principle:NVIDIA NeMo Aligner PPO Training

From Leeroopedia


Principle: PPO Training
Type Principle
Project NVIDIA NeMo Aligner
Domains Reinforcement_Learning, NLP
Related Implementation:NVIDIA_NeMo_Aligner_PPOTrainer_Fit
Last Updated 2026-02-07 00:00 GMT

Overview

Proximal Policy Optimization training loop for aligning language models with human preferences using online reinforcement learning.

Description

PPO training alternates between two phases:

(1) Rollout Phase:

  • The actor generates responses to prompts from the training dataset.
  • The critic/reward-model server returns a scalar reward and per-token value estimates for each response.
  • Advantages are computed using Generalized Advantage Estimation (GAE).

(2) Optimization Phase:

  • The actor is updated using the PPO clipped surrogate objective.
  • The critic is updated to better predict future rewards.
  • Multiple PPO epochs are run on the same batch of collected rollout data.

The training loop manages:

  • Distributed rollout generation across model-parallel and data-parallel ranks
  • Multi-epoch PPO updates on collected data for improved sample efficiency
  • KL penalty between current and reference policies to prevent reward hacking
  • Coordinated training of both actor and critic across separate processes

Usage

Use for RLHF alignment with human preferences when you have a trained reward model and want online policy optimization.

  • PPO yields lower-variance policy gradients than REINFORCE through the critic's learned value baseline, improving sample efficiency.
  • It requires more infrastructure than REINFORCE: a running critic server and a reward model, often co-located in the same critic process.
  • Prefer DPO for simpler setups that do not need online generation.

Theoretical Basis

The PPO clipped surrogate objective:

L^CLIP(theta) = E_t[ min( r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t ) ]

The full PPO objective (maximized) combines this with the value loss and an entropy bonus:

L(theta) = E_t[ L^CLIP_t - c1 * L^VF_t + c2 * S[pi_theta](s_t) ]

where:
  r_t(theta) = pi_theta(a_t|s_t) / pi_old(a_t|s_t)
  L^VF = value function loss (critic update), subtracted because it is minimized
  S[pi_theta] = entropy bonus
  epsilon = clip range; c1, c2 = loss-term coefficients
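The clipped surrogate term above can be sketched in a few lines of numpy. This is an illustrative standalone function, not NeMo Aligner's actual loss (which lives inside its Megatron-based actor); the function name and signature are assumptions. The loss is negated so a minimizer maximizes the objective.

```python
import numpy as np

def ppo_clipped_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negated PPO clipped surrogate objective (hypothetical sketch).

    log_probs / old_log_probs: per-token log-probabilities under the
    current and rollout-time policies; advantages: GAE estimates A_t.
    """
    ratio = np.exp(log_probs - old_log_probs)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Take the pessimistic (min) term, then negate for gradient descent.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the policy has not moved (ratio = 1), the loss reduces to the negated mean advantage; once the ratio leaves [1 - epsilon, 1 + epsilon], the clipped branch caps the incentive to push further.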

Generalized Advantage Estimation (GAE):

A_t = sum_{l=0}^{infinity} (gamma * lambda)^l * delta_{t+l}

where:
  delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)   (the TD residual)
  gamma = discount factor; lambda in [0, 1] trades bias against variance
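The GAE recursion is usually evaluated backwards over a trajectory, since A_t = delta_t + gamma * lambda * A_{t+1}. A minimal sketch (assumed helper name, not NeMo Aligner's internal implementation; `values` carries one extra bootstrap entry V(s_T)):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (sketch).

    rewards: shape (T,); values: shape (T + 1,) including the
    bootstrap value V(s_T) for the state after the last step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                          # discounted sum
        advantages[t] = gae
    return advantages
```

Returns for the critic update then follow as `returns = advantages + values[:-1]`.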

KL penalty applied to rewards:

total_reward = reward - beta * KL(pi_theta || pi_ref)

Pseudo-code

FUNCTION ppo_training_loop(actor, critic_client, ref_policy, dataloader, config):
    FOR each training step:
        # --- Rollout Phase ---
        prompts = sample_batch(dataloader)
        responses = actor.generate(prompts)
        actor_log_probs = compute_log_probs(actor, prompts, responses)
        ref_log_probs = compute_log_probs(ref_policy, prompts, responses)
        values, rewards = critic_client.infer(prompts, responses)
        kl_penalty = compute_kl(actor_log_probs, ref_log_probs)
        adjusted_rewards = rewards - beta * kl_penalty
        advantages = compute_gae(adjusted_rewards, values, gamma, lambda)
        returns = advantages + values

        # --- Optimization Phase ---
        FOR each ppo_epoch in range(num_ppo_epochs):
            # actor_log_probs from the rollout serve as the "old" log probs
            actor_loss = compute_ppo_clipped_loss(actor, responses, advantages, actor_log_probs)
            update_actor(actor, actor_loss)

        # critic regresses its value predictions toward the returns
        critic_client.train(prompts, responses, returns)

    RETURN actor
