
Principle:NVIDIA NeMo Aligner PPO Training

From Leeroopedia


Principle: PPO Training
Type Principle
Project NVIDIA NeMo Aligner
Domains Reinforcement_Learning, NLP
Related Implementation:NVIDIA_NeMo_Aligner_PPOTrainer_Fit
Last Updated 2026-02-07 00:00 GMT

Overview

Proximal Policy Optimization training loop for aligning language models with human preferences using online reinforcement learning.

Description

PPO training alternates between two phases:

(1) Rollout Phase:

  • The actor generates responses to prompts from the training dataset.
  • The critic/reward-model server returns a scalar reward and per-token value estimates for each response.
  • Advantages are computed using Generalized Advantage Estimation (GAE).

(2) Optimization Phase:

  • The actor is updated using the PPO clipped surrogate objective.
  • The critic is updated to better predict future rewards.
  • Multiple PPO epochs are run on the same batch of collected rollout data.

The training loop manages:

  • Distributed rollout generation across model-parallel and data-parallel ranks
  • Multi-epoch PPO updates on collected data for improved sample efficiency
  • KL penalty between current and reference policies to prevent reward hacking
  • Coordinated training of both actor and critic across separate processes

Usage

Use for RLHF alignment with human preferences when you have a trained reward model and want online policy optimization.

  • PPO yields lower-variance policy gradients than REINFORCE through the critic's learned value baseline, improving sample efficiency.
  • It requires more infrastructure than REINFORCE: a running critic server and a reward model, often co-located in the same critic process.
  • Prefer DPO for simpler setups that do not need online generation.

Theoretical Basis

The PPO clipped surrogate objective:

L^CLIP(theta) = E_t[ min( r_t(theta) * A_t, clip(r_t(theta), 1 - epsilon, 1 + epsilon) * A_t ) ]

The full PPO objective (maximized) combines this with the value loss and an entropy bonus:

L(theta) = E_t[ L^CLIP_t - c1 * L^VF_t + c2 * S[pi_theta](s_t) ]

where:
  r_t(theta) = pi_theta(a_t|s_t) / pi_old(a_t|s_t)
  L^VF = value function loss (critic update), subtracted because it is minimized
  S[pi_theta] = entropy bonus
  epsilon = clip range; c1, c2 = loss-term coefficients
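The clipped surrogate term above can be sketched in a few lines of numpy. This is an illustrative standalone function, not NeMo Aligner's actual loss (which lives inside its Megatron-based actor); the function name and signature are assumptions. The loss is negated so a minimizer maximizes the objective.

```python
import numpy as np

def ppo_clipped_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negated PPO clipped surrogate objective (hypothetical sketch).

    log_probs / old_log_probs: per-token log-probabilities under the
    current and rollout-time policies; advantages: GAE estimates A_t.
    """
    ratio = np.exp(log_probs - old_log_probs)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Take the pessimistic (min) term, then negate for gradient descent.
    return -np.mean(np.minimum(unclipped, clipped))
```

When the policy has not moved (ratio = 1), the loss reduces to the negated mean advantage; once the ratio leaves [1 - epsilon, 1 + epsilon], the clipped branch caps the incentive to push further.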

Generalized Advantage Estimation (GAE):

A_t = sum_{l=0}^{infinity} (gamma * lambda)^l * delta_{t+l}

where:
  delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)   (the TD residual)
  gamma = discount factor; lambda in [0, 1] trades bias against variance
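The GAE recursion is usually evaluated backwards over a trajectory, since A_t = delta_t + gamma * lambda * A_{t+1}. A minimal sketch (assumed helper name, not NeMo Aligner's internal implementation; `values` carries one extra bootstrap entry V(s_T)):

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (sketch).

    rewards: shape (T,); values: shape (T + 1,) including the
    bootstrap value V(s_T) for the state after the last step.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        gae = delta + gamma * lam * gae                          # discounted sum
        advantages[t] = gae
    return advantages
```

Returns for the critic update then follow as `returns = advantages + values[:-1]`.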

KL penalty applied to rewards:

total_reward = reward - beta * KL(pi_theta || pi_ref)

Pseudo-code

FUNCTION ppo_training_loop(actor, critic_client, ref_policy, dataloader, config):
    FOR each training step:
        # --- Rollout Phase ---
        prompts = sample_batch(dataloader)
        responses = actor.generate(prompts)
        actor_log_probs = compute_log_probs(actor, prompts, responses)
        ref_log_probs = compute_log_probs(ref_policy, prompts, responses)
        values, rewards = critic_client.infer(prompts, responses)
        kl_penalty = compute_kl(actor_log_probs, ref_log_probs)
        adjusted_rewards = rewards - beta * kl_penalty
        advantages = compute_gae(adjusted_rewards, values, gamma, lambda)
        returns = advantages + values

        # --- Optimization Phase ---
        FOR each ppo_epoch in range(num_ppo_epochs):
            # actor_log_probs from the rollout serve as the "old" log probs
            actor_loss = compute_ppo_clipped_loss(actor, responses, advantages, actor_log_probs)
            update_actor(actor, actor_loss)

        # critic regresses its value predictions toward the returns
        critic_client.train(prompts, responses, returns)

    RETURN actor
