Principle:NVIDIA NeMo Aligner PPO Actor Critic Setup
| Principle: PPO Actor Critic Setup | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Reinforcement_Learning, NLP |
| Related | Implementation:NVIDIA_NeMo_Aligner_MegatronGPT_Actor_And_Critic_Client |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Initialization of the actor model and remote critic client for distributed PPO training.
Description
PPO training requires an actor model (the policy being optimized) and a client to communicate with the remote critic/reward model server. The actor model extends a pretrained GPT model with PPO-specific capabilities:
- Response generation -- sampling completions given prompts
- Log-probability computation -- computing log pi_theta(y|x) for the policy-gradient update
- Entropy calculation -- for the entropy bonus term in the PPO objective
- PPO clipped ratio loss -- the core optimization objective
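The log-probability and entropy items above can be sketched as follows. This is a pure-Python stand-in for what the real actor does with batched tensor ops (e.g. a log-softmax over the vocabulary); the function names are illustrative, not NeMo Aligner API.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a vocabulary-sized list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def response_logprob_and_entropy(logits_per_step, token_ids):
    """Sum log pi_theta(y|x) over response tokens; average per-step entropy.

    logits_per_step: one logits list per generated token position.
    token_ids: the sampled token at each position.
    """
    total_logp, entropies = 0.0, []
    for logits, tok in zip(logits_per_step, token_ids):
        logps = log_softmax(logits)
        total_logp += logps[tok]                       # log pi_theta(y_t | x, y_<t)
        entropies.append(-sum(math.exp(lp) * lp for lp in logps))
    return total_logp, sum(entropies) / len(entropies)
```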
The remote client abstracts HTTP communication with the critic server, providing methods for:
- Inference -- get value estimates and rewards for generated sequences
- Training -- send computed returns and advantages for critic weight updates
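A minimal sketch of such a client, using only the standard library. The class name, payload schema, and endpoint paths ("infer", "train", "save") are assumptions that mirror the pseudo-code later in this page; the real NeMo Aligner client and its wire format may differ.

```python
import json
import urllib.request

class CriticClient:
    """Hypothetical HTTP client for a remote critic/reward model server."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _build_payload(self, **fields):
        # Serialize request fields to JSON; tensors would be
        # converted to nested lists before this point.
        return json.dumps(fields).encode("utf-8")

    def _post(self, endpoint, payload):
        req = urllib.request.Request(
            f"{self.base_url}/{endpoint}",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def infer(self, tokens):
        # Request value estimates and rewards for generated sequences.
        return self._post("infer", self._build_payload(tokens=tokens))

    def train(self, tokens, returns):
        # Send computed returns so the server updates critic weights.
        return self._post("train", self._build_payload(tokens=tokens, returns=returns))
```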
A reference policy (frozen copy of initial weights) is maintained for KL divergence computation. The reference weights are stored in CPU memory to avoid doubling GPU memory usage.
Usage
Use when setting up PPO training. The actor is loaded from a pretrained checkpoint and the critic client connects to a running critic server via HTTP.
- The reference policy weights are stored in CPU memory for KL penalty computation during rollouts.
- The actor must be initialized after the critic server is running and reachable.
- Supports both full-parameter training and PEFT/LoRA configurations.
Theoretical Basis
PPO optimizes the clipped surrogate objective:
L^CLIP = E[ min( r(theta) * A, clip(r(theta), 1 - epsilon, 1 + epsilon) * A ) ]
where:
r(theta) = pi_theta(a|s) / pi_old(a|s) -- probability ratio
A = advantage estimate from GAE
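The clipped surrogate above can be written out per token in plain Python (the real implementation operates on batched tensors; the loss is negated because optimizers minimize):

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Mean negated clipped surrogate over a list of tokens.

    logp_new / logp_old: per-token log-probs under pi_theta and pi_old.
    eps=0.2 is a common choice for the clip range, not a fixed default.
    """
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        r = math.exp(ln - lo)                      # probability ratio r(theta)
        unclipped = r * a
        clipped = max(min(r, 1.0 + eps), 1.0 - eps) * a
        losses.append(-min(unclipped, clipped))    # negate: maximize objective
    return sum(losses) / len(losses)
```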
The actor computes pi_theta (current log probabilities), the reference policy provides pi_ref for the KL penalty term, and the critic provides value estimates V(s) for advantage computation:
total_reward = reward - beta * KL(pi_theta || pi_ref)
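The KL-penalized reward can be sketched with the simple per-token estimator KL ~= sum(log pi_theta - log pi_ref); the beta default below is illustrative, not a NeMo Aligner value.

```python
def kl_penalized_reward(reward, logp_actor, logp_ref, beta=0.02):
    """total_reward = reward - beta * KL(pi_theta || pi_ref), per the formula above."""
    # Per-token log-ratio estimator of the KL divergence.
    kl = sum(a - r for a, r in zip(logp_actor, logp_ref))
    return reward - beta * kl
```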
Pseudo-code
FUNCTION setup_ppo_actor_and_critic_client(pretrained_model, critic_server_url):
    # Load the actor model with PPO capabilities
    actor = load_pretrained_model(pretrained_model)
    actor = extend_with_ppo_head(actor)
    # Store reference policy weights on CPU
    reference_policy = copy_state_dict_to_cpu(actor)
    # Connect to remote critic server
    critic_client = create_http_client(critic_server_url)
    critic_client.register_endpoints(["infer", "train", "save"])
    RETURN actor, critic_client, reference_policy
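The pseudo-code can be mirrored in plain Python. The dict-based "actor" and client record are stand-ins for framework objects, and deepcopy stands in for the CPU offload of the reference weights (the real code moves each tensor to host memory); the point illustrated is that later actor updates leave the frozen reference untouched.

```python
import copy

def setup_ppo_actor_and_critic_client(pretrained_state, critic_server_url):
    # Stand-in "actor": a dict of weights plus PPO metadata.
    actor = {"weights": dict(pretrained_state), "ppo_head": True}
    # Frozen reference policy: a deep copy, so subsequent in-place
    # updates to the actor's weights do not change it.
    reference_policy = copy.deepcopy(actor["weights"])
    # Hypothetical client record; the real client registers HTTP endpoints.
    critic_client = {"url": critic_server_url,
                     "endpoints": ["infer", "train", "save"]}
    return actor, critic_client, reference_policy
```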
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_MegatronGPT_Actor_And_Critic_Client
- Heuristic:NVIDIA_NeMo_Aligner_PPO_Critic_Warmup_Tip