Principle:NVIDIA NeMo Aligner PPO Actor Critic Setup
| Principle: PPO Actor Critic Setup | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Reinforcement_Learning, NLP |
| Related | Implementation:NVIDIA_NeMo_Aligner_MegatronGPT_Actor_And_Critic_Client |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Initialization of the actor model and remote critic client for distributed PPO training.
Description
PPO training requires an actor model (the policy being optimized) and a client to communicate with the remote critic/reward model server. The actor model extends a pretrained GPT model with PPO-specific capabilities:
- Response generation -- sampling completions given prompts
- Log-probability computation -- computing log pi_theta(y|x) for the policy-gradient update
- Entropy calculation -- for the entropy bonus term in the PPO objective
- PPO clipped ratio loss -- the core optimization objective
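The log-probability and entropy items above can be sketched as follows. This is a pure-Python stand-in for what the real actor does with batched tensor ops (e.g. a log-softmax over the vocabulary); the function names are illustrative, not NeMo Aligner API.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a vocabulary-sized list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def response_logprob_and_entropy(logits_per_step, token_ids):
    """Sum log pi_theta(y|x) over response tokens; average per-step entropy.

    logits_per_step: one logits list per generated token position.
    token_ids: the sampled token at each position.
    """
    total_logp, entropies = 0.0, []
    for logits, tok in zip(logits_per_step, token_ids):
        logps = log_softmax(logits)
        total_logp += logps[tok]                       # log pi_theta(y_t | x, y_<t)
        entropies.append(-sum(math.exp(lp) * lp for lp in logps))
    return total_logp, sum(entropies) / len(entropies)
```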
The remote client abstracts HTTP communication with the critic server, providing methods for:
- Inference -- get value estimates and rewards for generated sequences
- Training -- send computed returns and advantages for critic weight updates
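A minimal sketch of such a client, using only the standard library. The class name, payload schema, and endpoint paths ("infer", "train", "save") are assumptions that mirror the pseudo-code later in this page; the real NeMo Aligner client and its wire format may differ.

```python
import json
import urllib.request

class CriticClient:
    """Hypothetical HTTP client for a remote critic/reward model server."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def _build_payload(self, **fields):
        # Serialize request fields to JSON; tensors would be
        # converted to nested lists before this point.
        return json.dumps(fields).encode("utf-8")

    def _post(self, endpoint, payload):
        req = urllib.request.Request(
            f"{self.base_url}/{endpoint}",
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def infer(self, tokens):
        # Request value estimates and rewards for generated sequences.
        return self._post("infer", self._build_payload(tokens=tokens))

    def train(self, tokens, returns):
        # Send computed returns so the server updates critic weights.
        return self._post("train", self._build_payload(tokens=tokens, returns=returns))
```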
A reference policy (frozen copy of initial weights) is maintained for KL divergence computation. The reference weights are stored in CPU memory to avoid doubling GPU memory usage.
Usage
Use when setting up PPO training. The actor is loaded from a pretrained checkpoint and the critic client connects to a running critic server via HTTP.
- The reference policy weights are stored in CPU memory for KL penalty computation during rollouts.
- The actor must be initialized after the critic server is running and reachable.
- Supports both full-parameter training and PEFT/LoRA configurations.
Theoretical Basis
PPO optimizes the clipped surrogate objective:
L^CLIP = E[ min( r(theta) * A, clip(r(theta), 1 - epsilon, 1 + epsilon) * A ) ]
where:
r(theta) = pi_theta(a|s) / pi_old(a|s) -- probability ratio
A = advantage estimate from GAE
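The clipped surrogate above can be written out per token in plain Python (the real implementation operates on batched tensors; the loss is negated because optimizers minimize):

```python
import math

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """Mean negated clipped surrogate over a list of tokens.

    logp_new / logp_old: per-token log-probs under pi_theta and pi_old.
    eps=0.2 is a common choice for the clip range, not a fixed default.
    """
    losses = []
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        r = math.exp(ln - lo)                      # probability ratio r(theta)
        unclipped = r * a
        clipped = max(min(r, 1.0 + eps), 1.0 - eps) * a
        losses.append(-min(unclipped, clipped))    # negate: maximize objective
    return sum(losses) / len(losses)
```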
The actor computes pi_theta (current log probabilities), the reference policy provides pi_ref for the KL penalty term, and the critic provides value estimates V(s) for advantage computation:
total_reward = reward - beta * KL(pi_theta || pi_ref)
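The KL-penalized reward can be sketched with the simple per-token estimator KL ~= sum(log pi_theta - log pi_ref); the beta default below is illustrative, not a NeMo Aligner value.

```python
def kl_penalized_reward(reward, logp_actor, logp_ref, beta=0.02):
    """total_reward = reward - beta * KL(pi_theta || pi_ref), per the formula above."""
    # Per-token log-ratio estimator of the KL divergence.
    kl = sum(a - r for a, r in zip(logp_actor, logp_ref))
    return reward - beta * kl
```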
Pseudo-code
FUNCTION setup_ppo_actor_and_critic_client(pretrained_model, critic_server_url):
    # Load the actor model with PPO capabilities
    actor = load_pretrained_model(pretrained_model)
    actor = extend_with_ppo_head(actor)
    # Store reference policy weights on CPU
    reference_policy = copy_state_dict_to_cpu(actor)
    # Connect to remote critic server
    critic_client = create_http_client(critic_server_url)
    critic_client.register_endpoints(["infer", "train", "save"])
    RETURN actor, critic_client, reference_policy
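The pseudo-code can be mirrored in plain Python. The dict-based "actor" and client record are stand-ins for framework objects, and deepcopy stands in for the CPU offload of the reference weights (the real code moves each tensor to host memory); the point illustrated is that later actor updates leave the frozen reference untouched.

```python
import copy

def setup_ppo_actor_and_critic_client(pretrained_state, critic_server_url):
    # Stand-in "actor": a dict of weights plus PPO metadata.
    actor = {"weights": dict(pretrained_state), "ppo_head": True}
    # Frozen reference policy: a deep copy, so subsequent in-place
    # updates to the actor's weights do not change it.
    reference_policy = copy.deepcopy(actor["weights"])
    # Hypothetical client record; the real client registers HTTP endpoints.
    critic_client = {"url": critic_server_url,
                     "endpoints": ["infer", "train", "save"]}
    return actor, critic_client, reference_policy
```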
Related Pages
- Implementation:NVIDIA_NeMo_Aligner_MegatronGPT_Actor_And_Critic_Client
- Heuristic:NVIDIA_NeMo_Aligner_PPO_Critic_Warmup_Tip