Workflow:NVIDIA NeMo Aligner RLHF PPO Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, Model_Alignment, Reinforcement_Learning |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
End-to-end Reinforcement Learning from Human Feedback (RLHF) pipeline using Proximal Policy Optimization (PPO) with a distributed multi-process architecture spanning actor, critic, reward model, and reference policy.
Description
This workflow implements the complete PPO-based RLHF training pipeline for aligning language models. It orchestrates four conceptual models across two separate processes: (1) the PPO Actor (policy being trained) co-located with the Initial Policy (reference model for KL divergence), and (2) the PPO Critic (value network) co-located with the Reward Model. Communication between the two processes happens via PyTriton HTTP servers. The training loop alternates between a rollout phase (generating responses and collecting rewards) and an optimization phase (updating the actor using PPO with clipped surrogate loss, advantage estimation via GAE, and KL penalty against the reference policy). Optional TensorRT-LLM acceleration can significantly speed up the generation phase.
Key outputs:
- An RLHF-aligned actor model checkpoint
- Training metrics including rewards, KL divergence, and PPO loss components
Scope:
- From a trained SFT model, a trained reward model, and prompt data to a policy-optimized aligned model
Usage
Execute this workflow after completing both SFT training and reward model training. You need an SFT-trained model (to initialize the actor and the reference policy), a trained reward model (to initialize the critic and provide the reward signal), and a dataset of prompts (without responses). PPO-based RLHF gives fine-grained control over the alignment process, but it is also complex to operate: it coordinates four models across two processes. It is suitable when you need that control and have sufficient compute resources to run multi-process training.
Execution Steps
Step 1: Prepare prompt dataset
Format the RLHF training data as a JSONL file containing prompts only (no responses). The actor model will generate responses during the rollout phase. Prompts should follow the same template format used during SFT training. Create separate train and validation prompt files.
Key considerations:
- Only prompts are needed; responses are generated by the actor during rollouts
- The prompt template must match the one used during SFT
- The Anthropic-HH-RLHF dataset is a common starting point
- Data is processed using build_train_valid_test_rlhf_datasets
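The prompt-only format described above can be sketched as follows. This is a minimal illustration, not the project's own preprocessing script: the `text` field name, the `PROMPT_TEMPLATE` string, and the helper names are assumptions made for this example, so substitute the key your dataset loader expects and the exact template used during SFT.

```python
import json
from pathlib import Path

# Hypothetical template: replace with the exact template used during SFT.
PROMPT_TEMPLATE = "User: {question}\nAssistant:"


def prompt_records(questions, key="text"):
    """Build one record per prompt. No response field is included."""
    return [{key: PROMPT_TEMPLATE.format(question=q)} for q in questions]


def to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def split_train_valid(items, valid_fraction=0.1):
    """Deterministic split: the last `valid_fraction` of items is validation."""
    n_valid = max(1, int(len(items) * valid_fraction))
    return items[:-n_valid], items[-n_valid:]


def write_jsonl(records, path):
    """Write records to `path` as UTF-8 JSONL and return the path."""
    path = Path(path)
    path.write_text(to_jsonl(records), encoding="utf-8")
    return path
```

A typical use is to build the records once, split them, and write `train` and `validation` files separately so both share the same template.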
Step 2: Launch reward model and critic server
Start the combined reward model and critic server process using serve_ppo_critic.py. This process loads the trained reward model checkpoint, initializes the critic network from the reward model weights, and exposes both as PyTriton HTTP endpoints. The critic provides per-token value estimates while the reward model provides sequence-level rewards.
What happens:
- The trained reward model is loaded and the critic is initialized from it
- A PyTriton server is started exposing inference endpoints
- CPU weight offloading is used to manage memory when combining RM and critic
- The server waits for HTTP requests from the actor process
Step 3: Launch actor and reference policy training
Start the PPO actor training process using train_gpt_ppo_actor.py. This process loads the SFT model as the actor, saves a copy of the initial weights as the reference policy, builds the prompt dataloader, creates a RemoteGPTRMCriticClient to communicate with the critic/RM server, and initializes the PPOTrainer.
What happens:
- The SFT model is loaded as the trainable PPO actor
- The initial policy state dict is saved for KL divergence computation
- A remote client connects to the critic/RM server via HTTP
- PEFT/LoRA can be applied for memory-efficient training
Step 4: Execute PPO training loop
The PPO training loop alternates between rollout and optimization phases. In the rollout phase, the actor generates responses to prompts, the critic provides value estimates, and the reward model scores the responses. In the optimization phase, Generalized Advantage Estimation (GAE) computes advantages, and the actor is updated using the PPO clipped surrogate objective with entropy bonus and KL penalty.
Rollout phase:
- Actor generates responses using sampling parameters
- Generated responses are sent to the critic/RM server for scoring
- The reward model returns sequence-level rewards
- The critic returns per-token value estimates
- Log probabilities are computed for the generated tokens
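The last rollout step, computing log probabilities for the generated tokens, can be illustrated with a small pure-Python sketch. The `Rollout` container and the function names are hypothetical and exist only for this example; a real implementation would operate on batched tensors rather than Python lists.

```python
import math
from dataclasses import dataclass
from typing import List


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_log_probs(logits_per_step, token_ids):
    """Log-probability of each generated token under the logits at its step."""
    return [log_softmax(logits)[tok]
            for logits, tok in zip(logits_per_step, token_ids)]


@dataclass
class Rollout:
    """Hypothetical record of what one rollout collects for one response."""
    token_ids: List[int]      # tokens sampled by the actor
    log_probs: List[float]    # actor log-probabilities of those tokens
    values: List[float]       # per-token value estimates from the critic
    reward: float             # sequence-level score from the reward model


def build_rollout(token_ids, logits_per_step, values, reward):
    """Assemble a Rollout, checking that per-token lists line up."""
    if not (len(token_ids) == len(logits_per_step) == len(values)):
        raise ValueError("token_ids, logits and values must have equal length")
    return Rollout(list(token_ids),
                   token_log_probs(logits_per_step, token_ids),
                   list(values), float(reward))
```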
Optimization phase:
- GAE computes advantages using rewards and value estimates
- The PPO loss clips the policy ratio to prevent large updates
- KL divergence penalty keeps the actor close to the reference policy
- The critic is updated so that its value estimates better predict the returns (future rewards) observed in the rollouts
- Gradients are synchronized across distributed workers
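The core arithmetic of the optimization phase can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the trainer's actual code: the KL penalty is folded into the per-token rewards (one common formulation; where this workflow applies it is not specified here), the value after the final token is taken to be zero, the entropy bonus is omitted, and all function names and default coefficients are hypothetical.

```python
import math


def kl_shaped_rewards(log_probs, ref_log_probs, sequence_reward, kl_coef=0.1):
    """Per-token rewards: a KL penalty at every token, plus the
    sequence-level reward added on the final token (assumed formulation)."""
    rewards = [-kl_coef * (lp - ref) for lp, ref in zip(log_probs, ref_log_probs)]
    rewards[-1] += sequence_reward
    return rewards


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.
    Returns (advantages, returns); returns are the critic's regression targets."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip=0.2):
    """Clipped surrogate loss (to be minimized), averaged over tokens."""
    total = 0.0
    for new, old, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new - old)                      # pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip), 1.0 + clip)
        total += -min(ratio * adv, clipped * adv)        # pessimistic bound
    return total / len(advantages)
```

With `lam=1.0` the advantages reduce to the full return minus the value estimate, and with `lam=0.0` they reduce to the one-step TD residual; intermediate values trade variance against bias.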
Step 5: Monitor and checkpoint
Monitor training metrics including mean reward, KL divergence, policy loss, value loss, and entropy. Checkpoints are saved at configured intervals. After training completes, the final actor checkpoint can be used for inference or evaluation.
Key considerations:
- Mean reward should generally increase during training
- KL divergence should remain bounded (not diverge)
- Response length should be monitored to detect reward hacking, for example the policy inflating response length to collect more reward
- TensorRT-LLM acceleration can be enabled for faster generation
- Slurm hetjob scripts coordinate both processes on separate node allocations
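The monitoring checks above (rising reward, bounded KL, stable response length) can be expressed as a simple health check over logged metrics. This is a sketch only: the metric keys `reward`, `kl`, and `response_length`, the window size, and both thresholds are hypothetical values chosen for illustration and should be tuned to your own runs and logging setup.

```python
def _mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def check_ppo_health(history, window=2, kl_limit=5.0, max_length_ratio=1.5):
    """Compare the first and last `window` logged steps and return warnings.

    `history` is a list of dicts with the (assumed) keys
    "reward", "kl", and "response_length".
    """
    warnings = []
    if len(history) < 2 * window:
        return warnings  # not enough steps to compare early vs. recent
    early, recent = history[:window], history[-window:]

    if _mean(s["reward"] for s in recent) <= _mean(s["reward"] for s in early):
        warnings.append("reward: mean reward is not increasing")
    if _mean(s["kl"] for s in recent) > kl_limit:
        warnings.append("kl: divergence from the reference policy exceeds the limit")
    early_len = _mean(s["response_length"] for s in early)
    if _mean(s["response_length"] for s in recent) > max_length_ratio * early_len:
        warnings.append("length: responses are growing, possible reward hacking")
    return warnings
```

An empty list means none of the three heuristics fired; any warning is a prompt to inspect sample generations before continuing training.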