Workflow:NVIDIA NeMo Aligner RLHF PPO Training

Knowledge Sources	NeMo-Aligner, NeMo Aligner RLHF Guide, NeMo-Aligner Paper, PPO Algorithm
Domains	LLMs, RLHF, Model_Alignment, Reinforcement_Learning
Last Updated	2026-02-07 22:00 GMT

Overview

End-to-end Reinforcement Learning from Human Feedback (RLHF) pipeline using Proximal Policy Optimization (PPO) with a distributed multi-process architecture spanning actor, critic, reward model, and reference policy.

Description

This workflow implements the complete PPO-based RLHF training pipeline for aligning language models. It orchestrates four conceptual models across two separate processes: (1) the PPO Actor (the policy being trained) co-located with the Initial Policy (the reference model for KL divergence), and (2) the PPO Critic (the value network) co-located with the Reward Model. The two processes communicate through PyTriton HTTP servers. The training loop alternates between a rollout phase (generating responses and collecting rewards and value estimates) and an optimization phase (updating the actor with the PPO clipped surrogate loss, using advantages estimated via GAE and a KL penalty against the reference policy, and updating the critic). TensorRT-LLM acceleration is optional and targets the generation step of the rollout phase, which it can speed up significantly.
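For reference, the quantities named above can be written in their standard textbook forms. This is a sketch in generic notation: the symbols \epsilon, \gamma, \lambda, and \beta are not NeMo-Aligner configuration names, and folding the KL penalty into the per-token reward is one common formulation, assumed here rather than confirmed by this page. Tokens are indexed t = 0, ..., T-1 and V(s_T) is taken as 0.

```latex
\begin{aligned}
\rho_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \\
L^{\text{CLIP}}(\theta) &= \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right] \\
R_t &= -\beta\left(\log \pi_{\theta_\text{old}}(a_t \mid s_t) - \log \pi_\text{ref}(a_t \mid s_t)\right) + \mathbb{1}[t = T-1]\; r_\text{RM} \\
\delta_t &= R_t + \gamma\, V(s_{t+1}) - V(s_t) \\
\hat{A}_t &= \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l}
\end{aligned}
```

Here r_RM is the sequence-level score from the reward model, V is the critic's per-token value estimate, and the actor maximizes L^CLIP (equivalently, minimizes its negative).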

Key outputs:

An RLHF-aligned actor model checkpoint
Training metrics including rewards, KL divergence, and PPO loss components

Scope:

From a trained SFT model, a trained reward model, and prompt data to a policy-optimized aligned model

Usage

Execute this workflow after completing both SFT training and reward model training. You need an SFT-trained model (to initialize the actor and the reference policy), a trained reward model (to initialize the critic and provide the reward signal), and a dataset of prompts (without responses). PPO-based RLHF gives fine-grained control over the alignment process but is complex to operate, so it is best suited to cases where you need that control and have enough compute to run the two-process training setup.

Execution Steps

Step 1: Prepare prompt dataset

Format the RLHF training data as a JSONL file containing prompts only (no responses). The actor model will generate responses during the rollout phase. Prompts should follow the same template format used during SFT training. Create separate train and validation prompt files.

Key considerations:

Only prompts are needed; responses are generated by the actor during rollouts
The prompt template must match the one used during SFT
The Anthropic-HH-RLHF dataset is a common starting point
Data is processed using build_train_valid_test_rlhf_datasets
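The preparation described above can be sketched as a small helper that writes prompt-only JSONL files. This is an illustration only: the function name `write_prompt_files`, the output file names, and the `"text"` key are assumptions, so check the field name expected by your dataset configuration before using it. The prompts passed in are assumed to be already formatted with the SFT template.

```python
import json
import random
from pathlib import Path


def write_prompt_files(prompts, out_dir, valid_fraction=0.1, seed=0):
    """Shuffle prompts and write train/validation JSONL files (one JSON object per line)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    shuffled = list(prompts)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for reproducible splits

    # Hold out at least one prompt for validation.
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    splits = {"valid": shuffled[:n_valid], "train": shuffled[n_valid:]}

    paths = {}
    for name, rows in splits.items():
        path = out_dir / f"{name}_prompts.jsonl"  # hypothetical file name
        with path.open("w", encoding="utf-8") as f:
            for prompt in rows:
                # Prompt only: no response field, the actor generates it during rollouts.
                # The "text" key is an assumption, not confirmed by this page.
                f.write(json.dumps({"text": prompt}, ensure_ascii=False) + "\n")
        paths[name] = path
    return paths
```

Calling `write_prompt_files(prompts, "data/rlhf")` returns a dict with the `train` and `valid` file paths, which can then be referenced from the training configuration.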

Step 2: Launch reward model and critic server

Start the combined reward model and critic server process using serve_ppo_critic.py. This process loads the trained reward model checkpoint, initializes the critic network from the reward model weights, and exposes both as PyTriton HTTP endpoints. The critic provides per-token value estimates while the reward model provides sequence-level rewards.

What happens:

The trained reward model is loaded and the critic is initialized from it
A PyTriton server is started exposing inference endpoints
CPU weight offloading is used to manage memory when combining RM and critic
The server waits for HTTP requests from the actor process
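Because the server has to be up before the actor process can reach it, a launcher script may want to block until the server's port accepts connections. The helper below is a generic sketch, not part of NeMo-Aligner: `wait_for_port` is a hypothetical name, and the host and port are whatever your server configuration uses. It checks TCP reachability only, so a successful return does not prove that the models have finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=600.0, poll_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection as a reachability probe.
            with socket.create_connection((host, port), timeout=poll_s):
                return True
        except OSError:
            # Connection refused or timed out: the server is not listening yet.
            time.sleep(poll_s)
    return False
```

A launcher could call this with the critic/RM server's address and start the actor process only when it returns True.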

Step 3: Launch actor and reference policy training

Start the PPO actor training process using train_gpt_ppo_actor.py. This process loads the SFT model as the actor, saves a copy of the initial weights as the reference policy, builds the prompt dataloader, creates a RemoteGPTRMCriticClient to communicate with the critic/RM server, and initializes the PPOTrainer.

What happens:

The SFT model is loaded as the trainable PPO actor
The initial policy state dict is saved for KL divergence computation
A remote client connects to the critic/RM server via HTTP
PEFT/LoRA can be applied for memory-efficient training

Step 4: Execute PPO training loop

The PPO training loop alternates between rollout and optimization phases. In the rollout phase, the actor generates responses to prompts, the critic provides value estimates, and the reward model scores the responses. In the optimization phase, Generalized Advantage Estimation (GAE) computes advantages, and the actor is updated using the PPO clipped surrogate objective with entropy bonus and KL penalty.

Rollout phase:

Actor generates responses using sampling parameters
Generated responses are sent to the critic/RM server for scoring
The reward model returns sequence-level rewards
The critic returns per-token value estimates
Log probabilities are computed for the generated tokens
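The last rollout item, and the way the reference policy enters the reward, can be illustrated with a minimal pure-Python sketch. It assumes the KL penalty is folded into per-token rewards and the sequence-level reward is added on the final token, which is one common formulation and is not confirmed by this page. The function names and the `kl_coef` parameter are hypothetical.

```python
import math


def token_logprobs(logits, token_ids):
    """Log-probability of each generated token under a softmax over its logits row."""
    out = []
    for row, tok in zip(logits, token_ids):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        out.append(row[tok] - log_z)
    return out


def kl_shaped_rewards(actor_logprobs, ref_logprobs, sequence_reward, kl_coef=0.1):
    """Per-token rewards: a KL penalty on every token, plus the sequence reward on the last one."""
    # The per-token log-ratio is a sample estimate of the KL divergence from the reference policy.
    rewards = [-kl_coef * (a - r) for a, r in zip(actor_logprobs, ref_logprobs)]
    rewards[-1] += sequence_reward
    return rewards
```

In a real run the logits come from the actor and the frozen reference policy evaluated on the same generated tokens; here they are plain nested lists to keep the example self-contained.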

Optimization phase:

GAE computes advantages using rewards and value estimates
The PPO loss clips the policy ratio to prevent large updates
KL divergence penalty keeps the actor close to the reference policy
The critic, which lives in the critic/RM server process, is updated to better predict the return (sum of future rewards) from each token position
Gradients are synchronized across distributed workers
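The first two optimization items can be sketched in plain Python for a single response. This is a textbook-style illustration under stated assumptions, not NeMo-Aligner's implementation: the function names and default hyperparameters are hypothetical, the value after the final token is taken as 0, and the entropy bonus, critic loss, and distributed gradient synchronization are omitted.

```python
import math


def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response, computed back to front."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap from the next value estimate; 0 after the final token.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean negative clipped surrogate objective over tokens (lower is better)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new / pi_old for this token
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Take the pessimistic (smaller) objective, then negate to get a loss.
        total += -min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

The clipping means that once the ratio leaves the interval [1 - clip_eps, 1 + clip_eps] in the direction the advantage favors, the objective stops improving, which is what limits the size of each policy update.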

Step 5: Monitor and checkpoint

Monitor training metrics including mean reward, KL divergence, policy loss, value loss, and entropy. Checkpoints are saved at configured intervals. After training completes, the final actor checkpoint can be used for inference or evaluation.

Key considerations:

Mean reward should generally increase during training
KL divergence should remain bounded (not diverge)
Response length should be monitored to prevent reward hacking
TensorRT-LLM acceleration can be enabled for faster generation
Slurm hetjob scripts coordinate both processes on separate node allocations
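The first three considerations can be turned into a simple automated check over logged metrics. The sketch below is hypothetical throughout: the function name, the metric keys (`reward`, `kl`, `response_length`), and the thresholds are illustrative assumptions and do not correspond to NeMo-Aligner's logged metric names.

```python
def check_training_health(history, kl_limit=0.1, length_growth_limit=1.5):
    """Return a list of warnings from per-step metric dicts (oldest first)."""
    warnings = []
    if not history:
        return warnings
    first, last = history[0], history[-1]

    # KL divergence from the reference policy should stay bounded.
    if last["kl"] > kl_limit:
        warnings.append(f"KL {last['kl']:.3f} exceeds limit {kl_limit}")

    # Sharply growing responses can indicate the actor is exploiting the reward model.
    if last["response_length"] > length_growth_limit * first["response_length"]:
        warnings.append("response length grew sharply; check for reward hacking")

    # Mean reward is expected to trend upward over training.
    if last["reward"] < first["reward"]:
        warnings.append("mean reward is below its starting value")
    return warnings
```

Suitable thresholds depend on the model, reward scale, and KL coefficient, so the defaults here should be treated as placeholders to tune.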

GitHub URL

Workflow Repository