Workflow:NVIDIA NeMo Aligner REINFORCE Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, Model_Alignment, Reinforcement_Learning |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
End-to-end REINFORCE-based alignment pipeline that optimizes a language model policy using reward model feedback, offering a simpler alternative to PPO by eliminating the critic network.
Description
This workflow implements the REINFORCE algorithm for language model alignment, a streamlined variant of RLHF that does not require a separate critic (value) network. Unlike PPO, which uses an actor-critic architecture with four models (actor, critic, reward model, and reference policy), REINFORCE needs only three: the policy (actor), the reward model, and the reference policy. The reward model provides sequence-level rewards, and REINFORCE uses them directly, optionally after subtracting an RLOO (REINFORCE Leave-One-Out) baseline, to compute policy gradients. Dropping the critic reduces architectural complexity and compute requirements relative to PPO while achieving competitive alignment quality. The policy and the reward model run as separate processes that communicate over HTTP via PyTriton. TensorRT-LLM can optionally be enabled for faster generation.
Key outputs:
- A REINFORCE-aligned actor model checkpoint
- Training metrics including rewards, KL divergence, and policy loss
Scope:
- From a trained SFT model, a trained reward model, and prompt data to a policy-optimized aligned model
Usage
Execute this workflow after completing SFT training and reward model training. Choose REINFORCE over PPO when you want a simpler RLHF architecture without the critic network, reducing compute overhead and configuration complexity. REINFORCE was used to train Llama-3.1-Nemotron-70B-Instruct, demonstrating its effectiveness at scale.
Execution Steps
Step 1: Prepare prompt dataset
Format the RLHF training data as JSONL files containing prompts only (no responses). Prompts must follow the same template format used during SFT training. The actor will generate responses during the rollout phase, which will then be scored by the reward model. Create separate train and validation prompt files.
Key considerations:
- Same prompt format requirements as PPO RLHF
- Data is processed using build_train_valid_test_rlhf_datasets
- The Anthropic-HH-RLHF dataset is commonly used for training
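As a minimal sketch of the data preparation described above, the following writes prompts-only JSONL files with a train/validation split. The `text` key, the `PROMPT_TEMPLATE` string, and all function names are assumptions for illustration; substitute the exact template used during SFT and the key your dataset loader expects.

```python
import json
import random
from pathlib import Path

# Hypothetical template: replace with the exact template used during SFT.
PROMPT_TEMPLATE = "User: {question}\nAssistant:"


def build_prompt(question: str) -> str:
    """Wrap a raw question in the (assumed) SFT prompt template."""
    return PROMPT_TEMPLATE.format(question=question.strip())


def split_prompts(prompts, valid_fraction=0.1, seed=0):
    """Shuffle deterministically and split into (train, validation) lists."""
    shuffled = list(prompts)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def write_prompt_jsonl(prompts, path, key="text"):
    """Write one JSON object per line, each holding a prompt and no response."""
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt in prompts:
            # The "text" key is an assumption; match your loader's expected key.
            f.write(json.dumps({key: prompt}, ensure_ascii=False) + "\n")
```

Calling `write_prompt_jsonl(train, "train.jsonl")` and `write_prompt_jsonl(valid, "valid.jsonl")` on the two halves returned by `split_prompts` yields the separate train and validation prompt files.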
Step 2: Launch reward model server
Start the reward model inference server using serve_reward_model.py. Unlike PPO, which requires a combined critic and reward model server, REINFORCE needs only a reward model server. The server loads the trained reward model, freezes its weights, and exposes an inference endpoint via PyTriton. It tokenizes incoming prompt-response pairs and returns one scalar reward score per pair.
What happens:
- The trained reward model is loaded and frozen
- A PyTriton HTTP server is started on the configured port
- The server accepts batched inference requests and returns rewards
- No critic initialization is needed (unlike PPO)
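To illustrate the batched request pattern described above, here is a small sketch of how a caller might chunk prompt-response pairs and collect one scalar reward per pair. It does not use the PyTriton API: `score_batch` is a hypothetical stand-in for the remote call, and `dummy_score_batch` is a toy scorer included only so the sketch runs locally.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (prompt, response)


def score_in_batches(
    pairs: Sequence[Pair],
    score_batch: Callable[[List[Pair]], List[float]],
    batch_size: int = 8,
) -> List[float]:
    """Send pairs to the scorer in fixed-size batches and return one reward per pair."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    rewards: List[float] = []
    for start in range(0, len(pairs), batch_size):
        batch = list(pairs[start:start + batch_size])
        scores = score_batch(batch)  # stand-in for the request to the reward server
        if len(scores) != len(batch):
            raise RuntimeError("scorer must return exactly one reward per pair")
        rewards.extend(float(s) for s in scores)
    return rewards


def dummy_score_batch(batch: List[Pair]) -> List[float]:
    """Toy scorer (assumption): reward equals the response length in words."""
    return [float(len(response.split())) for _, response in batch]
```

In a real deployment the `score_batch` callable would wrap the client that talks to the reward model server; the ordering guarantee (rewards returned in the same order as the pairs) is what the rollout code relies on.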
Step 3: Launch actor training with the reference policy
Start the REINFORCE actor training process using train_gpt_reinforce_actor.py. This process loads the SFT model as the actor, saves a copy of the initial policy weights to serve as the frozen reference policy for KL divergence computation, creates a RemoteGPTRMClient to communicate with the reward model server, and initializes the ReinforceTrainer.
What happens:
- The SFT model is loaded as the trainable REINFORCE actor
- Initial policy weights are saved for KL penalty computation
- A remote client connects to the reward model server
- PEFT/LoRA can optionally be applied for memory efficiency
- TensorRT-LLM can be enabled for accelerated generation
Step 4: Execute REINFORCE training loop
The training loop alternates between rollout and optimization phases. During rollout, the actor generates responses to sampled prompts and the reward model scores them. During optimization, REINFORCE computes policy gradients from the reward signal, with a KL penalty against the reference policy that keeps the actor from drifting too far from its starting point. The RLOO (REINFORCE Leave-One-Out) variant reduces gradient variance by using the rewards of the other responses sampled for the same prompt as the baseline for each response.
Rollout phase:
- Actor generates multiple response completions per prompt
- Responses are sent to the reward model server for scoring
- Log probabilities are computed for the generated tokens
- Reference policy log probabilities are computed for KL penalty
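The last two rollout items feed the KL penalty. One common way to turn them into a penalty, sketched here under assumptions, is a single-sample estimate that sums the actor-minus-reference log-probability differences over the generated tokens and subtracts a scaled version from the reward. The function names and the `kl_coef` parameter are hypothetical and are not taken from the NeMo Aligner configuration.

```python
def sequence_kl_estimate(actor_logprobs, ref_logprobs, response_mask):
    """Single-sample KL estimate: sum of log-ratios over generated tokens only.

    response_mask holds 1 for generated tokens and 0 for prompt or padding tokens.
    """
    if not (len(actor_logprobs) == len(ref_logprobs) == len(response_mask)):
        raise ValueError("all inputs must have the same length")
    return sum(
        (a - r) * m
        for a, r, m in zip(actor_logprobs, ref_logprobs, response_mask)
    )


def kl_penalized_reward(reward, kl, kl_coef=0.01):
    """Subtract a scaled KL estimate from the sequence-level reward.

    kl_coef is a hypothetical name for the penalty strength.
    """
    return reward - kl_coef * kl
```

A larger `kl_coef` holds the actor closer to the reference policy at the cost of slower reward improvement.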
Optimization phase:
- Rewards are normalized and baseline-subtracted (RLOO)
- Policy gradient is computed using the REINFORCE estimator
- KL divergence penalty keeps the actor close to the reference policy
- Gradients are synchronized across distributed workers
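The baseline subtraction and the REINFORCE estimator can be sketched in a few lines of plain Python. This is an illustration of the technique for the responses to a single prompt, not the trainer's actual implementation; the function names are hypothetical, and in real training the advantages are treated as constants while gradients flow only through the log-probabilities.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K responses sampled for the same prompt.

    Each response's baseline is the mean reward of the other K - 1 responses.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def reinforce_loss(advantages, seq_logprobs):
    """REINFORCE surrogate loss: negative mean of advantage times sequence log-prob.

    seq_logprobs[i] is the summed log-probability of response i under the actor.
    """
    if len(advantages) != len(seq_logprobs):
        raise ValueError("advantages and seq_logprobs must have the same length")
    weighted = sum(a * lp for a, lp in zip(advantages, seq_logprobs))
    return -weighted / len(advantages)
```

For rewards `[1.0, 2.0, 3.0]` the advantages are `[-1.5, 0.0, 1.5]`: the leave-one-out baseline centers the rewards around zero without a learned critic, so minimizing the loss raises the probability of above-average responses and lowers that of below-average ones.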
Step 5: Monitor and checkpoint
Monitor training metrics including mean reward, KL divergence, and policy loss. Checkpoints are saved at configured intervals. Training is launched with Slurm heterogeneous job (hetjob) scripts, which coordinate the actor and the reward model server on separate node allocations. After training completes, the aligned actor checkpoint can be used for inference.
Key considerations:
- Mean reward should increase over the course of training
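- KL divergence from the reference policy should remain bounded; steady growth can indicate that the actor is drifting away from the reference policy
- TensorRT-LLM (TRT-LLM) acceleration with resharding can improve generation throughput
- The use_flask option enables load balancing across data-parallel (DP) workers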
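The first two considerations can be checked mechanically from logged metrics. The sketch below compares the mean reward in the first and last few logged steps and tests the KL divergence against a limit. The function name, the window size, and the `kl_limit` threshold are assumptions to be tuned per run, not values prescribed by the workflow.

```python
def summarize_run(mean_rewards, kls, kl_limit, window=3):
    """Compare early and late mean reward and check that KL stays under a limit."""
    if len(mean_rewards) < 2 * window:
        raise ValueError("need at least 2 * window reward measurements")
    if not kls:
        raise ValueError("need at least one KL measurement")
    early = sum(mean_rewards[:window]) / window
    late = sum(mean_rewards[-window:]) / window
    max_kl = max(kls)
    return {
        "reward_gain": late - early,        # positive when reward is trending up
        "reward_improved": late > early,
        "max_kl": max_kl,
        "kl_bounded": max_kl <= kl_limit,   # kl_limit is a user-chosen threshold
    }
```

A run where `reward_improved` is true but `kl_bounded` is false deserves a closer look, since reward gains bought with large policy drift may not reflect better responses.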