Workflow:NVIDIA NeMo Aligner REINFORCE Training

Knowledge Sources	NeMo-Aligner, NeMo Aligner REINFORCE Guide, NeMo-Aligner Paper
Domains	LLMs, RLHF, Model_Alignment, Reinforcement_Learning
Last Updated	2026-02-07 22:00 GMT

Overview

End-to-end REINFORCE-based alignment pipeline that optimizes a language model policy using reward model feedback, offering a simpler alternative to PPO by eliminating the critic network.

Description

This workflow implements the REINFORCE algorithm for language model alignment, a streamlined variant of RLHF that does not require a separate critic (value) network. Unlike PPO, which uses an actor-critic architecture with four models (actor, critic, reward model, and reference policy), REINFORCE operates with only three: the policy (actor), the reward model, and the reference policy. The reward model provides sequence-level rewards, and REINFORCE uses these rewards directly (with optional baseline subtraction via RLOO) to compute policy gradients. This reduces architectural complexity and compute requirements compared to PPO while achieving competitive alignment quality. The policy and the reward model run as separate processes that communicate over HTTP via PyTriton. Optional TensorRT-LLM acceleration is supported for faster generation.

Key outputs:

A REINFORCE-aligned actor model checkpoint
Training metrics including rewards, KL divergence, and policy loss

Scope:

From a trained SFT model, a trained reward model, and prompt data to a policy-optimized aligned model

Usage

Execute this workflow after completing SFT and reward model training. Choose REINFORCE over PPO when you want a simpler RLHF architecture without the critic network, which reduces compute overhead and configuration complexity. REINFORCE was used to train Llama-3.1-Nemotron-70B-Instruct, demonstrating its effectiveness at scale.

Execution Steps

Step 1: Prepare prompt dataset

Format the RLHF training data as JSONL files containing prompts only (no responses). Prompts must follow the same template format used during SFT training. The actor will generate responses during the rollout phase, which will then be scored by the reward model. Create separate train and validation prompt files.

Key considerations:

Same prompt format requirements as PPO RLHF
Data is processed using build_train_valid_test_rlhf_datasets
The Anthropic-HH-RLHF dataset is commonly used for training
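
A minimal sketch of the prompt-only JSONL layout described above. The `text` field name, the output file names, and the prompt template are assumptions made for illustration; use the key and template that your SFT data and dataset configuration actually expect.

```python
import json
import random
from pathlib import Path

# Hypothetical template; it must match whatever was used during SFT.
PROMPT_TEMPLATE = "User: {question}\nAssistant:"


def write_prompt_files(questions, out_dir, valid_fraction=0.1, seed=0):
    """Write prompt-only train/validation JSONL files and return their paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Shuffle deterministically, then hold out a validation slice.
    shuffled = list(questions)
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    splits = {"valid": shuffled[:n_valid], "train": shuffled[n_valid:]}

    paths = {}
    for name, items in splits.items():
        path = out_dir / f"{name}_prompts.jsonl"
        with path.open("w", encoding="utf-8") as f:
            for q in items:
                # One JSON object per line, prompt only (no response).
                record = {"text": PROMPT_TEMPLATE.format(question=q)}
                f.write(json.dumps(record, ensure_ascii=False) + "\n")
        paths[name] = path
    return paths
```

Calling `write_prompt_files(questions, "data/")` would produce `train_prompts.jsonl` and `valid_prompts.jsonl`, which can then be referenced from the training configuration.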

Step 2: Launch reward model server

Start the reward model inference server using serve_reward_model.py. Unlike PPO, which requires a combined critic+RM server, REINFORCE needs only a reward model server. The server loads the trained reward model, freezes its weights, and exposes an inference endpoint via PyTriton. It tokenizes incoming prompt-response pairs and returns scalar reward scores.

What happens:

The trained reward model is loaded and frozen
A PyTriton HTTP server is started on the configured port
The server accepts batched inference requests and returns rewards
No critic initialization is needed (unlike PPO)

Step 3: Launch actor and reference policy training

Start the REINFORCE actor training process using train_gpt_reinforce_actor.py. This process loads the SFT model as the actor, saves a copy of the initial policy weights to serve as the reference policy for KL divergence computation, creates a RemoteGPTRMClient to communicate with the reward model server, and initializes the ReinforceTrainer.

What happens:

The SFT model is loaded as the trainable REINFORCE actor
Initial policy weights are saved for KL penalty computation
A remote client connects to the reward model server
PEFT/LoRA can optionally be applied for memory efficiency
TensorRT-LLM can be enabled for accelerated generation

Step 4: Execute REINFORCE training loop

The training loop alternates between rollout and optimization phases. During rollout, the actor generates responses to sampled prompts and the reward model scores them. During optimization, REINFORCE computes policy gradients from the reward signal, with a KL penalty against the reference policy to keep the actor from drifting too far from it. The RLOO (REINFORCE Leave-One-Out) variant uses the rewards of the other responses sampled for the same prompt as a baseline to reduce gradient variance.
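
One common way to write the estimator described above, offered as a sketch and not as the exact NeMo-Aligner implementation. The symbols K (responses sampled per prompt x), β (KL penalty coefficient), and the sequence-level form of the KL term are notation assumed here for illustration.

```latex
\tilde{R}_i = R(x, y_i) - \beta \left[ \log \pi_\theta(y_i \mid x) - \log \pi_{\mathrm{ref}}(y_i \mid x) \right]

b_i = \frac{1}{K-1} \sum_{j \neq i} \tilde{R}_j

\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{i=1}^{K} \left( \tilde{R}_i - b_i \right) \nabla_\theta \log \pi_\theta(y_i \mid x)
```

Because the baseline b_i is built only from the other K-1 responses, it does not depend on y_i, so subtracting it lowers the variance of the estimate without biasing it.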

Rollout phase:

Actor generates multiple response completions per prompt
Responses are sent to the reward model server for scoring
Log probabilities are computed for the generated tokens
Reference policy log probabilities are computed for KL penalty

Optimization phase:

Rewards are normalized and baseline-subtracted (RLOO)
Policy gradient is computed using the REINFORCE estimator
KL divergence penalty keeps the actor close to the reference policy
Gradients are synchronized across distributed workers
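
The optimization phase above can be sketched in plain Python for the responses sampled from a single prompt. This is an illustrative sketch under stated assumptions, not the ReinforceTrainer code: the function names, the `kl_coef` parameter, and the sequence-level KL estimate are hypothetical, and reward normalization is omitted for brevity.

```python
def rloo_baselines(rewards):
    """Leave-one-out baseline: for each sample, the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [(total - r) / (k - 1) for r in rewards]


def reinforce_loss(rewards, actor_logprobs, ref_logprobs, kl_coef=0.01):
    """Scalar REINFORCE surrogate loss for the K samples of one prompt.

    rewards:        reward-model score per sampled response
    actor_logprobs: summed token log-probs under the current actor
    ref_logprobs:   summed token log-probs under the reference policy
    """
    # Fold the KL penalty into the reward (sequence-level KL estimate).
    penalized = [
        r - kl_coef * (lp - ref_lp)
        for r, lp, ref_lp in zip(rewards, actor_logprobs, ref_logprobs)
    ]
    baselines = rloo_baselines(penalized)
    advantages = [r - b for r, b in zip(penalized, baselines)]
    # In a real trainer the advantages are treated as constants (no gradient)
    # and only actor_logprobs is differentiated.
    loss = -sum(a * lp for a, lp in zip(advantages, actor_logprobs)) / len(rewards)
    return loss, advantages
```

The leave-one-out advantages for one prompt always sum to zero, so responses are only pushed up or down relative to their siblings. This is why the method needs at least two completions per prompt.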

Step 5: Monitor and checkpoint

Monitor training metrics including mean reward, KL divergence, and policy loss. Checkpoints are saved at configured intervals. Training uses Slurm heterogeneous job (hetjob) scripts to coordinate the actor and the reward model server on separate node allocations. After training completes, the aligned actor checkpoint can be used for inference.

Key considerations:

Mean reward should increase over the course of training
KL divergence should remain bounded
TRT-LLM acceleration with resharding can improve generation throughput
The use_flask option enables load balancing across data-parallel (DP) workers
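
The two health checks above (rising mean reward, bounded KL) can be automated over logged metrics. The sketch below is a hypothetical helper, not part of NeMo-Aligner; the function name, the `window` size, and the `kl_limit` threshold are assumptions to be tuned per run.

```python
def check_training_health(mean_rewards, kl_values, kl_limit, window=3):
    """Return a list of warning strings for a run's logged metrics."""
    warnings = []

    # Compare the average reward of the first and last `window` logged steps.
    if len(mean_rewards) >= 2 * window:
        early = sum(mean_rewards[:window]) / window
        late = sum(mean_rewards[-window:]) / window
        if late <= early:
            warnings.append(f"mean reward not improving: {early:.3f} -> {late:.3f}")

    # Report the first step at which KL exceeds the chosen limit.
    for step, kl in enumerate(kl_values):
        if kl > kl_limit:
            warnings.append(f"KL {kl:.3f} exceeds limit {kl_limit} at step {step}")
            break

    return warnings
```

An empty list means neither check fired. Any warning is a prompt to inspect the run, for example by revisiting the KL penalty coefficient or the learning rate.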
