Workflow:NVIDIA NeMo Aligner REINFORCE Training

From Leeroopedia
Revision as of 11:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/NVIDIA_NeMo_Aligner_REINFORCE_Training.md)


Knowledge Sources
Domains: LLMs, RLHF, Model_Alignment, Reinforcement_Learning
Last Updated: 2026-02-07 22:00 GMT

Overview

End-to-end REINFORCE-based alignment pipeline that optimizes a language model policy using reward model feedback, offering a simpler alternative to PPO by eliminating the critic network.

Description

This workflow implements the REINFORCE algorithm for language model alignment, a streamlined variant of RLHF that does not require a separate critic (value) network. Unlike PPO, which uses an actor-critic architecture with four models, REINFORCE operates with only three: the policy (actor), the reward model, and the reference policy. The reward model provides sequence-level rewards, and REINFORCE uses these rewards directly (optionally subtracting an RLOO baseline) to compute policy gradients. Dropping the critic reduces architectural complexity and compute requirements relative to PPO while achieving competitive alignment quality. The policy and reward model run as separate processes that communicate over HTTP via PyTriton. TensorRT-LLM acceleration is optionally supported for faster generation.
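
As a rough sketch, the policy gradient with a leave-one-out baseline can be written as follows. The notation is an assumption made here for illustration and does not come from the workflow itself: x is a prompt, y_1, ..., y_K are K sampled responses, R is the reward model score, and π_θ is the policy. The KL penalty against the reference policy is omitted.

```latex
\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{i=1}^{K}
  \left( R(x, y_i) - \frac{1}{K-1} \sum_{j \neq i} R(x, y_j) \right)
  \nabla_\theta \log \pi_\theta(y_i \mid x)
```

The term in parentheses plays the role that the critic's value estimate plays in PPO, which is why no separate value network is needed.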

Key outputs:

  • A REINFORCE-aligned actor model checkpoint
  • Training metrics including rewards, KL divergence, and policy loss

Scope:

  • From a trained SFT model, a trained reward model, and prompt data to a policy-optimized aligned model

Usage

Run this workflow after SFT and reward model training are complete. Choose REINFORCE over PPO when you want a simpler RLHF architecture without a critic network, which reduces compute overhead and configuration complexity. REINFORCE was used to train Llama-3.1-Nemotron-70B-Instruct, demonstrating its effectiveness at scale.

Execution Steps

Step 1: Prepare prompt dataset

Format the RLHF training data as JSONL files containing prompts only (no responses). Prompts must follow the same template format used during SFT training. The actor will generate responses during the rollout phase, which will then be scored by the reward model. Create separate train and validation prompt files.

Key considerations:

  • Same prompt format requirements as PPO RLHF
  • Data is processed using build_train_valid_test_rlhf_datasets
  • The Anthropic-HH-RLHF dataset is commonly used for training
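
A minimal sketch of producing prompt-only JSONL files is shown below. It assumes each record stores the prompt under a `text` key and uses a hypothetical prompt template. Both are illustrative assumptions: the key expected by the data loader and the template used during SFT should be checked against your own setup.

```python
import json
import random
from pathlib import Path

# Hypothetical template; replace it with the exact template used during SFT.
PROMPT_TEMPLATE = "User: {question}\nAssistant:"


def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle deterministically and split into (train, validation) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_val = max(1, int(len(items) * val_fraction))
    return items[n_val:], items[:n_val]


def write_prompt_jsonl(questions, path, key="text"):
    """Write one JSON object per line holding only the formatted prompt.

    The `key` default is an assumption, not a confirmed field name.
    """
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for question in questions:
            record = {key: PROMPT_TEMPLATE.format(question=question)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

No response field is written, since the actor generates responses during rollout.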

Step 2: Launch reward model server

Start the reward model inference server using serve_reward_model.py. Unlike PPO, which requires a combined critic+RM server, REINFORCE needs only a reward model server. The server loads the trained reward model, freezes its weights, and exposes an inference endpoint via PyTriton. The server tokenizes incoming prompt-response pairs and returns scalar reward scores.

What happens:

  • The trained reward model is loaded and frozen
  • A PyTriton HTTP server is started on the configured port
  • The server accepts batched inference requests and returns rewards
  • No critic initialization is needed (unlike PPO)

Step 3: Launch actor and reference policy training

Start the REINFORCE actor training process using train_gpt_reinforce_actor.py. This process loads the SFT model as the actor, saves a copy of the initial policy weights to serve as the reference policy for KL divergence computation, creates a RemoteGPTRMClient to communicate with the reward model server, and initializes the ReinforceTrainer.

What happens:

  • The SFT model is loaded as the trainable REINFORCE actor
  • Initial policy weights are saved for KL penalty computation
  • A remote client connects to the reward model server
  • PEFT/LoRA can optionally be applied for memory efficiency
  • TensorRT-LLM can be enabled for accelerated generation

Step 4: Execute REINFORCE training loop

The training loop alternates between rollout and optimization phases. During rollout, the actor generates responses to sampled prompts and the reward model scores them. During optimization, REINFORCE computes policy gradients from the reward signal, with a KL penalty against the reference policy to keep the actor from drifting too far from it. The RLOO (REINFORCE Leave-One-Out) variant uses the other responses sampled for the same prompt as a baseline to reduce gradient variance.

Rollout phase:

  • Actor generates multiple response completions per prompt
  • Responses are sent to the reward model server for scoring
  • Log probabilities are computed for the generated tokens
  • Reference policy log probabilities are computed for KL penalty
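
The way the two sets of log probabilities combine into a KL penalty can be sketched as below. This is one common formulation, not a confirmed description of the trainer: the function names, the `kl_coeff` parameter, and the choice to subtract the penalty from the sequence-level reward are all assumptions made for illustration.

```python
def sequence_kl_estimate(actor_logprobs, ref_logprobs):
    """Sample-based KL estimate for one response.

    Sums log pi_actor(token) - log pi_ref(token) over the generated tokens.
    """
    if len(actor_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    return sum(a - r for a, r in zip(actor_logprobs, ref_logprobs))


def kl_penalized_reward(reward, actor_logprobs, ref_logprobs, kl_coeff=0.01):
    """Subtract a scaled KL estimate from the reward model score.

    kl_coeff is a hypothetical hyperparameter name and default.
    """
    kl = sequence_kl_estimate(actor_logprobs, ref_logprobs)
    return reward - kl_coeff * kl
```

For example, with actor log probabilities `[-1.0, -2.0]`, reference log probabilities `[-1.5, -2.5]`, a reward of `2.0`, and `kl_coeff=0.1`, the KL estimate is `1.0` and the penalized reward is `1.9`.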

Optimization phase:

  • Rewards are normalized and baseline-subtracted (RLOO)
  • Policy gradient is computed using the REINFORCE estimator
  • KL divergence penalty keeps the actor close to the reference policy
  • Gradients are synchronized across distributed workers
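
The leave-one-out baseline and the resulting surrogate loss can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the trainer's implementation: function names are hypothetical, reward normalization and the KL penalty are omitted, and in an autograd framework the advantages would be treated as constants while gradients flow through the log probabilities.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for K responses to the same prompt.

    Each response's baseline is the mean reward of the other K - 1 responses.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two responses per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def reinforce_surrogate_loss(advantages, seq_logprobs):
    """Surrogate loss -mean(A_i * log pi(y_i | x)).

    Its gradient with respect to the log probabilities matches the
    REINFORCE estimator when the advantages are held constant.
    """
    if len(advantages) != len(seq_logprobs):
        raise ValueError("need one log probability per advantage")
    weighted = sum(a * lp for a, lp in zip(advantages, seq_logprobs))
    return -weighted / len(advantages)
```

With rewards `[1.0, 2.0, 3.0]`, the advantages are `[-1.5, 0.0, 1.5]`: responses scoring above their peers are reinforced and those scoring below are discouraged, without any learned value function.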

Step 5: Monitor and checkpoint

Monitor training metrics including mean reward, KL divergence, and policy loss. Checkpoints are saved at configured intervals. Training uses Slurm heterogeneous job (hetjob) scripts to coordinate the actor and the reward model server on separate node allocations. After training completes, the aligned actor checkpoint can be used for inference.

Key considerations:

  • Mean reward should increase over the course of training
  • KL divergence should remain bounded
  • TRT-LLM acceleration with resharding can improve generation throughput
  • The use_flask option enables load balancing across data-parallel (DP) workers
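
The first two checks above can be automated with a small helper over logged metrics. The sketch below is hypothetical: the function name, the `kl_limit` threshold, and the windowed comparison are illustrative assumptions, and suitable values depend on the model and reward scale.

```python
def check_training_health(mean_rewards, kl_values, kl_limit, window=3):
    """Return a list of warning strings based on logged per-step metrics.

    mean_rewards and kl_values are lists ordered by training step.
    kl_limit is a user-chosen bound, not a recommended value.
    """
    warnings = []
    # Compare the average of the first and last `window` reward values.
    if len(mean_rewards) >= 2 * window:
        early = sum(mean_rewards[:window]) / window
        recent = sum(mean_rewards[-window:]) / window
        if recent <= early:
            warnings.append("mean reward is not increasing")
    # Flag any recent KL value above the chosen bound.
    if kl_values and max(kl_values[-window:]) > kl_limit:
        warnings.append("KL divergence exceeds limit")
    return warnings
```

A run with rising reward and small KL returns an empty list, while a flat or falling reward, or a KL value above the bound, produces the corresponding warning.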
