Workflow:NVIDIA NeMo Aligner Reward Model Training

From Leeroopedia


Knowledge Sources
Domains LLMs, Reward_Modeling, Model_Alignment
Last Updated 2026-02-07 22:00 GMT

Overview

End-to-end process for training a reward model from human preference data, producing a model that scores response quality for use in RLHF pipelines.

Description

This workflow trains a reward model that learns to assign scalar quality scores to prompt-response pairs based on human preference rankings. Two reward model types are supported: binary ranking (pairwise comparison loss as in InstructGPT) and regression (MSE loss for multi-attribute numeric labels). The trained reward model is a critical component of the RLHF pipeline, where it provides the reward signal that guides policy optimization via PPO or REINFORCE. The reward model is initialized from a pretrained or SFT model and trained using the SupervisedTrainer with comparison-format data.

Key outputs:

  • A trained reward model checkpoint (megatron_gpt.nemo)
  • The model can be deployed as a PyTriton inference server for RLHF

Scope:

  • From a pretrained/SFT .nemo checkpoint and pairwise comparison data to a deployable reward model

Usage

Run this workflow after completing SFT and before starting PPO- or REINFORCE-based RLHF. You need a dataset of prompt-response pairs in which the responses to each prompt are ranked by quality (chosen vs. rejected). The trained reward model then serves as the scoring function during reinforcement learning.

Execution Steps

Step 1: Prepare comparison dataset

Format preference data into JSONL files in which each consecutive pair of records holds the chosen (good) and rejected (bad) responses to the same prompt. For binary ranking models, each pair consists of prompt || good_response followed by prompt || bad_response, where || denotes concatenation. For regression models, each example instead carries multi-attribute numeric labels.

Key considerations:

  • For binary ranking: pairs must be ordered with chosen response first, rejected second
  • The text field concatenates prompt and response with the model's expected template
  • Regression models require numeric attribute labels per data point
  • Create separate files for train and validation splits
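As a minimal sketch of the binary ranking layout described above, the following Python writes each preference example as two consecutive JSONL records, chosen first and rejected second. The "text" field name comes from this page; the helper names, the input dictionary keys, and the default template are hypothetical and should be replaced with the prompt template your base model expects.

```python
import json


def build_comparison_lines(prompt, chosen, rejected, template="{prompt}{response}"):
    """Return two JSONL lines for one preference example: chosen first, rejected second.

    `template` is a hypothetical placeholder for the model's expected prompt template.
    """
    records = [
        {"text": template.format(prompt=prompt, response=chosen)},
        {"text": template.format(prompt=prompt, response=rejected)},
    ]
    return [json.dumps(record, ensure_ascii=False) for record in records]


def write_comparison_file(path, examples, template="{prompt}{response}"):
    """Write examples (dicts with hypothetical keys prompt/chosen/rejected) to a JSONL file."""
    with open(path, "w", encoding="utf-8") as handle:
        for example in examples:
            lines = build_comparison_lines(
                example["prompt"], example["chosen"], example["rejected"], template
            )
            for line in lines:
                handle.write(line + "\n")
```

Calling write_comparison_file once per split keeps the train and validation files separate, as the last consideration above requires.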

Step 2: Select reward model type

Choose between binary ranking and regression reward model architectures based on your data and use case. Binary ranking models learn from pairwise comparisons using the Bradley-Terry loss, while regression models fit multi-attribute scores using MSE loss. The model type is configured via model.reward_model_type.

Key considerations:

  • Binary ranking is the standard choice for RLHF (InstructGPT-style)
  • Regression supports multi-attribute scoring (e.g., helpfulness, safety, coherence)
  • For regression with Bradley-Terry loss variants, micro_batch_size must be 2
  • The model class is selected automatically based on the type configuration
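To make the difference between the two objectives concrete, here is a small pure-Python sketch of the standard Bradley-Terry pairwise loss, -log(sigmoid(r_chosen - r_rejected)), and of a mean squared error over attribute scores. The function names are illustrative, and the NeMo Aligner implementation may differ in batching, reduction, and loss variants.

```python
import math


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise ranking loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Computed as softplus(-diff) in a numerically stable form.
    """
    diff = reward_chosen - reward_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def mse_loss(predicted, labels):
    """Mean squared error between predicted attribute scores and numeric labels."""
    if len(predicted) != len(labels) or not labels:
        raise ValueError("predicted and labels must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted, labels)) / len(labels)
```

The ranking loss depends only on the difference between the two rewards, so it shrinks as the chosen response is scored further above the rejected one, while the regression loss pulls each predicted attribute toward its labeled value.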

Step 3: Configure and launch training

Set up the training configuration including the pretrained checkpoint path, data paths, batch sizes, and training hyperparameters. The training script loads the base model with a reward head, initializes distributed training, builds reward model datasets, and runs the SupervisedTrainer loop with comparison-based loss.

What happens:

  • The pretrained model is loaded and augmented with a reward prediction head
  • The reward head maps hidden states to a scalar reward value
  • Training optimizes the model to assign higher rewards to chosen responses
  • Validation accuracy tracks the percentage of correctly ranked pairs
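The reward head described above can be pictured as a linear projection from a hidden-state vector to a single number. The sketch below is illustrative only: it assumes the reward is read from the hidden state at one token position (the last by default) and uses a plain weight vector plus bias; the actual head shape and token selection in NeMo Aligner are not specified on this page.

```python
def reward_head(hidden_states, weight, bias=0.0, position=-1):
    """Project the hidden state at `position` to a scalar reward.

    hidden_states: list of per-token hidden vectors (lists of floats).
    weight: list of floats with the same length as one hidden vector.
    Reading the reward at the last token is an assumption made for illustration.
    """
    hidden = hidden_states[position]
    if len(hidden) != len(weight):
        raise ValueError("weight must match the hidden size")
    return sum(w * x for w, x in zip(weight, hidden)) + bias
```

During training, the weights of this projection and of the underlying model are adjusted so that the scalar is higher for chosen responses than for rejected ones.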

Step 4: Validate reward model quality

Monitor validation accuracy during training to ensure the model learns meaningful preference rankings. The validation accuracy should increase as training progresses. For binary ranking models, accuracy represents the fraction of pairs where the chosen response receives a higher reward than the rejected response.

Key considerations:

  • Typical validation accuracy ranges from 65% to 75%, depending on data quality
  • Overfitting can occur with small datasets; monitor train vs validation metrics
  • The reward model should generalize beyond the specific responses it was trained on
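The accuracy metric for binary ranking models can be sketched in a few lines. This example assumes rewards arrive as a flat list in the same order as the comparison data (chosen, rejected, chosen, rejected, and so on) and counts ties as incorrect; both choices are assumptions for illustration rather than a description of the trainer's internals.

```python
def pairwise_accuracy(rewards):
    """Fraction of pairs in which the chosen reward exceeds the rejected reward.

    rewards: flat list ordered [chosen_0, rejected_0, chosen_1, rejected_1, ...].
    Ties count as incorrect in this sketch.
    """
    if not rewards or len(rewards) % 2 != 0:
        raise ValueError("rewards must contain a non-zero, even number of scores")
    pairs = list(zip(rewards[0::2], rewards[1::2]))
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)
```

Computing the same metric on a slice of the training data alongside the validation split gives a quick check for the overfitting mentioned above.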

Step 5: Export for RLHF deployment

After training, the reward model checkpoint (megatron_gpt.nemo) is saved and can be deployed in one of two ways: as a standalone PyTriton inference server (for PPO with a separate critic) or co-located with the critic model (for combined RM+Critic serving). Which deployment to use depends on the chosen RLHF architecture.

Key considerations:

  • For PPO: the reward model initializes the critic network and serves rewards
  • For REINFORCE: only the reward model server is needed (no critic)
  • The serve_reward_model.py script launches the PyTriton inference server
  • CPU weight offloading can be used when co-locating RM and critic
