Workflow:NVIDIA NeMo Aligner Reward Model Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reward_Modeling, Model_Alignment |
| Last Updated | 2026-02-07 22:00 GMT |
Overview
End-to-end process for training a reward model from human preference data, producing a model that scores response quality for use in RLHF pipelines.
Description
This workflow trains a reward model that learns to assign scalar quality scores to prompt-response pairs based on human preference rankings. Two reward model types are supported: binary ranking (pairwise comparison loss as in InstructGPT) and regression (MSE loss for multi-attribute numeric labels). The trained reward model is a critical component of the RLHF pipeline, where it provides the reward signal that guides policy optimization via PPO or REINFORCE. The reward model is initialized from a pretrained or SFT model and trained using the SupervisedTrainer with comparison-format data.
Key outputs:
- A trained reward model checkpoint (megatron_gpt.nemo)
- A model that can be deployed as a PyTriton inference server for RLHF
Scope:
- From a pretrained/SFT .nemo checkpoint and pairwise comparison data to a deployable reward model
Usage
Execute this workflow after completing SFT training and before starting PPO or REINFORCE-based RLHF. You need a dataset of prompt-response pairs where responses are ranked by quality (chosen vs rejected). The trained reward model will serve as the scoring function during reinforcement learning.
Execution Steps
Step 1: Prepare comparison dataset
Format preference data into JSONL files in which each consecutive pair of lines holds the chosen (good) and rejected (bad) responses to the same prompt. For binary ranking models, each pair consists of prompt || good_response followed by prompt || bad_response, where || denotes concatenation. For regression models, each example also carries multi-attribute numeric labels.
Key considerations:
- For binary ranking: pairs must be ordered with chosen response first, rejected second
- The text field concatenates prompt and response with the model's expected template
- Regression models require numeric attribute labels per data point
- Create separate files for train and validation splits
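The binary ranking layout described above can be sketched as follows. This is a minimal illustration, not the project's own preprocessing script: `format_example`, `write_comparisons`, the newline template, and the output file name are hypothetical, and the only detail taken from this page is that each line has a text field holding prompt plus response, with the chosen line written before the rejected one.

```python
import json
import tempfile
from pathlib import Path


def format_example(prompt: str, response: str) -> str:
    # Hypothetical template: substitute the prompt format your base model expects.
    return f"{prompt}\n{response}"


def write_comparisons(path, comparisons):
    """Write (prompt, chosen, rejected) triples as consecutive JSONL lines."""
    with Path(path).open("w", encoding="utf-8") as f:
        for prompt, chosen, rejected in comparisons:
            # Chosen response first, rejected response second, same prompt.
            f.write(json.dumps({"text": format_example(prompt, chosen)}) + "\n")
            f.write(json.dumps({"text": format_example(prompt, rejected)}) + "\n")


comparisons = [
    ("What is 2 + 2?", "2 + 2 = 4.", "2 + 2 = 5."),
    ("Name a primary color.", "Red is a primary color.", "Bananas."),
]

# Written to a scratch directory here; use your own train/validation paths.
out_path = Path(tempfile.mkdtemp()) / "train_comparisons.jsonl"
write_comparisons(out_path, comparisons)
```

Running the same function on a held-out list of triples produces the validation split as a separate file.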
Step 2: Select reward model type
Choose between binary ranking and regression reward model architectures based on your data and use case. Binary ranking models learn from pairwise comparisons using the Bradley-Terry loss, while regression models fit multi-attribute scores using MSE loss. The model type is configured via model.reward_model_type.
Key considerations:
- Binary ranking is the standard choice for RLHF (InstructGPT-style)
- Regression supports multi-attribute scoring (e.g., helpfulness, safety, coherence)
- For regression with Bradley-Terry loss variants, micro_batch_size must be 2
- The model class is selected automatically based on the type configuration
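To make the difference between the two objectives concrete, here is a small sketch of both losses on plain floats. It assumes the standard Bradley-Terry form, the negative log-sigmoid of the reward difference, and a mean squared error over attributes; the function names are illustrative and this is not the library's training code.

```python
import math


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise ranking loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = r_chosen - r_rejected
    # -log(sigmoid(diff)) equals log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def mse_loss(predicted, target) -> float:
    """Mean squared error over per-attribute scores."""
    if len(predicted) != len(target) or not target:
        raise ValueError("predicted and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# The ranking loss shrinks as the chosen reward pulls ahead of the rejected one.
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
# The regression loss compares predicted attribute scores against labels.
print(mse_loss([1.0, 2.0], [1.0, 4.0]))
```

The ranking loss depends only on the difference between the two rewards, so it needs chosen/rejected pairs, while the regression loss needs an absolute numeric label for each attribute.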
Step 3: Configure and launch training
Set up the training configuration including the pretrained checkpoint path, data paths, batch sizes, and training hyperparameters. The training script loads the base model with a reward head, initializes distributed training, builds reward model datasets, and runs the SupervisedTrainer loop with comparison-based loss.
What happens:
- The pretrained model is loaded and augmented with a reward prediction head
- The reward head maps hidden states to a scalar reward value
- Training optimizes the model to assign higher rewards to chosen responses
- Validation accuracy tracks the percentage of correctly ranked pairs
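As a toy illustration of the reward head, the sketch below projects a hidden-state vector to one scalar with a linear layer. The function names are hypothetical, and scoring from the final token's hidden state is an assumption made for the example, not something this page states about the implementation.

```python
def reward_head(hidden_state, weight, bias=0.0):
    """Linear projection of one hidden-state vector to a scalar reward."""
    if len(hidden_state) != len(weight):
        raise ValueError("hidden_state and weight must have the same length")
    return sum(h * w for h, w in zip(hidden_state, weight)) + bias


def score_sequence(token_hidden_states, weight, bias=0.0):
    """Score a sequence from the hidden state of its last token (assumption)."""
    return reward_head(token_hidden_states[-1], weight, bias)


# Two tokens with 2-dimensional hidden states; the head weights are made up.
hidden = [[1.0, 0.0], [0.5, 2.0]]
reward = score_sequence(hidden, weight=[2.0, 1.0], bias=0.5)
print(reward)
```

During training, the head weights and the base model are updated so that this scalar is higher for chosen responses than for rejected ones.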
Step 4: Validate reward model quality
Monitor validation accuracy during training to ensure the model learns meaningful preference rankings. The validation accuracy should increase as training progresses. For binary ranking models, accuracy represents the fraction of pairs where the chosen response receives a higher reward than the rejected response.
Key considerations:
- Validation accuracy typically falls between 65% and 75%, depending on data quality
- Overfitting can occur with small datasets; monitor train vs validation metrics
- The reward model should generalize beyond the specific responses it was trained on
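The accuracy metric for binary ranking models can be computed as sketched below. The sketch assumes the rewards arrive as a flat list in the same chosen, rejected order as the data; `pairwise_accuracy` is an illustrative helper, not a library function, and ties are counted as incorrect here.

```python
def pairwise_accuracy(rewards):
    """Fraction of pairs where the chosen reward exceeds the rejected reward.

    `rewards` is a flat list ordered chosen, rejected, chosen, rejected, ...
    """
    if not rewards or len(rewards) % 2 != 0:
        raise ValueError("rewards must contain a non-zero, even number of scores")
    pairs = list(zip(rewards[0::2], rewards[1::2]))
    # A tie is counted as a ranking failure.
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Four pairs: ranked correctly, ranked wrongly, ranked correctly, tie.
val_rewards = [1.0, 0.0, 0.2, 0.9, 3.0, -1.0, 0.5, 0.5]
print(pairwise_accuracy(val_rewards))
```

Computing the same number on the training split makes the overfitting check direct: a widening gap between training and validation accuracy suggests the model is memorizing pairs.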
Step 5: Export for RLHF deployment
After training, the reward model checkpoint (megatron_gpt.nemo) is saved and can be deployed in two ways: as a standalone PyTriton inference server (for PPO with a separate critic) or co-located with the critic model (for combined RM+Critic serving). Which option to use depends on the RLHF architecture chosen.
Key considerations:
- For PPO: the reward model checkpoint is used to initialize the critic network, and the reward model serves rewards
- For REINFORCE: only the reward model server is needed (no critic)
- The serve_reward_model.py script launches the PyTriton inference server
- CPU weight offloading can be used when co-locating RM and critic