Principle:NVIDIA NeMo Aligner Reward Model Architecture Selection
| Principle Metadata | |
|---|---|
| Type | Principle |
| Domains | NLP, Model_Architecture |
| Last Updated | 2026-02-07 00:00 GMT |
| Related Implementation | Implementation:NVIDIA_NeMo_Aligner_Reward_Model_Class_Registry |
Overview
Strategy for choosing between binary ranking and regression reward model architectures based on the alignment objective.
Description
NeMo Aligner supports two reward model architectures:
- Binary Ranking (Bradley-Terry) — Learns a scalar reward from pairwise human preferences using ranking loss.
- Regression — Directly predicts continuous attribute scores (e.g., helpfulness, safety) using mean-squared-error (MSE) loss.
The selection is made via a type registry that maps configuration strings to model classes and their corresponding dataset builders. This architecture decision affects the loss function, data format, and downstream RLHF behavior.
The registry pattern decouples model selection from training code, enabling extensibility without modifying the core training loop.
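The registry described above can be sketched as a plain dictionary mapping configuration strings to (model class, dataset builder) pairs. This is a minimal illustration of the pattern, not NeMo Aligner's actual API; all class and function names below are hypothetical placeholders.

```python
# Minimal sketch of a type registry mapping config strings to model
# classes and their dataset builders. Names are illustrative only,
# not NeMo Aligner's real identifiers.

REWARD_MODEL_REGISTRY = {}

def register_reward_model(name, model_cls, dataset_builder):
    """Associate a config string with a model class and dataset builder."""
    REWARD_MODEL_REGISTRY[name] = (model_cls, dataset_builder)

def resolve_reward_model(name):
    """Look up the (model class, dataset builder) pair for a config string."""
    try:
        return REWARD_MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(
            f"Unknown reward_model_type: {name!r}. "
            f"Registered types: {sorted(REWARD_MODEL_REGISTRY)}"
        )

# Hypothetical stand-ins for the two architectures and their data pipelines.
class BinaryRankingRewardModel: ...
class RegressionRewardModel: ...

def build_pairwise_dataset(path): ...
def build_attribute_dataset(path): ...

register_reward_model("binary_ranking", BinaryRankingRewardModel, build_pairwise_dataset)
register_reward_model("regression", RegressionRewardModel, build_attribute_dataset)
```

Because the training loop only ever calls the resolver, adding a new reward model type is a matter of one more register call; the loop itself never changes.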
Usage
Use binary_ranking when you have pairwise preference data (chosen vs rejected). Use regression when you have multi-attribute continuous labels (e.g., helpfulness score 0–5, safety score 0–5). The choice is configured via model.reward_model_type in the training YAML.
Configuration example:
model:
  reward_model_type: binary_ranking  # or "regression"
Decision criteria:
- Binary Ranking — Best when human annotators provide pairwise comparisons (A is better than B)
- Regression — Best when annotators provide numerical scores on one or more attributes
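The decision criteria above correspond to two different training-record shapes. The records below are purely illustrative; the field names are assumptions for the sake of the example, not NeMo Aligner's actual data schema.

```python
# Illustrative record shapes for the two data formats.
# Field names are hypothetical, chosen only to show the structural
# difference between pairwise-preference and attribute-score data.

# Binary ranking: each record pairs a chosen and a rejected response.
pairwise_example = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Photosynthesis converts light energy into chemical energy ...",
    "rejected": "idk lol",
}

# Regression: each record carries continuous per-attribute scores.
regression_example = {
    "prompt": "Explain photosynthesis.",
    "response": "Photosynthesis converts light energy into chemical energy ...",
    "labels": {"helpfulness": 4.0, "safety": 5.0},  # e.g., 0-5 scales
}
```

If your annotation pipeline produces one shape, that largely decides the architecture for you: converting scores into pairs (or pairs into scores) loses information.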
Theoretical Basis
The two architectures correspond to different loss functions:
Binary Ranking uses the Bradley-Terry loss, where sigma is the logistic sigmoid:
L = -log(sigma(r_chosen - r_rejected))
Regression uses MSE loss:
L = ||r_predicted - r_target||^2
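The two losses above can be computed directly, which makes their behavior easy to check. A small sketch (assuming moderate reward margins, so the exponential does not overflow):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)).

    Small when the chosen response scores well above the rejected one;
    log(2) when the model cannot tell them apart. Uses the identity
    log(sigmoid(x)) = -log(1 + exp(-x)), via log1p for accuracy.
    """
    margin = r_chosen - r_rejected
    return math.log1p(math.exp(-margin))

def mse_loss(predicted, target):
    """Squared error summed over attribute scores (regression objective)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))
```

A quick sanity check: with equal rewards the ranking loss is log(2) ≈ 0.693, and it shrinks as the margin between chosen and rejected grows, which is exactly the pressure that teaches the model to separate preferred responses.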
The registry pattern is also what makes this extensible in practice: new reward model types can be added by registering a new class and dataset builder, without modifying the training loop.