
Principle:NVIDIA NeMo Aligner Reward Model Architecture Selection

From Leeroopedia


Principle Metadata
Type Principle
Domains NLP, Model_Architecture
Last Updated 2026-02-07 00:00 GMT
Related Implementation Implementation:NVIDIA_NeMo_Aligner_Reward_Model_Class_Registry

Overview

A strategy for choosing between the binary ranking and regression reward model architectures, based on the alignment objective.

Description

NeMo Aligner supports two reward model architectures:

  • Binary Ranking (Bradley-Terry) — Learns a scalar reward from pairwise human preferences using ranking loss.
  • Regression — Directly predicts continuous attribute scores (helpfulness, safety, etc.) using MSE loss.

The selection is made via a type registry that maps configuration strings to model classes and their corresponding dataset builders. This architecture decision affects the loss function, data format, and downstream RLHF behavior.

The registry pattern decouples model selection from training code, enabling extensibility without modifying the core training loop.
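The registry pattern described above can be sketched minimally as a dictionary mapping configuration strings to a (model class, dataset builder) pair. The names here (REWARD_MODEL_REGISTRY, get_reward_model_spec, and the placeholder classes and builders) are hypothetical illustrations, not NeMo Aligner's actual identifiers.

```python
# Placeholder model classes and dataset builders for illustration only.
class BinaryRankingRewardModel: ...
class RegressionRewardModel: ...

def build_pairwise_dataset(path): ...
def build_attribute_dataset(path): ...

# The registry maps the config string to (model class, dataset builder),
# so training code can dispatch without hard-coding either choice.
REWARD_MODEL_REGISTRY = {
    "binary_ranking": (BinaryRankingRewardModel, build_pairwise_dataset),
    "regression": (RegressionRewardModel, build_attribute_dataset),
}

def get_reward_model_spec(reward_model_type: str):
    """Look up the model class and dataset builder for a config string."""
    try:
        return REWARD_MODEL_REGISTRY[reward_model_type]
    except KeyError:
        raise ValueError(f"Unknown reward_model_type: {reward_model_type!r}")
```

The training loop only ever calls the lookup function, which is what decouples model selection from training code.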

Usage

Use binary_ranking when you have pairwise preference data (chosen vs rejected). Use regression when you have multi-attribute continuous labels (e.g., helpfulness score 0–5, safety score 0–5). The choice is configured via model.reward_model_type in the training YAML.

Configuration example:

model:
  reward_model_type: binary_ranking  # or "regression"

Decision criteria:

  • Binary Ranking — Best when human annotators provide pairwise comparisons (A is better than B)
  • Regression — Best when annotators provide numerical scores on one or more attributes
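The two decision branches imply different training-record shapes. The field names below are assumptions chosen for illustration, not NeMo Aligner's exact data schema.

```python
# Pairwise preference record (binary ranking): one prompt, a chosen and a
# rejected response, no numeric labels.
pairwise_example = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Photosynthesis converts light energy into chemical energy...",
    "rejected": "idk",
}

# Multi-attribute record (regression): one response with continuous
# per-attribute scores, e.g. on a 0-5 scale.
regression_example = {
    "prompt": "Explain photosynthesis.",
    "response": "Photosynthesis converts light energy into chemical energy...",
    "labels": {"helpfulness": 4.0, "safety": 5.0},
}
```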

Theoretical Basis

The two architectures correspond to different loss functions:

Binary Ranking uses the Bradley-Terry loss:

L = -log(sigma(r_chosen - r_rejected))
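The loss above can be computed directly; a minimal scalar sketch (real training would use a vectorized, numerically stabilized framework implementation):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """L = -log(sigmoid(r_chosen - r_rejected)).

    Small when the chosen reward exceeds the rejected reward by a wide
    margin; equals log(2) when the two rewards are equal.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note the loss depends only on the reward *difference*, which is why a Bradley-Terry model learns rewards up to an additive constant.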

Regression uses MSE loss:

L = ||r_predicted - r_target||^2
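The regression loss is the squared Euclidean distance between the predicted and target attribute vectors; a minimal sketch over plain Python lists:

```python
def mse_loss(r_predicted, r_target):
    """L = ||r_predicted - r_target||^2, summed over attribute dimensions
    (e.g. helpfulness, safety)."""
    return sum((p - t) ** 2 for p, t in zip(r_predicted, r_target))
```

Unlike the ranking loss, this anchors rewards to absolute label values rather than to differences between responses.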

Because the registry decouples model selection from training code, new reward model types can be added by registering a new class and dataset builder without modifying the training loop.
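The extensibility claim can be sketched: adding a new type is a registry entry, not a training-loop change. The helper name, the "process" type, and the placeholder class below are hypothetical illustrations.

```python
# Hypothetical registry and registration helper.
REGISTRY = {}

def register_reward_model(name, model_cls, dataset_builder):
    """Register a new reward model type under a config string."""
    REGISTRY[name] = (model_cls, dataset_builder)

# A new, hypothetical reward model type and its dataset builder.
class ProcessRewardModel: ...
def build_step_dataset(path): ...

# One call makes the new type selectable from the YAML config;
# the training loop itself is untouched.
register_reward_model("process", ProcessRewardModel, build_step_dataset)
```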
