
Principle:NVIDIA NeMo Aligner Reward Model Architecture Selection

From Leeroopedia


Principle Metadata
Type Principle
Domains NLP, Model_Architecture
Last Updated 2026-02-07 00:00 GMT
Related Implementation Implementation:NVIDIA_NeMo_Aligner_Reward_Model_Class_Registry

Overview

A strategy for choosing between the binary ranking and regression reward model architectures, based on the alignment objective.

Description

NeMo Aligner supports two reward model architectures:

  • Binary Ranking (Bradley-Terry) — Learns a scalar reward from pairwise human preferences using ranking loss.
  • Regression — Directly predicts continuous attribute scores (helpfulness, safety, etc.) using MSE loss.

The selection is made via a type registry that maps configuration strings to model classes and their corresponding dataset builders. This architecture decision affects the loss function, data format, and downstream RLHF behavior.

The registry pattern decouples model selection from training code, enabling extensibility without modifying the core training loop.
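The registry pattern described above can be sketched minimally as a dictionary mapping configuration strings to a (model class, dataset builder) pair. The names here (REWARD_MODEL_REGISTRY, get_reward_model_spec, and the placeholder classes and builders) are hypothetical illustrations, not NeMo Aligner's actual identifiers.

```python
# Placeholder model classes and dataset builders for illustration only.
class BinaryRankingRewardModel: ...
class RegressionRewardModel: ...

def build_pairwise_dataset(path): ...
def build_attribute_dataset(path): ...

# The registry maps the config string to (model class, dataset builder),
# so training code can dispatch without hard-coding either choice.
REWARD_MODEL_REGISTRY = {
    "binary_ranking": (BinaryRankingRewardModel, build_pairwise_dataset),
    "regression": (RegressionRewardModel, build_attribute_dataset),
}

def get_reward_model_spec(reward_model_type: str):
    """Look up the model class and dataset builder for a config string."""
    try:
        return REWARD_MODEL_REGISTRY[reward_model_type]
    except KeyError:
        raise ValueError(f"Unknown reward_model_type: {reward_model_type!r}")
```

The training loop only ever calls the lookup function, which is what decouples model selection from training code.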

Usage

Use binary_ranking when you have pairwise preference data (chosen vs rejected). Use regression when you have multi-attribute continuous labels (e.g., helpfulness score 0–5, safety score 0–5). The choice is configured via model.reward_model_type in the training YAML.

Configuration example:

model:
  reward_model_type: binary_ranking  # or "regression"

Decision criteria:

  • Binary Ranking — Best when human annotators provide pairwise comparisons (A is better than B)
  • Regression — Best when annotators provide numerical scores on one or more attributes
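The two decision branches imply different training-record shapes. The field names below are assumptions chosen for illustration, not NeMo Aligner's exact data schema.

```python
# Pairwise preference record (binary ranking): one prompt, a chosen and a
# rejected response, no numeric labels.
pairwise_example = {
    "prompt": "Explain photosynthesis.",
    "chosen": "Photosynthesis converts light energy into chemical energy...",
    "rejected": "idk",
}

# Multi-attribute record (regression): one response with continuous
# per-attribute scores, e.g. on a 0-5 scale.
regression_example = {
    "prompt": "Explain photosynthesis.",
    "response": "Photosynthesis converts light energy into chemical energy...",
    "labels": {"helpfulness": 4.0, "safety": 5.0},
}
```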

Theoretical Basis

The two architectures correspond to different loss functions:

Binary Ranking uses the Bradley-Terry loss:

L = -log(sigma(r_chosen - r_rejected))
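The loss above can be computed directly; a minimal scalar sketch (real training would use a vectorized, numerically stabilized framework implementation):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """L = -log(sigmoid(r_chosen - r_rejected)).

    Small when the chosen reward exceeds the rejected reward by a wide
    margin; equals log(2) when the two rewards are equal.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Note the loss depends only on the reward *difference*, which is why a Bradley-Terry model learns rewards up to an additive constant.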

Regression uses MSE loss:

L = ||r_predicted - r_target||^2
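The regression loss is the squared Euclidean distance between the predicted and target attribute vectors; a minimal sketch over plain Python lists:

```python
def mse_loss(r_predicted, r_target):
    """L = ||r_predicted - r_target||^2, summed over attribute dimensions
    (e.g. helpfulness, safety)."""
    return sum((p - t) ** 2 for p, t in zip(r_predicted, r_target))
```

Unlike the ranking loss, this anchors rewards to absolute label values rather than to differences between responses.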

Because the registry decouples model selection from training code, new reward model types can be added by registering a new class and dataset builder without modifying the training loop.
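The extensibility claim can be sketched: adding a new type is a registry entry, not a training-loop change. The helper name, the "process" type, and the placeholder class below are hypothetical illustrations.

```python
# Hypothetical registry and registration helper.
REGISTRY = {}

def register_reward_model(name, model_cls, dataset_builder):
    """Register a new reward model type under a config string."""
    REGISTRY[name] = (model_cls, dataset_builder)

# A new, hypothetical reward model type and its dataset builder.
class ProcessRewardModel: ...
def build_step_dataset(path): ...

# One call makes the new type selectable from the YAML config;
# the training loop itself is untouched.
register_reward_model("process", ProcessRewardModel, build_step_dataset)
```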
