Principle:ContextualAI HALOs Reward Model Configuration

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

A model architecture pattern that adds a binary classification head to a pre-trained language model for learning to score text quality as a reward signal.

Description

Reward model configuration converts a standard causal language model into a sequence classifier that scores a single prompt-response sequence. Comparing the scores of two responses then predicts which one a human would prefer. The key architectural change is replacing the language modeling head with a binary classification head (num_labels=2) that outputs two logits per sequence.

In the Bradley-Terry framework, the reward for a response is taken as the logit for the positive class (index 1). The model is trained on paired preferences where the chosen response should receive a higher score than the rejected response.
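The pairwise objective described above can be sketched in a few lines. Under the standard Bradley-Terry model, the probability that the chosen response beats the rejected one is sigmoid(r_chosen - r_rejected), and training minimizes its negative log. This is a minimal pure-Python illustration; the function names are hypothetical, and the exact loss used in HALOs may differ in details such as batching or reduction.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response is preferred."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modelled probability that the chosen response beats the rejected one."""
    return math.exp(-bradley_terry_loss(reward_chosen, reward_rejected))
```

The loss depends only on the difference between the two rewards, so it shrinks as the chosen response's positive-class logit rises above the rejected response's.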

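This configuration principle governs how the model is initialized and how the classification head relates to the pre-trained backbone. The padding token must be explicitly configured because sequence classifiers depend on it in a way generative models do not: to produce one score per sequence, the classifier typically has to locate the last non-padding token in each padded input.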
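As a minimal sketch of this configuration, assuming a Hugging Face Transformers-style sequence classifier (the source does not state which classes HALOs uses), the snippet below builds a tiny, randomly initialized GPT-2 classifier with num_labels=2 and an explicit pad_token_id. The GPT-2 backbone, the tiny dimensions, and the token ids are illustrative assumptions chosen so the example runs offline. A real reward model would load pre-trained weights instead.

```python
import torch
from transformers import GPT2Config, GPT2ForSequenceClassification

PAD_ID = 0  # hypothetical pad token id, chosen only for this sketch

# Tiny random backbone so the sketch needs no download (assumption:
# a real setup would start from a pre-trained checkpoint).
config = GPT2Config(
    vocab_size=100,
    n_positions=32,
    n_embd=16,
    n_layer=1,
    n_head=2,
    num_labels=2,         # binary classification head: two logits per sequence
    pad_token_id=PAD_ID,  # lets the classifier find the last real token
)
model = GPT2ForSequenceClassification(config)
model.eval()  # disable dropout for deterministic scoring

# Two right-padded prompt+response sequences of different lengths.
input_ids = torch.tensor([[5, 6, 7, 8], [9, 10, PAD_ID, PAD_ID]])
attention_mask = (input_ids != PAD_ID).long()

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

rewards = logits[:, 1]  # positive-class logit used as the scalar reward
```

In Transformers, a classifier of this kind without a pad token id cannot tell where a padded sequence ends and rejects batches larger than one, which is why the pad token is set in the model config rather than only in the tokenizer.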

Usage

Use this configuration when initializing a new reward model for Bradley-Terry training. The resulting model is used as a reward scorer in the online iterative alignment loop (feedback labeling step).

Theoretical Basis

The reward model architecture maps a sequence to a scalar reward:

$r_\phi(x, y) = f_\phi([x; y])_1$

Here $f_\phi$ is a sequence classification model with num_labels=2, $[x; y]$ is the prompt x concatenated with the response y, and the subscript 1 selects the positive-class logit. The model runs the concatenated prompt-response sequence through the transformer backbone, pools the final-layer hidden states into a single vector, and passes that vector through the classification head to obtain the two logits.
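The mapping from hidden states to a scalar reward can be sketched without any framework. The example below assumes last-non-padding-token pooling and a bias-free linear head, which is common for causal-LM classifiers but is not confirmed by the source. The function name and the toy numbers are hypothetical.

```python
from typing import List, Sequence


def reward_from_hidden_states(
    hidden_states: Sequence[Sequence[float]],  # one vector per token position
    attention_mask: Sequence[int],             # 1 for real tokens, 0 for padding
    head_weights: Sequence[Sequence[float]],   # shape (2, hidden_dim), no bias
) -> float:
    """Pool the last non-padding token and return the positive-class logit."""
    # Assumption: the sequence representation is the last real token's state.
    last_real = max(i for i, m in enumerate(attention_mask) if m == 1)
    pooled = hidden_states[last_real]
    # Linear classification head: one dot product per class.
    logits: List[float] = [
        sum(w * h for w, h in zip(row, pooled)) for row in head_weights
    ]
    return logits[1]  # subscript 1 selects the positive-class logit


# Toy example: three positions, the last one is padding and must be ignored.
toy_hidden = [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]]
toy_mask = [1, 1, 0]
toy_head = [[1.0, 1.0], [0.5, -1.0]]
toy_reward = reward_from_hidden_states(toy_hidden, toy_mask, toy_head)
```

Here the padded position is skipped, the pooled vector is the second token's state, and the reward is the second row of the head applied to it.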

The binary classification setup (rather than a single regression output) is intended to improve training dynamics, and it lets the model express uncertainty through the relative magnitude of the two logits.

Related Pages

Implemented By
