Principle:ContextualAI HALOs Reward Model Configuration
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A model architecture pattern that adds a binary classification head to a pre-trained language model so that it learns to score text quality as a scalar reward signal.
Description
Reward model configuration converts a standard causal language model into a sequence classifier that predicts which of two responses a human would prefer. The key architectural change is replacing the language modeling head with a binary classification head (num_labels=2) that outputs two logits per sequence.
In the Bradley-Terry framework, the reward for a response is taken as the logit for the positive class (index 1). The model is trained on paired preferences where the chosen response should receive a higher score than the rejected response.
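The Bradley-Terry preference probability and the resulting pairwise training loss can be sketched in plain Python (function names are illustrative, not from any specific library):

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(bradley_terry_prob(reward_chosen, reward_rejected))

# Equal rewards give P = 0.5; training pushes the chosen reward above
# the rejected one, which drives the loss toward zero.
```

Minimizing this loss over paired preferences is exactly what pushes the chosen response's score above the rejected response's score.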
This configuration principle governs how the model is initialized and how the classification head relates to the pre-trained backbone. The padding token must be explicitly configured, since classification models use it differently from generative models.
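A minimal initialization sketch using Hugging Face transformers; the base checkpoint name is a placeholder assumption, not something prescribed by this principle:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "meta-llama/Llama-3.2-1B"  # placeholder backbone; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(base)
# Classification models require an explicit pad token; a common convention
# is to reuse the EOS token when the tokenizer defines none.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Replaces the LM head with a freshly initialized 2-logit classification head.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
```

The classification head is randomly initialized on top of the pre-trained backbone, so it must be trained on preference pairs before its scores are meaningful.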
Usage
Use this configuration when initializing a new reward model for Bradley-Terry training. The resulting model is used as a reward scorer in the online iterative alignment loop (feedback labeling step).
Theoretical Basis
The reward model architecture maps a concatenated prompt-response sequence to a scalar reward:

r(x, y) = f_θ([x; y])_1

where f_θ is a sequence classification model with num_labels=2 and the subscript 1 selects the positive-class logit. The model processes the concatenated prompt-response sequence through the transformer backbone and pools the final hidden state through the classification head.
The binary classification setup (rather than a single regression output) provides better training dynamics and allows the model to express uncertainty through the relative magnitude of the two logits.
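Given the two logits per sequence, the reward is the positive-class logit, and a softmax over the pair yields an interpretable confidence. A small illustration in plain Python (the logit values are made up):

```python
import math

def reward_and_confidence(logits: list[float]) -> tuple[float, float]:
    """logits = [negative_class, positive_class].
    Reward is the positive-class logit; confidence is its softmax probability,
    which depends only on the difference between the two logits."""
    reward = logits[1]
    shifted = [l - max(logits) for l in logits]  # numerically stable softmax
    exps = [math.exp(s) for s in shifted]
    confidence = exps[1] / sum(exps)
    return reward, confidence

r, c = reward_and_confidence([-0.4, 1.1])
# reward = 1.1; confidence = sigmoid(1.1 - (-0.4)) ≈ 0.82
```

With two logits, the softmax reduces to a sigmoid of the logit gap, which is how the model expresses uncertainty through the relative magnitude of the two outputs.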