Principle:ContextualAI HALOs Reward Model Configuration
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
A model architecture pattern that adds a binary classification head to a pre-trained language model so that it learns to score text quality as a scalar reward signal.
Description
Reward model configuration converts a standard causal language model into a sequence classifier that predicts which of two responses a human would prefer. The key architectural change is replacing the language modeling head with a binary classification head (num_labels=2) that outputs two logits per sequence.
In the Bradley-Terry framework, the reward for a response is taken as the logit for the positive class (index 1). The model is trained on paired preferences where the chosen response should receive a higher score than the rejected response.
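The Bradley-Terry preference probability and the resulting pairwise training loss can be sketched in plain Python (function names are illustrative, not from any specific library):

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference."""
    return -math.log(bradley_terry_prob(reward_chosen, reward_rejected))

# Equal rewards give P = 0.5; training pushes the chosen reward above
# the rejected one, which drives the loss toward zero.
```

Minimizing this loss over paired preferences is exactly what pushes the chosen response's score above the rejected response's score.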
This configuration principle governs how the model is initialized and how the classification head relates to the pre-trained backbone. The padding token must be explicitly configured, since classification models use it differently from generative models.
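A minimal initialization sketch using Hugging Face transformers; the base checkpoint name is a placeholder assumption, not something prescribed by this principle:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "meta-llama/Llama-3.2-1B"  # placeholder backbone; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(base)
# Classification models require an explicit pad token; a common convention
# is to reuse the EOS token when the tokenizer defines none.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Replaces the LM head with a freshly initialized 2-logit classification head.
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id
```

The classification head is randomly initialized on top of the pre-trained backbone, so it must be trained on preference pairs before its scores are meaningful.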
Usage
Use this configuration when initializing a new reward model for Bradley-Terry training. The resulting model is used as a reward scorer in the online iterative alignment loop (feedback labeling step).
Theoretical Basis
The reward model architecture maps a concatenated prompt-response sequence to a scalar reward:

r(x, y) = f_θ([x; y])_1

where f_θ is a sequence classification model with num_labels=2 and the subscript 1 selects the positive-class logit. The model processes the concatenated prompt-response sequence through the transformer backbone and pools the final hidden state through the classification head.
The binary classification setup (rather than a single regression output) provides better training dynamics and allows the model to express uncertainty through the relative magnitude of the two logits.
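Given the two logits per sequence, the reward is the positive-class logit, and a softmax over the pair yields an interpretable confidence. A small illustration in plain Python (the logit values are made up):

```python
import math

def reward_and_confidence(logits: list[float]) -> tuple[float, float]:
    """logits = [negative_class, positive_class].
    Reward is the positive-class logit; confidence is its softmax probability,
    which depends only on the difference between the two logits."""
    reward = logits[1]
    shifted = [l - max(logits) for l in logits]  # numerically stable softmax
    exps = [math.exp(s) for s in shifted]
    confidence = exps[1] / sum(exps)
    return reward, confidence

r, c = reward_and_confidence([-0.4, 1.1])
# reward = 1.1; confidence = sigmoid(1.1 - (-0.4)) ≈ 0.82
```

With two logits, the softmax reduces to a sigmoid of the logit gap, which is how the model expresses uncertainty through the relative magnitude of the two outputs.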