
Implementation:ContextualAI HALOs BradleyTerryTrainer Train

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

Concrete tool, provided by the BradleyTerryTrainer class, for training a Bradley-Terry reward model on paired preferences.

Description

BradleyTerryTrainer extends PairedPreferenceTrainer to train a reward model using binary cross-entropy on score differences. Key differences from alignment trainers:

  • Uses AutoModelForBradleyTerry as the policy model class (sequence classifier, not causal LM)
  • Does not use a reference model (use_reference_model = False)
  • The forward() method returns logits (not log probabilities), split into chosen and rejected
  • The loss() method computes BCE(chosen_score - rejected_score, 1) where scores are the positive-class logits
  • Reports reward accuracy as the primary evaluation metric
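The loss in the list above can be sketched in a few lines of PyTorch. This is an illustrative standalone function, not the trainer's actual internals; the variable names and the (microbatch_size, 2) logit shapes follow the signature documented below.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_logits, rejected_logits):
    """Sketch of BCE(chosen_score - rejected_score, 1).

    Scores are the positive-class logits (logits[:, 1]) of a
    binary sequence classifier; both inputs are (microbatch_size, 2).
    """
    chosen_scores = chosen_logits[:, 1]
    rejected_scores = rejected_logits[:, 1]
    margins = chosen_scores - rejected_scores
    # BCE against a target of 1 reduces to -log(sigmoid(margin))
    losses = F.binary_cross_entropy_with_logits(
        margins, torch.ones_like(margins), reduction="none"
    )
    return losses, chosen_scores, rejected_scores

# Toy microbatch of two preference pairs
chosen = torch.tensor([[0.1, 2.0], [0.0, 1.5]])
rejected = torch.tensor([[0.2, 0.5], [0.1, 1.0]])
losses, chosen_scores, rejected_scores = bradley_terry_loss(chosen, rejected)
```

A larger score margin for the chosen completion drives the per-example loss toward zero, which is what pushes the classifier head to rank chosen above rejected.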

Usage

Invoke via accelerate launch launch.py loss=bradley-terry model=llama datasets=[ultrabin].

Code Reference

Source Location

Signature

class BradleyTerryTrainer(PairedPreferenceTrainer):
    policy_hf_model_class = AutoModelForBradleyTerry
    use_reference_model = False

    def forward(
        self,
        model: AutoModelForBradleyTerry,
        batch: Dict[str, Union[List, torch.LongTensor]]
    ) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
        """Get logits for chosen and rejected examples.

        Returns:
            chosen_logits: (microbatch_size, 2)
            rejected_logits: (microbatch_size, 2)
        """

    def loss(
        self,
        batch: Dict,
        policy_chosen_logits: torch.FloatTensor,
        policy_rejected_logits: torch.FloatTensor,
        *args
    ) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
        """Bradley-Terry loss: BCE(chosen_score - rejected_score, 1).

        Scores are logits[:, 1] (positive class logit).

        Returns:
            losses, chosen_scores, rejected_scores
        """

    def get_batch_metrics(
        self,
        batch: Dict[str, Union[List, torch.LongTensor]],
        mode: str = 'train'
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute loss and metrics including reward accuracy."""

Import

from train.trainers import BradleyTerryTrainer
# Or invoke via CLI:
# accelerate launch launch.py loss=bradley-terry model=llama datasets=[ultrabin]

I/O Contract

Inputs

Name Type Required Description
config DictConfig Yes Hydra config with loss=bradley-terry
model AutoModelForBradleyTerry Yes Sequence classifier with binary head
train_dataset PairedPreferenceDataLoader Yes Iterator producing chosen/rejected pairs
eval_dataset PairedPreferenceDataLoader No Evaluation data for reward accuracy

Outputs

Name Type Description
Trained reward model Directory Saved to {cache_dir}/{exp_name}/FINAL/
Reward accuracy float Fraction of eval pairs where chosen_score > rejected_score
Training metrics Dict Loss, chosen/rejected scores, margins, accuracy per step
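The reward-accuracy output in the table above is a simple comparison over eval pairs. A minimal sketch (the function name and inputs are illustrative, not the trainer's API):

```python
import torch

def reward_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen completion outscores the rejected one."""
    return (chosen_scores > rejected_scores).float().mean().item()

# 3 of 4 toy pairs rank the chosen completion higher
acc = reward_accuracy(torch.tensor([1.2, 0.3, 2.0, -0.5]),
                      torch.tensor([0.8, 0.9, 1.0, -1.0]))
```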

Usage Examples

Train Bradley-Terry Reward Model

accelerate launch \
    --config_file accelerate_config/fsdp_4gpu.yaml \
    launch.py \
    loss=bradley-terry \
    model=llama \
    datasets=[ultrabin] \
    exp_name=llama3-8B-bt \
    ++model.name_or_path=meta-llama/Meta-Llama-3-8B \
    ++cache_dir=/models

Use Trained Reward Model for Labeling

# After training, use the reward model to label sampled completions
accelerate launch -m train.label \
    --reward_model_path /models/llama3-8B-bt/FINAL \
    --feedback_type pairwise \
    samples.json feedback.json

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
