
Implementation:ContextualAI HALOs BradleyTerryTrainer Train

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, NLP, Reinforcement_Learning
Last Updated 2026-02-08 03:00 GMT

Overview

Concrete tool, provided by the BradleyTerryTrainer class, for training a Bradley-Terry reward model on paired preferences.

Description

BradleyTerryTrainer extends PairedPreferenceTrainer to train a reward model using binary cross-entropy on score differences. Key differences from alignment trainers:

  • Uses AutoModelForBradleyTerry as the policy model class (sequence classifier, not causal LM)
  • Does not use a reference model (use_reference_model = False)
  • The forward() method returns logits (not log probabilities), split into chosen and rejected
  • The loss() method computes BCE(chosen_score - rejected_score, 1) where scores are the positive-class logits
  • Reports reward accuracy as the primary evaluation metric
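The loss in the list above can be sketched in a few lines of PyTorch. This is an illustrative standalone function, not the trainer's actual internals; the variable names and the (microbatch_size, 2) logit shapes follow the signature documented below.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_logits, rejected_logits):
    """Sketch of BCE(chosen_score - rejected_score, 1).

    Scores are the positive-class logits (logits[:, 1]) of a
    binary sequence classifier; both inputs are (microbatch_size, 2).
    """
    chosen_scores = chosen_logits[:, 1]
    rejected_scores = rejected_logits[:, 1]
    margins = chosen_scores - rejected_scores
    # BCE against a target of 1 reduces to -log(sigmoid(margin))
    losses = F.binary_cross_entropy_with_logits(
        margins, torch.ones_like(margins), reduction="none"
    )
    return losses, chosen_scores, rejected_scores

# Toy microbatch of two preference pairs
chosen = torch.tensor([[0.1, 2.0], [0.0, 1.5]])
rejected = torch.tensor([[0.2, 0.5], [0.1, 1.0]])
losses, chosen_scores, rejected_scores = bradley_terry_loss(chosen, rejected)
```

A larger score margin for the chosen completion drives the per-example loss toward zero, which is what pushes the classifier head to rank chosen above rejected.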

Usage

Invoke via accelerate launch launch.py loss=bradley-terry model=llama datasets=[ultrabin].

Code Reference

Source Location

Signature

class BradleyTerryTrainer(PairedPreferenceTrainer):
    policy_hf_model_class = AutoModelForBradleyTerry
    use_reference_model = False

    def forward(
        self,
        model: AutoModelForBradleyTerry,
        batch: Dict[str, Union[List, torch.LongTensor]]
    ) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
        """Get logits for chosen and rejected examples.

        Returns:
            chosen_logits: (microbatch_size, 2)
            rejected_logits: (microbatch_size, 2)
        """

    def loss(
        self,
        batch: Dict,
        policy_chosen_logits: torch.FloatTensor,
        policy_rejected_logits: torch.FloatTensor,
        *args
    ) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
        """Bradley-Terry loss: BCE(chosen_score - rejected_score, 1).

        Scores are logits[:, 1] (positive class logit).

        Returns:
            losses, chosen_scores, rejected_scores
        """

    def get_batch_metrics(
        self,
        batch: Dict[str, Union[List, torch.LongTensor]],
        mode: str = 'train'
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute loss and metrics including reward accuracy."""

Import

from train.trainers import BradleyTerryTrainer
# Or invoke via CLI:
# accelerate launch launch.py loss=bradley-terry model=llama datasets=[ultrabin]

I/O Contract

Inputs

Name Type Required Description
config DictConfig Yes Hydra config with loss=bradley-terry
model AutoModelForBradleyTerry Yes Sequence classifier with binary head
train_dataset PairedPreferenceDataLoader Yes Iterator producing chosen/rejected pairs
eval_dataset PairedPreferenceDataLoader No Evaluation data for reward accuracy

Outputs

Name Type Description
Trained reward model Directory Saved to {cache_dir}/{exp_name}/FINAL/
Reward accuracy float Fraction of eval pairs where chosen_score > rejected_score
Training metrics Dict Loss, chosen/rejected scores, margins, accuracy per step
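The reward-accuracy output in the table above is a simple comparison over eval pairs. A minimal sketch (the function name and inputs are illustrative, not the trainer's API):

```python
import torch

def reward_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen completion outscores the rejected one."""
    return (chosen_scores > rejected_scores).float().mean().item()

# 3 of 4 toy pairs rank the chosen completion higher
acc = reward_accuracy(torch.tensor([1.2, 0.3, 2.0, -0.5]),
                      torch.tensor([0.8, 0.9, 1.0, -1.0]))
```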

Usage Examples

Train Bradley-Terry Reward Model

accelerate launch \
    --config_file accelerate_config/fsdp_4gpu.yaml \
    launch.py \
    loss=bradley-terry \
    model=llama \
    datasets=[ultrabin] \
    exp_name=llama3-8B-bt \
    ++model.name_or_path=meta-llama/Meta-Llama-3-8B \
    ++cache_dir=/models

Use Trained Reward Model for Labeling

# After training, use the reward model to label sampled completions
accelerate launch -m train.label \
    --reward_model_path /models/llama3-8B-bt/FINAL \
    --feedback_type pairwise \
    samples.json feedback.json

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
