Implementation:ContextualAI HALOs BradleyTerryTrainer Train
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, NLP, Reinforcement_Learning |
| Last Updated | 2026-02-08 03:00 GMT |
Overview
The BradleyTerryTrainer class trains a Bradley-Terry reward model on paired preference data.
Description
BradleyTerryTrainer extends PairedPreferenceTrainer to train a reward model using binary cross-entropy on score differences. Key differences from alignment trainers:
- Uses `AutoModelForBradleyTerry` as the policy model class (a sequence classifier, not a causal LM)
- Does not use a reference model (`use_reference_model = False`)
- The `forward()` method returns logits (not log probabilities), split into chosen and rejected
- The `loss()` method computes `BCE(chosen_score - rejected_score, 1)`, where scores are the positive-class logits
- Reports reward accuracy as the primary evaluation metric
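The loss described above can be sketched in a few lines. This is a minimal standalone illustration, not the trainer's actual code; it assumes 2-column classifier logits where column 1 is the positive class:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_logits: torch.FloatTensor,
                       rejected_logits: torch.FloatTensor) -> torch.FloatTensor:
    # Score each sequence with its positive-class logit (column 1).
    chosen_scores = chosen_logits[:, 1]
    rejected_scores = rejected_logits[:, 1]
    # BCE(chosen - rejected, 1) is equivalent to -log(sigmoid(chosen - rejected)).
    return F.binary_cross_entropy_with_logits(
        chosen_scores - rejected_scores,
        torch.ones_like(chosen_scores),
    )
```

Widening the margin `chosen_score - rejected_score` drives the loss toward zero, so the model learns to score preferred completions higher.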
Usage
Invoke via `accelerate launch launch.py loss=bradley-terry model=llama datasets=[ultrabin]`.
Code Reference
Source Location
- Repository: ContextualAI/HALOs
- File: train/trainers.py
- Lines: L1541-1631
Signature
```python
class BradleyTerryTrainer(PairedPreferenceTrainer):
    policy_hf_model_class = AutoModelForBradleyTerry
    use_reference_model = False

    def forward(
        self,
        model: AutoModelForBradleyTerry,
        batch: Dict[str, Union[List, torch.LongTensor]]
    ) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
        """Get logits for chosen and rejected examples.

        Returns:
            chosen_logits: (microbatch_size, 2)
            rejected_logits: (microbatch_size, 2)
        """

    def loss(
        self,
        batch: Dict,
        policy_chosen_logits: torch.FloatTensor,
        policy_rejected_logits: torch.FloatTensor,
        *args
    ) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
        """Bradley-Terry loss: BCE(chosen_score - rejected_score, 1).

        Scores are logits[:, 1] (positive-class logit).

        Returns:
            losses, chosen_scores, rejected_scores
        """

    def get_batch_metrics(
        self,
        batch: Dict[str, Union[List, torch.LongTensor]],
        mode: str = 'train'
    ) -> Tuple[torch.Tensor, Dict]:
        """Compute loss and metrics including reward accuracy."""
```
Import
```python
from train.trainers import BradleyTerryTrainer

# Or invoke via CLI:
# accelerate launch launch.py loss=bradley-terry model=llama datasets=[ultrabin]
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | DictConfig | Yes | Hydra config with loss=bradley-terry |
| model | AutoModelForBradleyTerry | Yes | Sequence classifier with binary head |
| train_dataset | PairedPreferenceDataLoader | Yes | Iterator producing chosen/rejected pairs |
| eval_dataset | PairedPreferenceDataLoader | No | Evaluation data for reward accuracy |
Outputs
| Name | Type | Description |
|---|---|---|
| Trained reward model | Directory | Saved to {cache_dir}/{exp_name}/FINAL/ |
| Reward accuracy | float | Fraction of eval pairs where chosen_score > rejected_score |
| Training metrics | Dict | Loss, chosen/rejected scores, margins, accuracy per step |
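The reward accuracy metric above is straightforward to compute from the two score vectors; a minimal sketch (the helper name is ours, not the repo's):

```python
import torch

def reward_accuracy(chosen_scores: torch.Tensor,
                    rejected_scores: torch.Tensor) -> float:
    # Fraction of pairs where the chosen completion outscores the rejected one.
    return (chosen_scores > rejected_scores).float().mean().item()
```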
Usage Examples
Train Bradley-Terry Reward Model
```shell
accelerate launch \
  --config_file accelerate_config/fsdp_4gpu.yaml \
  launch.py \
  loss=bradley-terry \
  model=llama \
  datasets=[ultrabin] \
  exp_name=llama3-8B-bt \
  ++model.name_or_path=meta-llama/Meta-Llama-3-8B \
  ++cache_dir=/models
```
Use Trained Reward Model for Labeling
```shell
# After training, use the reward model to label sampled completions
accelerate launch -m train.label \
  --reward_model_path /models/llama3-8B-bt/FINAL \
  --feedback_type pairwise \
  samples.json feedback.json
```
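Conceptually, pairwise labeling picks a preferred and a dispreferred completion from reward-model scores. A toy illustration of that selection rule (this function is hypothetical, not the repo's `train.label` implementation):

```python
def label_pair(completions, scores):
    # Highest-scoring completion becomes "chosen", lowest becomes "rejected".
    best = max(range(len(scores)), key=scores.__getitem__)
    worst = min(range(len(scores)), key=scores.__getitem__)
    return completions[best], completions[worst]

label_pair(["a", "b", "c"], [0.1, 0.9, 0.5])  # -> ("b", "a")
```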
Related Pages
Implements Principle
Requires Environment
Uses Heuristic