
Workflow:ContextualAI HALOs Reward Model Training

From Leeroopedia


Knowledge Sources
Domains LLMs, Reward_Modeling, Alignment, LLM_Ops
Last Updated 2026-02-08 03:00 GMT

Overview

End-to-end process for training a Bradley-Terry reward model on pairwise preference data, producing a model that can score (prompt, response) pairs for use in alignment pipelines.

Description

This workflow trains a reward model using the Bradley-Terry framework, in which the model learns to assign higher scores to preferred responses than to rejected ones, given pairwise comparisons. The resulting reward model can be used as a labeler in the Online Iterative Alignment workflow or for scoring model outputs during evaluation. The reward model is built on top of a language model backbone using AutoModelForBradleyTerry, which adds a classification head that outputs reward scores.

Goals:

  • Train a model that reliably distinguishes preferred from rejected responses
  • Produce a reward model checkpoint usable by the labeling script train.label
  • Enable automated feedback generation for online alignment loops

Scope:

  • From pairwise preference data to a saved reward model checkpoint
  • Uses the same training infrastructure (Hydra, Accelerate, FSDP) as alignment training

Strategy:

  • Uses BradleyTerryTrainer with PairedPreferenceDataLoader
  • The model learns a scalar reward function from pairwise comparisons
  • No reference model is needed (use_reference_model=false in the trainer)
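The Bradley-Terry objective maximizes the likelihood of the observed preference: P(chosen ≻ rejected) = σ(r_chosen − r_rejected). A cross-entropy over two logits, one per response, reduces to exactly this expression for a single pair. A minimal sketch in plain Python (illustrative only, not the HALOs implementation):

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry."""
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    margin = reward_chosen - reward_rejected
    prob_chosen = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(prob_chosen)
```

Note that the loss depends only on the score margin, so rewards are identifiable only up to a constant shift.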

Usage

Execute this workflow when you need a reward model to score LLM outputs for use in online alignment or to create labeled datasets from unlabeled samples. The reward model is trained on pairwise preference data where each example has a prompt with a chosen and a rejected response. Common input datasets include UltraFeedback and SHP.

Execution Steps

Step 1: Prepare_Preference_Data

Gather or select a pairwise preference dataset where each example contains a prompt, a chosen response, and a rejected response. HALOs supports built-in pairwise datasets (UltraFeedback, SHP, HH) or custom JSON files following the pairwise feedback schema.

Key considerations:

  • Data must be in pairwise format (prompt, chosen output, rejected output)
  • The PairedPreferenceDataLoader handles tokenization and batching
  • Dataset quality directly impacts reward model reliability
  • Multiple datasets can be combined by passing a list of names
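A custom pairwise record might look like the following. This is an illustrative example only; the field names are assumptions, not the exact HALOs pairwise feedback schema, so consult the repo's dataloader for the authoritative format.

```python
import json

# Illustrative pairwise preference record (field names are assumptions,
# not the verified HALOs schema).
example = {
    "prompt": "Explain why the sky is blue in one sentence.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue "
              "wavelengths scatter the most.",
    "rejected": "Because the ocean reflects onto it.",
}

# Serialize as one JSON record, as a custom JSON dataset file would contain
record = json.dumps(example)
```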

Step 2: Configure_Training

Set up the Hydra configuration for Bradley-Terry training. The loss config specifies BradleyTerryTrainer and PairedPreferenceDataLoader. Choose the base model architecture (e.g., Llama, Mistral) and set hyperparameters.

What happens:

  • Select loss=bradley-terry in the launch command
  • Choose the model config (model=llama, model=mistral, etc.)
  • The AutoModelForBradleyTerry wrapper adds a two-class classification head to the base language model
  • No reference model is loaded since Bradley-Terry training does not use one
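The two-class head can be pictured as a linear projection from a final hidden state to two logits, with the scalar reward read off from the logits. The sketch below is a toy illustration under assumptions; the real AutoModelForBradleyTerry wrapper may pool hidden states and extract the score differently.

```python
# Toy two-class reward head in plain Python (illustrative, not the HALOs code).
def reward_head(hidden, weights, bias):
    """Linear projection: final hidden state (list of floats) -> two logits."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

def reward_score(logits):
    """One common convention (an assumption here): the reward is the
    'preferred' class logit."""
    return logits[1]
```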

Step 3: Train_Reward_Model

Launch the training job using Accelerate with FSDP. The BradleyTerryTrainer computes a cross-entropy loss over the pairwise preferences, training the model to assign higher logits to chosen responses. The training loop uses the standard infrastructure shared with all alignment trainers.

What happens:

  • The base language model is loaded and wrapped with AutoModelForBradleyTerry
  • PairedPreferenceDataLoader tokenizes chosen and rejected responses
  • The trainer computes reward scores for both responses and optimizes the Bradley-Terry loss
  • Evaluation periodically checks reward accuracy on a held-out set
  • Training metrics are logged to Weights and Biases
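The reward accuracy checked during evaluation is simply the fraction of held-out pairs where the chosen response outscores the rejected one. A minimal sketch:

```python
def reward_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen response gets the higher reward."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)
```

An untrained model sits near 0.5 on balanced pairs; accuracy well above that indicates the model has learned the preference signal.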

Step 4: Save_And_Validate

After training completes, the reward model checkpoint is saved. The model can be validated by checking its accuracy on test data or by using it to label a set of sample outputs and inspecting the score distribution.

Key considerations:

  • The checkpoint is saved to cache_dir/exp_name/FINAL
  • The saved model path is used as the --reward_model_path argument in train.label
  • Reward accuracy on the test set indicates model quality
  • The model can also be used with external reward model libraries like ArmoRM
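The validation described above can be sketched as two small helpers: one that mirrors the cache_dir/exp_name/FINAL checkpoint layout, and one that summarizes the score distribution over labeled samples (a sanity-check sketch, not part of HALOs itself):

```python
import os
import statistics

def final_checkpoint_path(cache_dir, exp_name):
    # Mirrors the cache_dir/exp_name/FINAL layout described above
    return os.path.join(cache_dir, exp_name, "FINAL")

def score_distribution(scores):
    # Quick sanity check on scored sample outputs: a degenerate
    # (near-zero spread) distribution suggests the head collapsed.
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }
```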

Execution Diagram

GitHub URL

Workflow Repository