
Workflow:ContextualAI HALOs Reward Model Training

From Leeroopedia


Knowledge Sources
Domains LLMs, Reward_Modeling, Alignment, LLM_Ops
Last Updated 2026-02-08 03:00 GMT

Overview

End-to-end process for training a Bradley-Terry reward model on pairwise preference data, producing a model that can score (prompt, response) pairs for use in alignment pipelines.

Description

This workflow trains a reward model using the Bradley-Terry framework, in which the model learns to assign higher scores to preferred responses than to rejected ones, given pairwise comparisons. The resulting reward model can be used as a labeler in the Online Iterative Alignment workflow or for scoring model outputs during evaluation. The reward model is built on top of a language model backbone using AutoModelForBradleyTerry, which adds a classification head that outputs reward scores.

Goals:

  • Train a model that reliably distinguishes preferred from rejected responses
  • Produce a reward model checkpoint usable by the labeling script train.label
  • Enable automated feedback generation for online alignment loops

Scope:

  • From pairwise preference data to a saved reward model checkpoint
  • Uses the same training infrastructure (Hydra, Accelerate, FSDP) as alignment training

Strategy:

  • Uses BradleyTerryTrainer with PairedPreferenceDataLoader
  • The model learns a scalar reward function from pairwise comparisons
  • No reference model is needed (use_reference_model=false in the trainer)
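The Bradley-Terry objective maximizes the likelihood of the observed preference: P(chosen ≻ rejected) = σ(r_chosen − r_rejected). A cross-entropy over two logits, one per response, reduces to exactly this expression for a single pair. A minimal sketch in plain Python (illustrative only, not the HALOs implementation):

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry."""
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    margin = reward_chosen - reward_rejected
    prob_chosen = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(prob_chosen)
```

Note that the loss depends only on the score margin, so rewards are identifiable only up to a constant shift.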

Usage

Execute this workflow when you need a reward model to score LLM outputs for use in online alignment or to create labeled datasets from unlabeled samples. The reward model is trained on pairwise preference data where each example has a prompt with a chosen and a rejected response. Common input datasets include UltraFeedback and SHP.

Execution Steps

Step 1: Prepare_Preference_Data

Gather or select a pairwise preference dataset where each example contains a prompt, a chosen response, and a rejected response. HALOs supports built-in pairwise datasets (UltraFeedback, SHP, HH) or custom JSON files following the pairwise feedback schema.

Key considerations:

  • Data must be in pairwise format (prompt, chosen output, rejected output)
  • The PairedPreferenceDataLoader handles tokenization and batching
  • Dataset quality directly impacts reward model reliability
  • Multiple datasets can be combined by passing a list of names
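A custom pairwise record might look like the following. This is an illustrative example only; the field names are assumptions, not the exact HALOs pairwise feedback schema, so consult the repo's dataloader for the authoritative format.

```python
import json

# Illustrative pairwise preference record (field names are assumptions,
# not the verified HALOs schema).
example = {
    "prompt": "Explain why the sky is blue in one sentence.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue "
              "wavelengths scatter the most.",
    "rejected": "Because the ocean reflects onto it.",
}

# Serialize as one JSON record, as a custom JSON dataset file would contain
record = json.dumps(example)
```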

Step 2: Configure_Training

Set up the Hydra configuration for Bradley-Terry training. The loss config specifies BradleyTerryTrainer and PairedPreferenceDataLoader. Choose the base model architecture (e.g., Llama, Mistral) and set hyperparameters.

What happens:

  • Select loss=bradley-terry in the launch command
  • Choose the model config (model=llama, model=mistral, etc.)
  • The AutoModelForBradleyTerry wrapper adds a two-class classification head to the base language model
  • No reference model is loaded since Bradley-Terry training does not use one
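The two-class head can be pictured as a linear projection from a final hidden state to two logits, with the scalar reward read off from the logits. The sketch below is a toy illustration under assumptions; the real AutoModelForBradleyTerry wrapper may pool hidden states and extract the score differently.

```python
# Toy two-class reward head in plain Python (illustrative, not the HALOs code).
def reward_head(hidden, weights, bias):
    """Linear projection: final hidden state (list of floats) -> two logits."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weights, bias)]

def reward_score(logits):
    """One common convention (an assumption here): the reward is the
    'preferred' class logit."""
    return logits[1]
```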

Step 3: Train_Reward_Model

Launch the training job using Accelerate with FSDP. The BradleyTerryTrainer computes a cross-entropy loss over the pairwise preferences, training the model to assign higher logits to chosen responses. The training loop uses the standard infrastructure shared with all alignment trainers.

What happens:

  • The base language model is loaded and wrapped with AutoModelForBradleyTerry
  • PairedPreferenceDataLoader tokenizes chosen and rejected responses
  • The trainer computes reward scores for both responses and optimizes the Bradley-Terry loss
  • Evaluation periodically checks reward accuracy on a held-out set
  • Training metrics are logged to Weights and Biases
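The reward accuracy checked during evaluation is simply the fraction of held-out pairs where the chosen response outscores the rejected one. A minimal sketch:

```python
def reward_accuracy(chosen_scores, rejected_scores):
    """Fraction of pairs where the chosen response gets the higher reward."""
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)
```

An untrained model sits near 0.5 on balanced pairs; accuracy well above that indicates the model has learned the preference signal.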

Step 4: Save_And_Validate

After training completes, the reward model checkpoint is saved. The model can be validated by checking its accuracy on test data or by using it to label a set of sample outputs and inspecting the score distribution.

Key considerations:

  • The checkpoint is saved to cache_dir/exp_name/FINAL
  • The saved model path is used as the --reward_model_path argument in train.label
  • Reward accuracy on the test set indicates model quality
  • The model can also be used with external reward model libraries like ArmoRM
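The validation described above can be sketched as two small helpers: one that mirrors the cache_dir/exp_name/FINAL checkpoint layout, and one that summarizes the score distribution over labeled samples (a sanity-check sketch, not part of HALOs itself):

```python
import os
import statistics

def final_checkpoint_path(cache_dir, exp_name):
    # Mirrors the cache_dir/exp_name/FINAL layout described above
    return os.path.join(cache_dir, exp_name, "FINAL")

def score_distribution(scores):
    # Quick sanity check on scored sample outputs: a degenerate
    # (near-zero spread) distribution suggests the head collapsed.
    return {
        "mean": statistics.fmean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }
```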

Execution Diagram

GitHub URL

Workflow Repository