Workflow:OpenRLHF OpenRLHF Reward Model Training
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, Reward_Modeling |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
End-to-end process for training a reward model from human preference data to score language model responses by quality.
Description
This workflow trains a discriminative reward model that assigns scalar quality scores to language model responses. Starting from a pretrained or SFT-tuned model, it adds a value head (linear projection to a single scalar) and trains on preference pairs (chosen vs. rejected responses). The model learns to assign higher scores to preferred responses using a pairwise ranking loss. The trained reward model is used downstream in PPO training, rejection sampling, or conditional SFT to guide policy optimization.
Usage
Execute this workflow when you have a dataset of human preference pairs (chosen/rejected response pairs for the same prompt) and need a reward signal for RL-based alignment. This is the second stage in the canonical RLHF pipeline (after SFT, before PPO). The resulting reward model can also be used for batch scoring, rejection sampling, or iterative DPO data creation.
Execution Steps
Step 1: Configure distributed strategy
Initialize the DeepSpeed training strategy with ZeRO stage 3 (ZeRO-3), the recommended setting for reward model training, then configure precision and gradient accumulation.
Key considerations:
- ZeRO-3 shards parameters, gradients, and optimizer states across GPUs, so the full set of model states never has to fit on a single device
- bf16 mixed precision balances memory use and numerical accuracy
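The settings above can be sketched as a DeepSpeed-style configuration dictionary. This is a minimal illustration, not the configuration OpenRLHF generates internally: the helper name `build_zero3_config` and the batch sizes are assumptions, while the dictionary keys are standard DeepSpeed configuration fields.

```python
def build_zero3_config(micro_batch_size, train_batch_size, world_size, bf16=True):
    """Return a DeepSpeed-style config dict for ZeRO stage 3 (hypothetical helper)."""
    # DeepSpeed expects:
    #   train_batch_size = micro_batch_size * gradient_accumulation_steps * world_size
    per_step = micro_batch_size * world_size
    if train_batch_size % per_step != 0:
        raise ValueError("train_batch_size must be divisible by micro_batch_size * world_size")
    return {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        "gradient_accumulation_steps": train_batch_size // per_step,
        "zero_optimization": {"stage": 3},  # shard params, grads, optimizer states
        "bf16": {"enabled": bf16},          # bf16 mixed precision
    }


# Illustrative values only: 8 GPUs, one sample per GPU per micro-step.
config = build_zero3_config(micro_batch_size=1, train_batch_size=128, world_size=8)
print(config["gradient_accumulation_steps"])  # 16
```

Deriving the accumulation steps from the global and per-GPU batch sizes keeps the three values consistent, which DeepSpeed checks at initialization.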
Step 2: Load model with value head
Load a pretrained or SFT model and add a value head for scalar reward prediction. The value head is a linear layer that projects the final hidden state to a single scalar output. Initialize the value head weights.
Key considerations:
- Typically starts from the SFT model checkpoint
- The value head prefix is configurable (default: "value_head")
- LoRA can be applied for parameter-efficient reward model training
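To make the value head concrete, here is a framework-free sketch of a linear projection from a hidden state to one scalar. The class name `ValueHead`, the helper `sequence_reward`, the Gaussian initialization, and the choice to score the last non-padding token are all assumptions made for illustration; a real implementation would use a tensor library and the framework's own initialization.

```python
import random


class ValueHead:
    """Linear projection from a hidden-state vector to a single scalar (no bias)."""

    def __init__(self, hidden_size, init_std=0.02, seed=0):
        rng = random.Random(seed)
        # Assumed initialization: small zero-mean Gaussian weights.
        self.weight = [rng.gauss(0.0, init_std) for _ in range(hidden_size)]

    def __call__(self, hidden_state):
        # Dot product of the weight vector with one hidden-state vector.
        return sum(w * h for w, h in zip(self.weight, hidden_state))


def sequence_reward(head, hidden_states, attention_mask):
    """Score a sequence from the hidden state of its last non-padding token."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return head(hidden_states[last])


head = ValueHead(hidden_size=4)
hidden_states = [
    [0.1, 0.2, 0.3, 0.4],
    [1.0, 0.0, -1.0, 0.5],  # last real token
    [0.0, 0.0, 0.0, 0.0],   # padding position
]
reward = sequence_reward(head, hidden_states, attention_mask=[1, 1, 0])
```

The head adds only `hidden_size` parameters on top of the base transformer, which is why it can be trained jointly with the full model or with LoRA adapters.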
Step 3: Prepare preference dataset
Load the preference dataset containing chosen and rejected response pairs. Each example contains a prompt with two responses: one preferred (chosen) and one dispreferred (rejected). Tokenize both responses and create paired batches.
Key considerations:
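- The chosen_key and rejected_key settings name the dataset fields that hold the chosen and rejected responses
- Chat templates are applied to both chosen and rejected responses
- Sample packing can improve throughput when most sequences are much shorter than the maximum length, because it reduces padding
- Maximum sequence length should accommodate the prompt plus the longest response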
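The pairing and batching described in this step can be sketched as follows. This is an illustration under stated assumptions: `toy_tokenize` is a whitespace stand-in for a real tokenizer, plain concatenation stands in for a chat template, and the function names and the chosen-first batch layout are hypothetical rather than taken from OpenRLHF.

```python
def toy_tokenize(text):
    # Stand-in for a real tokenizer: split on whitespace.
    return text.split()


def build_pair(example, prompt_key="prompt", chosen_key="chosen",
               rejected_key="rejected", max_len=16):
    """Tokenize prompt+chosen and prompt+rejected, truncating to max_len tokens."""
    prompt = example.get(prompt_key, "")
    pair = {}
    for name, key in (("chosen", chosen_key), ("rejected", rejected_key)):
        tokens = toy_tokenize(f"{prompt} {example[key]}".strip())
        pair[name] = tokens[:max_len]
    return pair


def collate(pairs, pad="<pad>"):
    """Pad to a common width, placing all chosen rows before all rejected rows."""
    seqs = [p["chosen"] for p in pairs] + [p["rejected"] for p in pairs]
    width = max(len(s) for s in seqs)
    tokens = [s + [pad] * (width - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in seqs]
    return tokens, mask


example = {"prompt": "Q: 2+2?", "chosen": "4", "rejected": "It is 5"}
pair = build_pair(example)
tokens, mask = collate([pair])
```

Stacking chosen rows ahead of rejected rows lets one forward pass score both halves of every pair; the scores are then split at the midpoint before the loss is computed.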
Step 4: Setup optimizer and scheduler
Configure the optimizer with a learning rate appropriate for reward model training (typically higher than SFT, e.g., 9e-6). Set up cosine learning rate scheduling with warmup.
Key considerations:
- Reward model training typically uses higher learning rates than SFT
- Single epoch training is common to avoid overfitting on preference data
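A cosine schedule with linear warmup can be written in a few lines. The sketch below assumes a 3% warmup ratio and a floor at 10% of the peak rate; both values and the function name `cosine_lr` are illustrative assumptions, and only the 9e-6 peak comes from the text above.

```python
import math


def cosine_lr(step, total_steps, base_lr=9e-6, warmup_ratio=0.03, min_lr_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to a floor."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return base_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# Sample the schedule at a few points of a 1000-step run.
schedule = [cosine_lr(s, total_steps=1000) for s in (0, 30, 500, 1000)]
```

With single-epoch training, `total_steps` is simply the number of optimizer steps in one pass over the preference data, so the decay reaches its floor exactly when the data runs out.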
Step 5: Train the reward model
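Execute the reward model training loop. For each batch, score the chosen and rejected responses and compute the pairwise ranking loss derived from the Bradley-Terry model, which raises the modeled probability that the chosen response outranks the rejected one. The sigmoid form, -log σ(r_chosen - r_rejected), is the most common: minimizing it pushes the reward for chosen responses above the reward for rejected ones.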
Key considerations:
- Monitor the accuracy of chosen vs. rejected ranking on evaluation data
- Watch for reward model overfitting: evaluation accuracy plateaus and then degrades while training loss keeps falling
- The loss function is selectable: sigmoid (default) or a log-exp variant
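The sigmoid pairwise loss and the ranking accuracy metric can be sketched without any framework. This is the textbook Bradley-Terry form, loss = -log σ(r_chosen - r_rejected) averaged over the batch, and is not copied from the OpenRLHF source; the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) for both signs of x.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(chosen_rewards, rejected_rewards):
    """Mean of -log sigmoid(r_chosen - r_rejected) over all pairs."""
    losses = [-log_sigmoid(c - r) for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)


def ranking_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs in which the chosen response scores strictly higher."""
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)


chosen = [2.0, 0.0, 1.0]
rejected = [1.0, 1.0, 0.0]
loss = pairwise_sigmoid_loss(chosen, rejected)
accuracy = ranking_accuracy(chosen, rejected)
```

When the two scores are equal the per-pair loss is log 2, so a batch loss near 0.693 together with accuracy near 50% indicates the model has not yet learned to separate the pairs.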
Step 6: Save reward model
Save the complete reward model including the value head weights. Store the value head prefix in the model config for downstream loading.
Key considerations:
- The saved model includes both the base transformer and the value head
- Downstream consumers (PPO, batch inference) need the value_head_prefix to locate and load the value head weights
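The role of the stored prefix can be illustrated with a toy save and load round trip. Everything here is a simplified assumption: plain JSON files stand in for a real checkpoint format, `weights.json` is a hypothetical file name, and the helper functions are not OpenRLHF APIs. The point is only that the loader reads the prefix from the config to find the value head weights.

```python
import json
import os
import tempfile


def save_reward_model(save_dir, state_dict, base_config, value_head_prefix):
    """Write weights plus a config that records the value head prefix."""
    os.makedirs(save_dir, exist_ok=True)
    config = dict(base_config, value_head_prefix=value_head_prefix)
    with open(os.path.join(save_dir, "config.json"), "w") as f:
        json.dump(config, f)
    with open(os.path.join(save_dir, "weights.json"), "w") as f:
        json.dump(state_dict, f)


def load_value_head(save_dir):
    """Look up the value head weights using the prefix stored in the config."""
    with open(os.path.join(save_dir, "config.json")) as f:
        prefix = json.load(f)["value_head_prefix"]
    with open(os.path.join(save_dir, "weights.json")) as f:
        weights = json.load(f)
    return weights[f"{prefix}.weight"]


with tempfile.TemporaryDirectory() as tmp:
    state = {"value_head.weight": [0.1, -0.2], "layers.0.weight": [0.5, 0.5]}
    save_reward_model(tmp, state, {"hidden_size": 2}, value_head_prefix="value_head")
    restored = load_value_head(tmp)
```

If the prefix recorded in the config did not match the key under which the weights were saved, the lookup would fail, which is the loading problem the stored prefix is meant to prevent.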