Workflow:OpenRLHF OpenRLHF Reward Model Training

Knowledge Sources	OpenRLHF, Hugging Face Transformers, DeepSpeed
Domains	LLMs, RLHF, Reward_Modeling
Last Updated	2026-02-07 10:00 GMT

Overview

End-to-end process for training a reward model from human preference data to score language model responses by quality.

Description

This workflow trains a discriminative reward model that assigns scalar quality scores to language model responses. Starting from a pretrained or SFT-tuned model, it adds a value head (linear projection to a single scalar) and trains on preference pairs (chosen vs. rejected responses). The model learns to assign higher scores to preferred responses using a pairwise ranking loss. The trained reward model is used downstream in PPO training, rejection sampling, or conditional SFT to guide policy optimization.

Usage

Execute this workflow when you have a dataset of human preference pairs (chosen/rejected response pairs for the same prompt) and need a reward signal for RL-based alignment. This is the second stage in the canonical RLHF pipeline (after SFT, before PPO). The resulting reward model can also be used for batch scoring, rejection sampling, or iterative DPO data creation.

Execution Steps

Step 1: Configure distributed strategy

Initialize the DeepSpeed training strategy with ZeRO Stage 3, which shards parameters, gradients, and optimizer states across GPUs. This is the recommended setting for reward models, which must hold the full transformer plus optimizer state in memory during training. Configure precision and gradient accumulation settings.

Key considerations:

ZeRO-3 shards model states across GPUs and is the usual choice for reward model training
bf16 mixed precision balances memory and accuracy
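
The settings above can be sketched as a DeepSpeed-style configuration dictionary. This is a minimal illustration, not the strategy object OpenRLHF builds internally: the helper name build_ds_config and the batch sizes are hypothetical, and only widely documented DeepSpeed config keys are used.

```python
def build_ds_config(train_batch_size, micro_batch_size, world_size,
                    zero_stage=3, bf16=True):
    """Return a DeepSpeed-style config dict (hypothetical helper for illustration)."""
    samples_per_step = micro_batch_size * world_size
    if train_batch_size % samples_per_step != 0:
        raise ValueError(
            "train_batch_size must be divisible by micro_batch_size * world_size"
        )
    return {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_size,
        # Number of micro-batches accumulated before each optimizer step.
        "gradient_accumulation_steps": train_batch_size // samples_per_step,
        # Stage 3 shards parameters, gradients, and optimizer states.
        "zero_optimization": {"stage": zero_stage},
        "bf16": {"enabled": bf16},
    }


# Assumed example values: 8 GPUs, micro-batch of 1, global batch of 256.
config = build_ds_config(train_batch_size=256, micro_batch_size=1, world_size=8)
print(config["gradient_accumulation_steps"])
```

The global batch size fixes the gradient accumulation count once the micro-batch size and GPU count are known, so it is derived here rather than set by hand.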

Step 2: Load model with value head

Load a pretrained or SFT model and add a value head for scalar reward prediction. The value head is a linear layer that projects the final hidden state to a single scalar output. Initialize the value head weights.

Key considerations:

Typically starts from the SFT model checkpoint
The value head prefix is configurable (default: "value_head")
LoRA can be applied for parameter-efficient reward model training
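
As a rough illustration of what the value head computes, the sketch below uses plain Python lists in place of tensors. Reading the reward at the last non-padding token is a common convention and an assumption here, as are the function names and the initialization standard deviation; the real model applies the same projection with a framework linear layer.

```python
import random


def init_value_head(hidden_size, std=0.02, seed=0):
    """Sample value head weights from N(0, std^2). The std is an illustrative choice."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(hidden_size)]


def last_token_index(attention_mask):
    """Return the index of the last position whose mask value is 1."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("sequence contains no real tokens")


def reward_from_hidden(hidden_states, attention_mask, weight):
    """Project the final hidden state of the last real token to a scalar."""
    h = hidden_states[last_token_index(attention_mask)]
    return sum(w * x for w, x in zip(weight, h))


# Toy sequence of 3 positions with hidden size 2; the last position is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]
weight = [0.5, -0.25]
reward = reward_from_hidden(hidden_states, attention_mask, weight)
print(reward)
```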

Step 3: Prepare preference dataset

Load the preference dataset containing chosen and rejected response pairs. Each example contains a prompt with two responses: one preferred (chosen) and one dispreferred (rejected). Tokenize both responses and create paired batches.

Key considerations:

The dataset must expose the chosen and rejected responses in the fields named by chosen_key and rejected_key
Chat templates are applied to both chosen and rejected responses
Sample packing can improve throughput for shorter sequences
Maximum sequence length should accommodate the longest responses
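
The pairing step can be sketched as follows. This is a simplified stand-in under stated assumptions: whitespace splitting replaces a real tokenizer, no chat template is applied, the default field names are hypothetical, and truncation simply keeps the first max_len tokens.

```python
def build_pair(example, max_len, prompt_key="prompt",
               chosen_key="chosen", rejected_key="rejected",
               tokenize=str.split):
    """Turn one preference example into a (chosen, rejected) token pair."""
    prompt = example[prompt_key]
    # Both sequences share the same prompt and differ only in the response.
    chosen = tokenize(prompt + " " + example[chosen_key])[:max_len]
    rejected = tokenize(prompt + " " + example[rejected_key])[:max_len]
    return {"chosen_ids": chosen, "rejected_ids": rejected}


def pad_batch(pairs, pad_token="<pad>"):
    """Right-pad every sequence in the batch to the longest sequence."""
    longest = max(
        len(seq) for p in pairs for seq in (p["chosen_ids"], p["rejected_ids"])
    )
    return [
        {k: seq + [pad_token] * (longest - len(seq)) for k, seq in p.items()}
        for p in pairs
    ]


example = {"prompt": "Q: 2+2?", "chosen": "4", "rejected": "5 maybe"}
pair = build_pair(example, max_len=8)
batch = pad_batch([pair])
print(batch[0])
```

Because truncation in this sketch cuts from the end, a max_len that is too small removes part of the response, which is why the maximum sequence length should cover the longest responses.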

Step 4: Setup optimizer and scheduler

Configure the optimizer with a learning rate appropriate for reward model training (typically higher than SFT, e.g., 9e-6). Set up cosine learning rate scheduling with warmup.

Key considerations:

Reward model training typically uses higher learning rates than SFT
Single epoch training is common to avoid overfitting on preference data
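
A cosine schedule with linear warmup can be written in a few lines. The sketch below is a generic formulation rather than the scheduler OpenRLHF instantiates; the warmup ratio and the minimum learning rate ratio are assumed values chosen for illustration.

```python
import math


def lr_at(step, total_steps, base_lr=9e-6, warmup_ratio=0.03, min_lr_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup steps.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    # Decay from base_lr down to min_lr_ratio * base_lr.
    return base_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


for step in (0, 30, 500, 1000):
    print(step, lr_at(step, total_steps=1000))
```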

Step 5: Train the reward model

Execute the reward model training loop. For each batch, score the chosen and rejected responses and compute the pairwise ranking loss derived from the Bradley-Terry model. The sigmoid variant, -log(sigmoid(r_chosen - r_rejected)), is the most common: minimizing it pushes the reward for the chosen response above the reward for the rejected one.

Key considerations:

Monitor the accuracy of chosen vs. rejected ranking on evaluation data
Watch for reward model overfitting (evaluation accuracy plateaus and then degrades while training loss keeps falling)
The loss function can be sigmoid (default), ordinal, or other variants
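
The sigmoid pairwise loss and the ranking accuracy metric can be sketched in plain Python. The function names are hypothetical and scalar lists stand in for reward tensors; the loss is the standard Bradley-Terry negative log-likelihood, computed in a numerically stable form.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_sigmoid_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over the batch."""
    # -log(sigmoid(d)) equals log(1 + exp(-d)), i.e. softplus(-d).
    losses = [softplus(-(c - r)) for c, r in zip(chosen_rewards, rejected_rewards)]
    return sum(losses) / len(losses)


def ranking_accuracy(chosen_rewards, rejected_rewards):
    """Fraction of pairs where the chosen response scores strictly higher."""
    correct = sum(c > r for c, r in zip(chosen_rewards, rejected_rewards))
    return correct / len(chosen_rewards)


chosen = [1.0, 0.0, 3.0]
rejected = [0.0, 1.0, 1.0]
print(pairwise_sigmoid_loss(chosen, rejected))
print(ranking_accuracy(chosen, rejected))
```

When both responses receive the same score the loss equals log 2, and it falls toward zero as the chosen score moves further above the rejected score.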

Step 6: Save reward model

Save the complete reward model including the value head weights. Store the value head prefix in the model config for downstream loading.

Key considerations:

The saved model includes both the base transformer and the value head
Downstream consumers (PPO, batch inference) need the value_head_prefix to load correctly
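
The idea of recording the prefix alongside the weights can be illustrated with a plain JSON round trip. This is only a sketch: the helper names are hypothetical, the config key value_head_prefix is assumed from the description above, and the actual save routine is the framework's own checkpointing code rather than this file write.

```python
import json
import os
import tempfile


def save_reward_config(output_dir, base_config, value_head_prefix="value_head"):
    """Write a config.json that records which prefix names the value head."""
    config = dict(base_config)
    config["value_head_prefix"] = value_head_prefix
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "config.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2)
    return path


def load_value_head_prefix(model_dir, default="value_head"):
    """Read the prefix back so a consumer can locate the value head weights."""
    with open(os.path.join(model_dir, "config.json"), encoding="utf-8") as f:
        return json.load(f).get("value_head_prefix", default)


# Assumed toy config; a real checkpoint config holds many more fields.
with tempfile.TemporaryDirectory() as tmp:
    save_reward_config(tmp, {"hidden_size": 4096}, value_head_prefix="value_head")
    loaded_prefix = load_value_head_prefix(tmp)
print(loaded_prefix)
```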
