# Principle:Allenai Open instruct Reward Model Initialization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
Reward model initialization is the process of converting a pre-trained or supervised fine-tuned language model into a scalar-valued reward function by replacing the language modeling head with a single-output score head, enabling the model to predict human preference scores for generated text.
## Description
In Reinforcement Learning from Human Feedback (RLHF), a reward model is needed to provide a scalar signal indicating how well a given text completion aligns with human preferences. Rather than training a reward model from scratch, the standard approach is to initialize the reward model from a pre-trained language model (typically one that has undergone supervised fine-tuning on instruction-following data). This leverages the rich language understanding already encoded in the pre-trained weights.
The key architectural modification is replacing the causal language modeling head (which predicts token probabilities over the vocabulary) with a sequence classification head that outputs a single scalar value. This is accomplished by using a sequence classification architecture with num_labels=1, which adds a linear projection layer on top of the transformer's last hidden state. The projection maps from the model's hidden dimension to a single real number, which serves as the reward score.
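The mechanics of this head can be sketched in plain PyTorch (the names `ScoreHead` and `score` are illustrative; in practice the head comes from a Hugging Face sequence classification class rather than being hand-written):

```python
import torch
import torch.nn as nn

class ScoreHead(nn.Module):
    """Scalar reward head: projects the transformer's last hidden state to one score."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        # num_labels=1 -> a single-output linear projection
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim); attention_mask: (batch, seq_len)
        # Index of the final non-padding token in each sequence
        last_idx = attention_mask.sum(dim=1) - 1                 # (batch,)
        batch_idx = torch.arange(hidden_states.size(0))
        last_hidden = hidden_states[batch_idx, last_idx]         # (batch, hidden_dim)
        return self.score(last_hidden).squeeze(-1)               # (batch,) scalar rewards

head = ScoreHead(hidden_dim=16)
h = torch.randn(2, 5, 16)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
rewards = head(h, mask)  # one scalar reward per sequence
```

Note that the reward is read off at the final non-padding position, so padded batches of different lengths score correctly.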
This initialization strategy offers several advantages:
- Transfer of language understanding: The transformer backbone retains its ability to understand syntax, semantics, and context from pre-training, so the reward model only needs to learn the preference mapping rather than language understanding from scratch.
- Efficient training: Starting from a strong language model checkpoint means the reward model converges faster and requires significantly less preference data compared to training from random initialization.
- Alignment with the policy model: When the reward model is initialized from the same checkpoint as the policy being optimized, the reward model's internal representations are well-matched to the policy's output distribution, leading to more stable RLHF training.
An important additional step during initialization is disabling dropout throughout the model. As described in Stiennon et al. (2020), removing dropout improves the stability and consistency of reward predictions during both training and inference. Since the reward model is evaluated deterministically (no sampling), stochastic dropout is unnecessary and can introduce unwanted noise into reward estimates.
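This dropout-disabling step can be sketched as a simple module traversal (`disable_dropout` is an illustrative name, not necessarily the library's own helper):

```python
import torch
import torch.nn as nn

def disable_dropout(model: nn.Module) -> None:
    """Set every Dropout module's probability to 0 so reward scores are deterministic."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0

# Toy model standing in for a transformer with dropout between layers
model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.1), nn.Linear(8, 1))
disable_dropout(model)

# Even in train mode, the forward pass is now deterministic
model.train()
x = torch.randn(4, 8)
assert torch.equal(model(x), model(x))
```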
## Usage
Use reward model initialization when:
- Building a reward model for RLHF or preference-based training pipelines.
- You have access to a pre-trained language model (base or SFT checkpoint) and want to repurpose it as a preference scorer.
- You need a model that outputs scalar reward values for ranking or comparing text completions.
- You want to leverage existing language understanding rather than training a reward function from scratch.
## Theoretical Basis
The reward model is defined as a parameterized function $r_\theta(x, y)$ that maps a prompt-completion pair $(x, y)$ to a scalar reward. The architecture is:

$$r_\theta(x, y) = W\,h_{\text{last}} + b$$

where:
- $h_{\text{last}} \in \mathbb{R}^{d}$ is the last hidden state of the transformer at the final non-padding token position,
- $W \in \mathbb{R}^{1 \times d}$ is the score head weight matrix,
- $b \in \mathbb{R}$ is the score head bias (when present),
- $d$ is the hidden dimension of the transformer model.
The transformer backbone parameters are initialized from the pre-trained checkpoint:

$$\theta_{\text{backbone}} \leftarrow \theta_{\text{pretrained}}$$
Only the score head is randomly initialized (typically with a controlled small standard deviation; see Principle:Allenai_Open_instruct_Score_Head_Initialization).
The choice of num_labels=1 in the sequence classification setup means the model outputs a single unbounded scalar rather than class logits. This unbounded scalar is essential because the Bradley-Terry preference model operates on reward differences, and constraining the output range could limit the model's capacity to distinguish between responses of varying quality.
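Concretely, the Bradley-Terry objective trains only on the difference of two such scalars (a sketch; `r_chosen` and `r_rejected` are illustrative stand-ins for score-head outputs on a preference pair):

```python
import torch
import torch.nn.functional as F

# Scalar rewards from the score head for preferred and rejected completions
r_chosen = torch.tensor([1.2, 0.3])
r_rejected = torch.tensor([-0.5, 0.9])

# Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected).
# Only the difference matters, so an unbounded output range costs nothing
# and lets the model spread scores to separate many quality levels.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
```

Because the loss depends only on $r_\theta$ differences, any global shift of the rewards leaves it unchanged; bounding the outputs (e.g. with a sigmoid) would only compress the differences the model can express.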
## Pseudocode
1. Load pre-trained transformer model with a sequence classification head (num_labels=1)
2. Optionally resize token embeddings to match the tokenizer vocabulary
3. Disable all dropout layers in the model (set p=0)
4. Initialize the score head weights with a small standard deviation
5. Enable gradient checkpointing if memory-constrained
6. The model is now ready for reward model training on preference data
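The steps above can be sketched with the Hugging Face transformers API. This sketch instantiates a tiny GPT-2 from config so it runs without downloads; in practice step 1 would load a real checkpoint via `AutoModelForSequenceClassification.from_pretrained(..., num_labels=1)`, and the init scale used here is illustrative, not the library's exact value:

```python
import torch
from transformers import GPT2Config, GPT2ForSequenceClassification

# Step 1: sequence classification head with a single output (scalar reward).
# A tiny randomly initialized config stands in for a pre-trained checkpoint.
config = GPT2Config(n_embd=32, n_layer=2, n_head=2, num_labels=1, pad_token_id=0)
model = GPT2ForSequenceClassification(config)

# Step 2 (resize_token_embeddings) is omitted: no tokenizer in this sketch.

# Step 3: disable all dropout layers
for module in model.modules():
    if isinstance(module, torch.nn.Dropout):
        module.p = 0.0

# Step 4: re-initialize the score head with a small standard deviation
# (std=0.01 is illustrative; see the score head initialization principle)
torch.nn.init.normal_(model.score.weight, std=0.01)

# Step 5: optional gradient checkpointing for memory-constrained training
model.gradient_checkpointing_enable()

# Step 6: the model now emits one scalar per sequence, ready for preference training
assert model.score.out_features == 1
```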