Principle: Allenai Open Instruct Score Head Initialization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Weight Initialization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Score head initialization is the practice of initializing the reward model's linear projection layer (score head) with weights drawn from a normal distribution with a carefully chosen small standard deviation, ensuring that initial reward predictions are close to zero and do not introduce large, destabilizing gradients during early training.
Description
When a pre-trained language model is adapted into a reward model, a new linear layer (the "score head") is appended to project the transformer's hidden states to a single scalar reward value. The weights of this score head are not present in the pre-trained checkpoint and must be initialized from scratch.
Naive random initialization (e.g., using the default PyTorch initialization for nn.Linear, which uses Kaiming uniform) can produce initial reward predictions with large magnitudes. This is problematic because:
- Large initial rewards create large gradients: For the Bradley-Terry loss $-\log\sigma(r_\text{chosen} - r_\text{rejected})$, the gradient with respect to the reward difference has magnitude $\sigma(-(r_\text{chosen} - r_\text{rejected}))$. If initial rewards are large and varied, many pairs start out confidently wrong, so their gradients are near-saturated and the early updates can be destabilizing.
- Reward magnitude affects downstream RL training: If the reward model develops a habit of producing large-magnitude rewards during training, this can destabilize the subsequent RL optimization phase (e.g., PPO or GRPO).
- Symmetry breaking should be gentle: The score head only needs small initial asymmetries to begin differentiating between chosen and rejected completions; large initial values are unnecessary and counterproductive.
The solution, as described in Stiennon et al. (2020), is to initialize the score head weights from a normal distribution whose standard deviation is inversely proportional to the square root of the input dimension:

$$\sigma_w = \frac{1}{\sqrt{d + 1}}$$

where $d$ is the hidden dimension of the transformer model. This ensures that the initial output variance is approximately that of a single hidden-state component, regardless of the hidden dimension, following the principle that each weight contributes proportionally less as the fan-in increases.
Usage
Use this initialization strategy whenever:
- Creating a new reward model from a pre-trained language model backbone.
- Adding any new linear projection head on top of a transformer whose outputs should start near zero.
- You need to ensure that the initial model outputs have controlled variance to prevent training instabilities.
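As a concrete sketch of this usage, a small helper (the name `init_scalar_head` is hypothetical, not an Open Instruct API; it assumes the new head is a PyTorch `nn.Linear`) that applies the scheme to any freshly added scalar projection:

```python
import math

import torch.nn as nn


def init_scalar_head(head: nn.Linear) -> nn.Linear:
    """Initialize a freshly added scalar head so its outputs start near zero.

    Uses the Stiennon et al. (2020) scheme: weights ~ N(0, 1/(fan_in + 1)),
    bias (if present) set to zero. Hypothetical helper, shown for illustration.
    """
    fan_in = head.in_features
    head.weight.data.normal_(mean=0.0, std=1.0 / math.sqrt(fan_in + 1))
    if head.bias is not None:
        head.bias.data.zero_()
    return head


# Example: a new score head on top of a 4096-dimensional backbone.
score_head = init_scalar_head(nn.Linear(4096, 1))
```

The same helper works for any near-zero-output projection head, not only reward models.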
Theoretical Basis
Consider the score head as a linear projection:

$$r = w^\top h$$

where $h \in \mathbb{R}^d$ is the hidden state and $w \in \mathbb{R}^d$ are the weights. If we assume the hidden-state components are roughly zero-mean with some variance $\sigma_h^2$, then the variance of the output is:

$$\operatorname{Var}(r) = d\,\sigma_w^2\,\sigma_h^2$$

By setting $\sigma_w = 1/\sqrt{d+1}$, we get:

$$\operatorname{Var}(r) = \frac{d}{d+1}\,\sigma_h^2 \approx \sigma_h^2$$

This means the initial reward predictions will have approximately the same variance as a single component of the hidden state, which is a small and well-controlled value. The $+1$ in the denominator is a minor correction that accounts for the bias term and ensures numerical stability when $d$ is small.
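The derivation can be checked numerically. A minimal Monte Carlo sketch (NumPy, assuming unit per-component hidden-state variance for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096          # hidden dimension
sigma_h = 1.0     # assumed per-component hidden-state std
n = 20000         # number of simulated examples

# Fake hidden states and a score head initialized with std = 1/sqrt(d + 1).
h = rng.normal(0.0, sigma_h, size=(n, d))
w = rng.normal(0.0, 1.0 / np.sqrt(d + 1), size=d)

# Initial "rewards": variance should land near d/(d+1) * sigma_h**2 ~ 1.0.
r = h @ w
print(r.var())
```

Rerunning with a different `d` leaves the output variance essentially unchanged, which is the point of scaling the std by the fan-in.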
In Open Instruct, the score head weights are drawn from $\mathcal{N}(0,\, 1/(d+1))$, i.e. a `normal_` call with `std = 1 / sqrt(hidden_size + 1)`. This follows p. 11 of Stiennon et al. (2020), "Learning to summarize from human feedback."
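In code, this amounts to a one-line `normal_` on the new head's weight. A self-contained sketch (the stand-in `score` layer mirrors the scalar head that HF sequence-classification models expose; exact module names in the Open Instruct source are not verified here):

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 4096

# Stand-in for the new scalar projection appended to a pretrained backbone.
score = nn.Linear(hidden_size, 1, bias=False)

# Stiennon et al. (2020)-style init: weights ~ N(0, 1/(hidden_size + 1)).
score.weight.data.normal_(mean=0.0, std=1.0 / math.sqrt(hidden_size + 1))

# Initial rewards on random hidden states stay on the order of a single
# hidden-state component's scale rather than blowing up with dimension.
hidden_states = torch.randn(8, hidden_size)
rewards = score(hidden_states).squeeze(-1)
print(rewards.std())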
Comparison with Default Initialization
| Method | Standard Deviation | Initial Output Scale (d=4096) |
|---|---|---|
| Kaiming Uniform (PyTorch default) | $1/\sqrt{3d} \approx 0.0090$ | $\approx 0.58\,\sigma_h$: moderate, but input-scale dependent |
| Score Head Init (Open Instruct) | $1/\sqrt{d+1} \approx 0.0156$ | $\approx \sigma_h$: controlled, near-zero rewards |
| Xavier Normal | $\sqrt{2/(d+1)} \approx 0.0221$ | $\approx 1.41\,\sigma_h$: balanced for deep networks |
| Large Random Init | e.g. $\sigma = 1$ | $\approx 64\,\sigma_h$: very large, unstable early training |
For typical transformer hidden dimensions (2048-8192), the Open Instruct approach and the Kaiming default produce numerically similar weight scales, but the intentional choice and explicit use of `normal_` initialization (rather than relying on default behavior) makes the design decision clear and reproducible.
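The standard deviations in the table above can be reproduced directly. PyTorch's default `nn.Linear` init is Kaiming uniform with bound $1/\sqrt{d}$, which corresponds to a std of $1/\sqrt{3d}$; the other two follow from their formulas:

```python
import math

d = 4096

# PyTorch default: U(-1/sqrt(d), 1/sqrt(d)) has std (1/sqrt(d)) / sqrt(3).
kaiming_uniform_std = (1.0 / math.sqrt(d)) / math.sqrt(3.0)

# Open Instruct / Stiennon et al. (2020): N(0, 1/(d + 1)).
score_head_std = 1.0 / math.sqrt(d + 1)

# Xavier normal with fan_in = d, fan_out = 1: sqrt(2 / (d + 1)).
xavier_normal_std = math.sqrt(2.0 / (d + 1))

print(f"{kaiming_uniform_std:.4f} {score_head_std:.4f} {xavier_normal_std:.4f}")
# -> 0.0090 0.0156 0.0221
```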