Principle: Allenai Open Instruct Score Head Initialization
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement Learning from Human Feedback, Reward Modeling, Weight Initialization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Score head initialization is the practice of initializing the reward model's linear projection layer (score head) with weights drawn from a normal distribution with a carefully chosen small standard deviation, ensuring that initial reward predictions are close to zero and do not introduce large, destabilizing gradients during early training.
Description
When a pre-trained language model is adapted into a reward model, a new linear layer (the "score head") is appended to project the transformer's hidden states to a single scalar reward value. The weights of this score head are not present in the pre-trained checkpoint and must be initialized from scratch.
Naive random initialization (e.g., using the default PyTorch initialization for nn.Linear, which uses Kaiming uniform) can produce initial reward predictions with large magnitudes. This is problematic because:
- Large initial rewards create large gradients: For the Bradley-Terry loss $-\log\sigma(r_\text{chosen} - r_\text{rejected})$, the gradient with respect to the reward difference has magnitude $\sigma(-(r_\text{chosen} - r_\text{rejected}))$. If initial rewards are large and varied, many pairs start out confidently wrong, so their gradients are near-saturated and the early updates can be destabilizing.
- Reward magnitude affects downstream RL training: If the reward model develops a habit of producing large-magnitude rewards during training, this can destabilize the subsequent RL optimization phase (e.g., PPO or GRPO).
- Symmetry breaking should be gentle: The score head only needs small initial asymmetries to begin differentiating between chosen and rejected completions; large initial values are unnecessary and counterproductive.
The solution, as described in Stiennon et al. (2020), is to initialize the score head weights from a normal distribution whose standard deviation is inversely proportional to the square root of the input dimension:

$$\sigma_w = \frac{1}{\sqrt{d + 1}}$$

where $d$ is the hidden dimension of the transformer model. This ensures that the initial output variance is approximately that of a single hidden-state component, regardless of the hidden dimension, following the principle that each weight contributes proportionally less as the fan-in increases.
Usage
Use this initialization strategy whenever:
- Creating a new reward model from a pre-trained language model backbone.
- Adding any new linear projection head on top of a transformer whose outputs should start near zero.
- You need to ensure that the initial model outputs have controlled variance to prevent training instabilities.
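As a concrete sketch of this usage, a small helper (the name `init_scalar_head` is hypothetical, not an Open Instruct API; it assumes the new head is a PyTorch `nn.Linear`) that applies the scheme to any freshly added scalar projection:

```python
import math

import torch.nn as nn


def init_scalar_head(head: nn.Linear) -> nn.Linear:
    """Initialize a freshly added scalar head so its outputs start near zero.

    Uses the Stiennon et al. (2020) scheme: weights ~ N(0, 1/(fan_in + 1)),
    bias (if present) set to zero. Hypothetical helper, shown for illustration.
    """
    fan_in = head.in_features
    head.weight.data.normal_(mean=0.0, std=1.0 / math.sqrt(fan_in + 1))
    if head.bias is not None:
        head.bias.data.zero_()
    return head


# Example: a new score head on top of a 4096-dimensional backbone.
score_head = init_scalar_head(nn.Linear(4096, 1))
```

The same helper works for any near-zero-output projection head, not only reward models.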
Theoretical Basis
Consider the score head as a linear projection:

$$r = w^\top h$$

where $h \in \mathbb{R}^d$ is the hidden state and $w \in \mathbb{R}^d$ are the weights. If we assume the hidden-state components are roughly zero-mean with some variance $\sigma_h^2$, then the variance of the output is:

$$\operatorname{Var}(r) = d\,\sigma_w^2\,\sigma_h^2$$

By setting $\sigma_w = 1/\sqrt{d+1}$, we get:

$$\operatorname{Var}(r) = \frac{d}{d+1}\,\sigma_h^2 \approx \sigma_h^2$$

This means the initial reward predictions will have approximately the same variance as a single component of the hidden state, which is a small and well-controlled value. The $+1$ in the denominator is a minor correction that accounts for the bias term and ensures numerical stability when $d$ is small.
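The derivation can be checked numerically. A minimal Monte Carlo sketch (NumPy, assuming unit per-component hidden-state variance for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4096          # hidden dimension
sigma_h = 1.0     # assumed per-component hidden-state std
n = 20000         # number of simulated examples

# Fake hidden states and a score head initialized with std = 1/sqrt(d + 1).
h = rng.normal(0.0, sigma_h, size=(n, d))
w = rng.normal(0.0, 1.0 / np.sqrt(d + 1), size=d)

# Initial "rewards": variance should land near d/(d+1) * sigma_h**2 ~ 1.0.
r = h @ w
print(r.var())
```

Rerunning with a different `d` leaves the output variance essentially unchanged, which is the point of scaling the std by the fan-in.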
In Open Instruct, the score head weights are drawn from $\mathcal{N}(0,\, 1/(d+1))$, i.e. a `normal_` call with `std = 1 / sqrt(hidden_size + 1)`. This follows p. 11 of Stiennon et al. (2020), "Learning to summarize from human feedback."
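In code, this amounts to a one-line `normal_` on the new head's weight. A self-contained sketch (the stand-in `score` layer mirrors the scalar head that HF sequence-classification models expose; exact module names in the Open Instruct source are not verified here):

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_size = 4096

# Stand-in for the new scalar projection appended to a pretrained backbone.
score = nn.Linear(hidden_size, 1, bias=False)

# Stiennon et al. (2020)-style init: weights ~ N(0, 1/(hidden_size + 1)).
score.weight.data.normal_(mean=0.0, std=1.0 / math.sqrt(hidden_size + 1))

# Initial rewards on random hidden states stay on the order of a single
# hidden-state component's scale rather than blowing up with dimension.
hidden_states = torch.randn(8, hidden_size)
rewards = score(hidden_states).squeeze(-1)
print(rewards.std())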
Comparison with Default Initialization
| Method | Standard Deviation | Initial Output Scale (d=4096) |
|---|---|---|
| Kaiming Uniform (PyTorch default) | $1/\sqrt{3d} \approx 0.0090$ | $\approx 0.58\,\sigma_h$: moderate, but input-scale dependent |
| Score Head Init (Open Instruct) | $1/\sqrt{d+1} \approx 0.0156$ | $\approx \sigma_h$: controlled, near-zero rewards |
| Xavier Normal | $\sqrt{2/(d+1)} \approx 0.0221$ | $\approx 1.41\,\sigma_h$: balanced for deep networks |
| Large Random Init | e.g. $\sigma = 1$ | $\approx 64\,\sigma_h$: very large, unstable early training |
For typical transformer hidden dimensions (2048-8192), the Open Instruct approach and the Kaiming default produce numerically similar weight scales, but the intentional choice and explicit use of `normal_` initialization (rather than relying on default behavior) makes the design decision clear and reproducible.
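The standard deviations in the table above can be reproduced directly. PyTorch's default `nn.Linear` init is Kaiming uniform with bound $1/\sqrt{d}$, which corresponds to a std of $1/\sqrt{3d}$; the other two follow from their formulas:

```python
import math

d = 4096

# PyTorch default: U(-1/sqrt(d), 1/sqrt(d)) has std (1/sqrt(d)) / sqrt(3).
kaiming_uniform_std = (1.0 / math.sqrt(d)) / math.sqrt(3.0)

# Open Instruct / Stiennon et al. (2020): N(0, 1/(d + 1)).
score_head_std = 1.0 / math.sqrt(d + 1)

# Xavier normal with fan_in = d, fan_out = 1: sqrt(2 / (d + 1)).
xavier_normal_std = math.sqrt(2.0 / (d + 1))

print(f"{kaiming_uniform_std:.4f} {score_head_std:.4f} {xavier_normal_std:.4f}")
# -> 0.0090 0.0156 0.0221
```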