
Principle:Allenai Open instruct Score Head Initialization

From Leeroopedia


Knowledge Sources
Domains Reinforcement Learning from Human Feedback, Reward Modeling, Weight Initialization
Last Updated 2026-02-07 00:00 GMT

Overview

Score head initialization is the practice of initializing the reward model's linear projection layer (score head) with weights drawn from a normal distribution with a carefully chosen small standard deviation, ensuring that initial reward predictions are close to zero and do not introduce large, destabilizing gradients during early training.

Description

When a pre-trained language model is adapted into a reward model, a new linear layer (the "score head") is appended to project the transformer's hidden states to a single scalar reward value. The weights of this score head are not present in the pre-trained checkpoint and must be initialized from scratch.

Naive random initialization (e.g., using the default PyTorch initialization for nn.Linear, which uses Kaiming uniform) can produce initial reward predictions with large magnitudes. This is problematic because:

  • Large initial rewards create large gradients: The Bradley-Terry loss $-\log \sigma(r_w - r_l)$ has a gradient with respect to $r_w$ of magnitude $\sigma(r_l - r_w)$. If initial rewards are large and varied, the loss can start far from its minimum and the early updates can be destabilizing.
  • Reward magnitude affects downstream RL training: If the reward model develops a habit of producing large-magnitude rewards during training, this can destabilize the subsequent RL optimization phase (e.g., PPO or GRPO).
  • Symmetry breaking should be gentle: The score head only needs small initial asymmetries to begin differentiating between chosen and rejected completions; large initial values are unnecessary and counterproductive.
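The effect described in the first bullet can be checked with a few lines of arithmetic. The sketch below (plain Python; the function names are illustrative, not from Open Instruct) evaluates the Bradley-Terry loss and its gradient for near-zero versus large, wrongly ordered initial rewards:

```python
import math

def bt_loss(r_w, r_l):
    """Bradley-Terry pairwise loss: -log sigmoid(r_w - r_l)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

def bt_grad_rw(r_w, r_l):
    """Gradient of the loss w.r.t. r_w: -sigmoid(r_l - r_w)."""
    return -1.0 / (1.0 + math.exp(r_w - r_l))

# Near-zero initial rewards: loss near log 2, gradient magnitude near 0.5.
print(bt_loss(0.01, -0.01), bt_grad_rw(0.01, -0.01))
# Large, wrongly ordered initial rewards: loss grows roughly linearly in the gap.
print(bt_loss(-8.0, 8.0), bt_grad_rw(-8.0, 8.0))
```

With near-zero initialization every pair starts at the same well-conditioned point of the loss, so the first updates are small and uniform rather than driven by arbitrary noise in the head.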

The solution, as described in Stiennon et al. (2020), is to initialize the score head weights from a normal distribution with a standard deviation that is inversely proportional to the square root of the input dimension:

$$\sigma = \frac{1}{\sqrt{d+1}}$$

where $d$ is the hidden dimension of the transformer model. This ensures that the initial output variance is approximately $O(1)$ regardless of the hidden dimension, following the principle that each weight contributes proportionally less as the fan-in increases.
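A minimal sketch of this initialization in PyTorch (the helper name is ours; Open Instruct applies an equivalent normal_ call to the score head of a Hugging Face classification model):

```python
import math

import torch.nn as nn

def init_score_head(hidden_dim: int) -> nn.Linear:
    """Create a scalar score head with weights drawn from N(0, 1/(d+1))."""
    head = nn.Linear(hidden_dim, 1)
    std = 1.0 / math.sqrt(hidden_dim + 1)
    nn.init.normal_(head.weight, mean=0.0, std=std)
    nn.init.zeros_(head.bias)  # zero bias keeps initial rewards centered at zero
    return head

head = init_score_head(4096)
print(head.weight.std().item())  # close to 1/sqrt(4097) ~ 0.0156
```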

Usage

Use this initialization strategy whenever:

  • Creating a new reward model from a pre-trained language model backbone.
  • Adding any new linear projection head on top of a transformer whose outputs should start near zero.
  • You need to ensure that the initial model outputs have controlled variance to prevent training instabilities.
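Putting the pieces together, a reward model built this way might look like the following sketch (the RewardModel class and last-token pooling are illustrative assumptions, not the Open Instruct implementation):

```python
import math

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Pre-trained backbone plus a freshly initialized scalar score head."""

    def __init__(self, backbone: nn.Module, hidden_dim: int):
        super().__init__()
        self.backbone = backbone  # maps (batch, seq, d) inputs to (batch, seq, d)
        self.score = nn.Linear(hidden_dim, 1, bias=False)
        # Score head init per Stiennon et al. (2020): std = 1/sqrt(d + 1).
        nn.init.normal_(self.score.weight, mean=0.0,
                        std=1.0 / math.sqrt(hidden_dim + 1))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pool with the last token's hidden state, a common reward-model choice.
        last_hidden = self.backbone(hidden_states)[:, -1, :]
        return self.score(last_hidden).squeeze(-1)

# Smoke test with an identity "backbone" standing in for a transformer.
torch.manual_seed(0)
rm = RewardModel(nn.Identity(), hidden_dim=16)
rewards = rm(torch.randn(2, 5, 16))
print(rewards.shape)  # torch.Size([2])
```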

Theoretical Basis

Consider the score head as a linear projection:

$$r = W_s h + b_s$$

where $h \in \mathbb{R}^d$ is the hidden state and $W_s \in \mathbb{R}^{1 \times d}$ are the weights. If we assume the hidden-state components $h_i$ are roughly zero-mean with some variance $\sigma_h^2$, and the weights are drawn i.i.d. with zero mean and variance $\sigma_w^2$, then the variance of the output is:

$$\mathrm{Var}(r) = d\,\sigma_w^2\,\sigma_h^2$$

By setting $\sigma_w = \frac{1}{\sqrt{d+1}}$, we get:

$$\mathrm{Var}(r) = \frac{d}{d+1}\,\sigma_h^2 \approx \sigma_h^2$$

This means the initial reward predictions will have approximately the same variance as a single component of the hidden state, which is a small and well-controlled value. The $+1$ in the denominator is a minor correction that accounts for the bias term and keeps the initialization well-behaved when $d$ is small.
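The variance argument can be sanity-checked with a quick Monte-Carlo simulation (plain Python; the dimension and sample count are chosen arbitrarily):

```python
import math
import random

random.seed(0)
d = 256
sigma_h = 1.0                      # per-component hidden-state std
sigma_w = 1.0 / math.sqrt(d + 1)   # score-head weight std

rewards = []
for _ in range(4000):
    w = [random.gauss(0.0, sigma_w) for _ in range(d)]
    h = [random.gauss(0.0, sigma_h) for _ in range(d)]
    rewards.append(sum(wi * hi for wi, hi in zip(w, h)))

# Empirical Var(r) should land near d/(d+1) * sigma_h**2, i.e. roughly 1.0.
var_r = sum(r * r for r in rewards) / len(rewards)
print(var_r)
```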

In Open Instruct, the specific initialization is:

$$W_s \sim \mathcal{N}\!\left(0,\ \frac{1}{d+1}\right)$$

i.e., each weight is drawn with standard deviation $1/\sqrt{d+1}$.

This follows p. 11 of Stiennon et al. (2020), "Learning to summarize from human feedback."

Comparison with Default Initialization

Method | Initialization Scale | Initial Output Behavior (d = 4096)
Kaiming Uniform (PyTorch default) | $U(-1/\sqrt{d},\ 1/\sqrt{d})$, bound $\approx 0.0156$ (effective std $\approx 0.0090$) | Moderate, but input-scale dependent
Score Head Init (Open Instruct) | $\mathcal{N}(0,\ 1/(d+1))$, std $= 1/\sqrt{d+1} \approx 0.0156$ | Controlled, near-zero rewards
Xavier Normal | std $= \sqrt{2/(d_{\mathrm{in}} + d_{\mathrm{out}})}$ | Balanced for deep networks
Large Random Init | std $= 1.0$ | Very large, unstable early training

For typical transformer hidden dimensions (2048-8192), the Open Instruct std and the PyTorch default bound are numerically close, but the intentional choice and explicit use of normal_ initialization (rather than reliance on default behavior) makes the design decision clear and reproducible.
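The numerical closeness is easy to verify directly (straightforward arithmetic, no framework assumed):

```python
import math

for d in (2048, 4096, 8192):
    kaiming_bound = 1.0 / math.sqrt(d)       # PyTorch default uniform bound for fan_in = d
    score_head_std = 1.0 / math.sqrt(d + 1)  # Open Instruct normal std
    print(d, f"{kaiming_bound:.6f}", f"{score_head_std:.6f}")
```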

Related Pages

Implemented By

Related Principles
