
Principle:Allenai Open instruct Reward Model Initialization

From Leeroopedia


Knowledge Sources
Domains Reinforcement Learning from Human Feedback, Reward Modeling, Natural Language Processing
Last Updated 2026-02-07 00:00 GMT

Overview

Reward model initialization is the process of converting a pre-trained or supervised fine-tuned language model into a scalar-valued reward function by replacing the language modeling head with a single-output score head, enabling the model to predict human preference scores for generated text.

Description

In Reinforcement Learning from Human Feedback (RLHF), a reward model is needed to provide a scalar signal indicating how well a given text completion aligns with human preferences. Rather than training a reward model from scratch, the standard approach is to initialize the reward model from a pre-trained language model (typically one that has undergone supervised fine-tuning on instruction-following data). This leverages the rich language understanding already encoded in the pre-trained weights.

The key architectural modification is replacing the causal language modeling head (which predicts token probabilities over the vocabulary) with a sequence classification head that outputs a single scalar value. This is accomplished by using a sequence classification architecture with num_labels=1, which adds a linear projection layer on top of the transformer's last hidden state. The projection maps from the model's hidden dimension to a single real number, which serves as the reward score.
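The head replacement described above can be sketched with a sequence classification model. This is an illustrative example using a tiny, randomly initialized GPT-2 configuration so it runs without downloading weights; in practice you would load a real pre-trained or SFT checkpoint with `from_pretrained("<your-checkpoint>", num_labels=1)`.

```python
import torch
from transformers import GPT2Config, GPT2ForSequenceClassification

# Tiny stand-in config for illustration only; a real reward model would be
# initialized from a pre-trained/SFT checkpoint instead.
config = GPT2Config(n_embd=32, n_layer=2, n_head=2, vocab_size=100, num_labels=1)
model = GPT2ForSequenceClassification(config)

# With num_labels=1, the vocabulary-sized LM head is replaced by a score head:
# a Linear(hidden_dim -> 1) applied at the final non-padding token position.
reward = model(input_ids=torch.tensor([[1, 2, 3]])).logits
print(reward.shape)  # torch.Size([1, 1]) — one scalar reward per sequence
```

Note that the "logits" output here is not a class distribution but the raw scalar reward for each input sequence.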

This initialization strategy offers several advantages:

  • Transfer of language understanding: The transformer backbone retains its ability to understand syntax, semantics, and context from pre-training, so the reward model only needs to learn the preference mapping rather than language understanding from scratch.
  • Efficient training: Starting from a strong language model checkpoint means the reward model converges faster and requires significantly less preference data compared to training from random initialization.
  • Alignment with the policy model: When the reward model is initialized from the same checkpoint as the policy being optimized, the reward model's internal representations are well-matched to the policy's output distribution, leading to more stable RLHF training.

An important additional step during initialization is disabling dropout throughout the model. As described in Stiennon et al. (2020), removing dropout improves the stability and consistency of reward predictions during both training and inference. Since the reward model is evaluated deterministically (no sampling), stochastic dropout is unnecessary and can introduce unwanted noise into reward estimates.
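Disabling dropout can be done generically by walking the module tree and zeroing every dropout probability. The helper below is a minimal sketch (the function name `disable_dropout` is illustrative, not a library API):

```python
import torch.nn as nn

def disable_dropout(model: nn.Module) -> None:
    """Set every nn.Dropout probability to 0 so reward predictions
    are deterministic (cf. Stiennon et al., 2020)."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = 0.0
```

This mutates the model in place and is safe to call once, right after loading the checkpoint and before any training or inference.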

Usage

Use reward model initialization when:

  • Building a reward model for RLHF or preference-based training pipelines.
  • You have access to a pre-trained language model (base or SFT checkpoint) and want to repurpose it as a preference scorer.
  • You need a model that outputs scalar reward values for ranking or comparing text completions.
  • You want to leverage existing language understanding rather than training a reward function from scratch.

Theoretical Basis

The reward model is defined as a parameterized function r_θ : 𝒳 × 𝒴 → ℝ that maps a prompt-completion pair to a scalar reward. The architecture is:

r_θ(x, y) = W_s · h_L(x, y) + b_s

where:

  • h_L(x, y) ∈ ℝ^d is the last hidden state of the transformer at the final non-padding token position,
  • W_s ∈ ℝ^(1×d) is the score head weight matrix,
  • b_s ∈ ℝ is the score head bias (when present),
  • d is the hidden dimension of the transformer model.

The transformer backbone parameters are initialized from the pre-trained checkpoint:

θ_backbone ← θ_pretrained

Only the score head (Ws,bs) is randomly initialized (typically with a controlled small standard deviation; see Principle:Allenai_Open_instruct_Score_Head_Initialization).

The choice of num_labels=1 in the sequence classification setup means the model outputs a single unbounded scalar rather than class logits. This unbounded scalar is essential because the Bradley-Terry preference model operates on reward differences, and constraining the output range could limit the model's capacity to distinguish between responses of varying quality.
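The difference-based property can be made concrete with a small numeric sketch of the Bradley-Terry preference probability (the function name is illustrative):

```python
import math

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry: P(chosen preferred over rejected)
    = sigmoid(r_chosen - r_rejected)."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

# Equal rewards -> indifference
print(preference_probability(1.0, 1.0))  # 0.5

# Only the difference matters: shifting both rewards by a constant
# leaves the preference probability unchanged.
print(preference_probability(5.0, 3.0) == preference_probability(2.0, 0.0))  # True
```

Because only reward differences enter the sigmoid, bounding the scalar output (e.g. with a tanh head) would cap the achievable difference and compress the model's ability to separate responses of very different quality.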

Pseudocode

1. Load pre-trained transformer model with a sequence classification head (num_labels=1)
2. Optionally resize token embeddings to match the tokenizer vocabulary
3. Disable all dropout layers in the model (set p=0)
4. Initialize the score head weights with a small standard deviation
5. Enable gradient checkpointing if memory-constrained
6. The model is now ready for reward model training on preference data
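The six steps above can be sketched end to end. This uses a tiny, randomly initialized GPT-2 as a stand-in for a real SFT checkpoint so it runs offline; the score-head standard deviation 1/√(d+1) is the small-std choice referenced by the related score-head-initialization principle, assumed here for illustration.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2ForSequenceClassification

# Step 1: sequence classification head with num_labels=1.
# (In practice: from_pretrained("<your-sft-checkpoint>", num_labels=1).)
config = GPT2Config(n_embd=32, n_layer=2, n_head=2, vocab_size=100, num_labels=1)
model = GPT2ForSequenceClassification(config)

# Step 2: resize embeddings, e.g. after adding special tokens (104 is illustrative).
model.resize_token_embeddings(104)

# Step 3: disable all dropout layers.
for module in model.modules():
    if isinstance(module, nn.Dropout):
        module.p = 0.0

# Step 4: re-initialize the score head with a small standard deviation.
with torch.no_grad():
    model.score.weight.normal_(mean=0.0, std=1.0 / (config.n_embd + 1) ** 0.5)

# Step 5: enable gradient checkpointing if memory-constrained.
model.gradient_checkpointing_enable()

# Step 6: the model now emits one scalar reward per sequence.
reward = model(input_ids=torch.tensor([[1, 2, 3]])).logits
```

From here the model can be trained on preference pairs with a Bradley-Terry loss over reward differences.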

Related Pages

Implemented By

Related Principles
