
Principle: OpenRLHF Sequence Regression Model Loading

From Leeroopedia


Knowledge Sources
Domains NLP, Model_Loading, Reward_Modeling
Last Updated 2026-02-07 00:00 GMT

Overview

A pattern for constructing transformer models with a scalar value head for reward scoring or value function estimation in RLHF.

Description

Sequence Regression Model Loading dynamically constructs a model class by attaching a linear value head to a pretrained transformer base model. The value head projects the last hidden state to a scalar reward or value. Two variants exist: RewardModel (for reward model training, extracts reward at EOS token position) and CriticModel (for PPO value function, returns per-token values).
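The two extraction strategies can be sketched with a standalone value head. This is a minimal illustration, not OpenRLHF's actual classes: the `ValueHead` module and its forward signature are hypothetical, and a random tensor stands in for the base transformer's hidden states.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Linear head projecting each hidden state to a scalar (illustrative)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, attention_mask):
        # (batch, seq_len, hidden) -> (batch, seq_len) per-token scalars
        values = self.value_head(hidden_states).squeeze(-1)
        # CriticModel-style output: the full per-token value sequence
        per_token = values
        # RewardModel-style output: the value at the last non-padded
        # (EOS) position of each sequence
        eos_idx = attention_mask.long().sum(dim=1) - 1
        reward = values.gather(1, eos_idx.unsqueeze(1)).squeeze(1)
        return per_token, reward

d = 8
head = ValueHead(d)
hidden = torch.randn(2, 5, d)          # stand-in for transformer output
mask = torch.tensor([[1, 1, 1, 0, 0],  # sequence 1 ends at position 2
                     [1, 1, 1, 1, 1]]) # sequence 2 ends at position 4
per_token, reward = head(hidden, mask)
```

The reward for each sequence is simply the per-token value read off at its EOS position, which is why the same head serves both variants.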

The factory function get_llm_for_sequence_regression handles model class creation, LoRA injection, quantization, ZeRO-3 compatibility, and optional value head initialization with proper random scaling.
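The dynamic class-construction pattern can be sketched as follows. This is a simplified, hypothetical analogue of the real factory: the name `get_model_for_sequence_regression`, the `DummyBase` backbone, and the `model_type` argument are illustrative stand-ins, and LoRA, quantization, and ZeRO-3 handling are omitted.

```python
import torch
import torch.nn as nn

def get_model_for_sequence_regression(base_cls, hidden_size: int, model_type: str):
    """Dynamically subclass a base model and attach a scalar value head.

    model_type: "reward" returns one scalar per sequence (at EOS);
    "critic" returns one value per token.
    """
    class SequenceRegressionModel(base_cls):
        def __init__(self):
            super().__init__()
            self.value_head = nn.Linear(hidden_size, 1, bias=False)

        def forward(self, input_ids, attention_mask):
            hidden = super().forward(input_ids, attention_mask)
            values = self.value_head(hidden).squeeze(-1)
            if model_type == "reward":
                eos = attention_mask.long().sum(dim=1) - 1
                return values.gather(1, eos.unsqueeze(1)).squeeze(1)
            return values  # critic: per-token values

    return SequenceRegressionModel

# Toy backbone standing in for a pretrained transformer.
class DummyBase(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 8)

    def forward(self, input_ids, attention_mask):
        return self.embed(input_ids)  # (batch, seq_len, hidden)

RewardModel = get_model_for_sequence_regression(DummyBase, 8, "reward")
reward = RewardModel()(torch.randint(0, 100, (2, 4)), torch.ones(2, 4))

CriticModel = get_model_for_sequence_regression(DummyBase, 8, "critic")
values = CriticModel()(torch.randint(0, 100, (2, 4)), torch.ones(2, 4))
```

Building the class inside a function (rather than defining it at module scope) lets one code path serve any transformer architecture passed in as the base class.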

Usage

Use this principle when loading a reward model for preference-based training, a critic model for PPO value estimation, or a reward model for batch inference scoring. Not used for policy (generative) models.

Theoretical Basis

The reward model maps a sequence to a scalar value: $r(x) = W_v h_{\mathrm{EOS}}(x)$, where $h_{\mathrm{EOS}}$ is the hidden state at the end-of-sequence token and $W_v \in \mathbb{R}^{1 \times d}$ is the value head weight.

For the critic model in PPO, per-token values are computed: $V(x_t) = W_v h_t(x)$.

Value head initialization uses $W_v \sim \mathcal{N}\!\left(0, \tfrac{1}{d+1}\right)$ for a stable training start.
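The initialization above (variance $\tfrac{1}{d+1}$, i.e. standard deviation $\tfrac{1}{\sqrt{d+1}}$) is one line in PyTorch. A minimal sketch; the hidden size of 768 is an arbitrary example value.

```python
import math
import torch
import torch.nn as nn

d = 768  # example hidden size
value_head = nn.Linear(d, 1, bias=False)
# W_v ~ N(0, 1/(d+1)): pass std = 1/sqrt(d+1) to the normal initializer,
# keeping initial reward magnitudes small regardless of hidden size.
nn.init.normal_(value_head.weight, mean=0.0, std=1.0 / math.sqrt(d + 1))
```

Scaling the variance with $d$ keeps the initial scalar output roughly unit-variance for unit-variance hidden states, so early reward/value estimates do not dominate the loss.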

Related Pages

Implemented By


Uses Heuristic
