
Principle: OpenRLHF Sequence Regression Model Loading

From Leeroopedia


Knowledge Sources
Domains NLP, Model_Loading, Reward_Modeling
Last Updated 2026-02-07 00:00 GMT

Overview

A pattern for constructing transformer models with a scalar value head for reward scoring or value function estimation in RLHF.

Description

Sequence Regression Model Loading dynamically constructs a model class by attaching a linear value head to a pretrained transformer base model. The value head projects the last hidden state to a scalar reward or value. Two variants exist: RewardModel (for reward model training, extracts reward at EOS token position) and CriticModel (for PPO value function, returns per-token values).
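The two extraction strategies can be sketched with a standalone value head. This is a minimal illustration, not OpenRLHF's actual classes: the `ValueHead` module and its forward signature are hypothetical, and a random tensor stands in for the base transformer's hidden states.

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Linear head projecting each hidden state to a scalar (illustrative)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, attention_mask):
        # (batch, seq_len, hidden) -> (batch, seq_len) per-token scalars
        values = self.value_head(hidden_states).squeeze(-1)
        # CriticModel-style output: the full per-token value sequence
        per_token = values
        # RewardModel-style output: the value at the last non-padded
        # (EOS) position of each sequence
        eos_idx = attention_mask.long().sum(dim=1) - 1
        reward = values.gather(1, eos_idx.unsqueeze(1)).squeeze(1)
        return per_token, reward

d = 8
head = ValueHead(d)
hidden = torch.randn(2, 5, d)          # stand-in for transformer output
mask = torch.tensor([[1, 1, 1, 0, 0],  # sequence 1 ends at position 2
                     [1, 1, 1, 1, 1]]) # sequence 2 ends at position 4
per_token, reward = head(hidden, mask)
```

The reward for each sequence is simply the per-token value read off at its EOS position, which is why the same head serves both variants.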

The factory function get_llm_for_sequence_regression handles model class creation, LoRA injection, quantization, ZeRO-3 compatibility, and optional value head initialization with proper random scaling.
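The dynamic class-construction pattern can be sketched as follows. This is a simplified, hypothetical analogue of the real factory: the name `get_model_for_sequence_regression`, the `DummyBase` backbone, and the `model_type` argument are illustrative stand-ins, and LoRA, quantization, and ZeRO-3 handling are omitted.

```python
import torch
import torch.nn as nn

def get_model_for_sequence_regression(base_cls, hidden_size: int, model_type: str):
    """Dynamically subclass a base model and attach a scalar value head.

    model_type: "reward" returns one scalar per sequence (at EOS);
    "critic" returns one value per token.
    """
    class SequenceRegressionModel(base_cls):
        def __init__(self):
            super().__init__()
            self.value_head = nn.Linear(hidden_size, 1, bias=False)

        def forward(self, input_ids, attention_mask):
            hidden = super().forward(input_ids, attention_mask)
            values = self.value_head(hidden).squeeze(-1)
            if model_type == "reward":
                eos = attention_mask.long().sum(dim=1) - 1
                return values.gather(1, eos.unsqueeze(1)).squeeze(1)
            return values  # critic: per-token values

    return SequenceRegressionModel

# Toy backbone standing in for a pretrained transformer.
class DummyBase(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(100, 8)

    def forward(self, input_ids, attention_mask):
        return self.embed(input_ids)  # (batch, seq_len, hidden)

RewardModel = get_model_for_sequence_regression(DummyBase, 8, "reward")
reward = RewardModel()(torch.randint(0, 100, (2, 4)), torch.ones(2, 4))

CriticModel = get_model_for_sequence_regression(DummyBase, 8, "critic")
values = CriticModel()(torch.randint(0, 100, (2, 4)), torch.ones(2, 4))
```

Building the class inside a function (rather than defining it at module scope) lets one code path serve any transformer architecture passed in as the base class.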

Usage

Use this principle when loading a reward model for preference-based training, a critic model for PPO value estimation, or a reward model for batch inference scoring. Not used for policy (generative) models.

Theoretical Basis

The reward model maps a sequence to a scalar value: $r(x) = W_v h_{\mathrm{EOS}}(x)$, where $h_{\mathrm{EOS}}$ is the hidden state at the end-of-sequence token and $W_v \in \mathbb{R}^{1 \times d}$ is the value head weight.

For the critic model in PPO, per-token values are computed: $V(x_t) = W_v h_t(x)$.

Value head initialization uses $W_v \sim \mathcal{N}\!\left(0, \tfrac{1}{d+1}\right)$ for a stable training start.
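The initialization above (variance $\tfrac{1}{d+1}$, i.e. standard deviation $\tfrac{1}{\sqrt{d+1}}$) is one line in PyTorch. A minimal sketch; the hidden size of 768 is an arbitrary example value.

```python
import math
import torch
import torch.nn as nn

d = 768  # example hidden size
value_head = nn.Linear(d, 1, bias=False)
# W_v ~ N(0, 1/(d+1)): pass std = 1/sqrt(d+1) to the normal initializer,
# keeping initial reward magnitudes small regardless of hidden size.
nn.init.normal_(value_head.weight, mean=0.0, std=1.0 / math.sqrt(d + 1))
```

Scaling the variance with $d$ keeps the initial scalar output roughly unit-variance for unit-variance hidden states, so early reward/value estimates do not dominate the loss.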

Related Pages

Implemented By


Uses Heuristic
