Principle:OpenRLHF Sequence Regression Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Loading, Reward_Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A pattern for constructing transformer models with a scalar value head for reward scoring or value function estimation in RLHF.
Description
Sequence Regression Model Loading dynamically constructs a model class by attaching a linear value head to a pretrained transformer base model. The value head projects each token's last hidden state to a scalar reward or value. Two variants exist: RewardModel (for reward model training; extracts the reward at the EOS token position) and CriticModel (for the PPO value function; returns per-token values).
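The two variants can be illustrated with a minimal PyTorch sketch. This is a toy stand-in, not OpenRLHF's actual classes: the module name ToyRewardHead and the random hidden-state tensor are illustrative, and a real transformer base model is replaced by a precomputed hidden-state input.

```python
import torch
import torch.nn as nn

class ToyRewardHead(nn.Module):
    """Illustrative value head: projects each hidden state to a scalar."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.value_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: [batch, seq_len, hidden]; attention_mask: [batch, seq_len]
        values = self.value_head(hidden_states).squeeze(-1)  # [batch, seq_len]
        # RewardModel variant: take the value at the last non-padding (EOS) token.
        eos_indices = attention_mask.sum(dim=1) - 1          # [batch]
        reward = values.gather(1, eos_indices.unsqueeze(1)).squeeze(1)
        # CriticModel variant would return `values` (one value per token).
        return values, reward

batch, seq_len, hidden = 2, 5, 8
h = torch.randn(batch, seq_len, hidden)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
values, reward = ToyRewardHead(hidden)(h, mask)
```

The key difference between the variants is only in what `forward` returns: the full `[batch, seq_len]` tensor of per-token values (critic) or the single value gathered at the EOS position (reward model).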
The factory function get_llm_for_sequence_regression handles model class creation, LoRA injection, quantization, ZeRO-3 compatibility, and optional value head initialization with properly scaled random weights.
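The dynamic class-construction pattern can be sketched as follows. This is a simplified stand-in for illustration, not OpenRLHF's actual get_llm_for_sequence_regression: the function name, parameters, and ToyBase class here are hypothetical, and LoRA, quantization, and ZeRO-3 handling are omitted.

```python
import torch
import torch.nn as nn

def get_model_for_sequence_regression(base_cls, hidden_size,
                                      model_type="reward",
                                      init_value_head=True):
    """Simplified factory: dynamically builds a subclass of `base_cls`
    with a scalar value head attached (sketch of the pattern only)."""
    assert model_type in ("reward", "critic")

    class SequenceRegressionModel(base_cls):
        def __init__(self):
            super().__init__()
            self.value_head = nn.Linear(hidden_size, 1, bias=False)
            if init_value_head:
                # Scale the random init by hidden size for a stable start
                # (assumed InstructGPT-style 1/sqrt(d+1) scaling).
                self.value_head.weight.data.normal_(
                    mean=0.0, std=1.0 / (hidden_size + 1) ** 0.5)

        def forward(self, input_embeds, attention_mask):
            hidden = super().forward(input_embeds)        # [B, T, H]
            values = self.value_head(hidden).squeeze(-1)  # [B, T]
            if model_type == "critic":
                return values                             # per-token values
            eos_idx = attention_mask.sum(dim=1, keepdim=True) - 1
            return values.gather(1, eos_idx).squeeze(1)   # scalar reward

    return SequenceRegressionModel()

# Toy "base transformer": identity over embeddings, for demonstration only.
class ToyBase(nn.Module):
    def forward(self, x):
        return x

rm = get_model_for_sequence_regression(ToyBase, hidden_size=8, model_type="reward")
out = rm(torch.randn(2, 4, 8), torch.ones(2, 4, dtype=torch.long))
```

Building the subclass inside the factory lets one code path attach the same value head to any transformer architecture passed in as the base class, which is the essence of the pattern.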
Usage
Use this principle when loading a reward model for preference-based training, a critic model for PPO value estimation, or a reward model for batch inference scoring. It is not used for policy (generative) models.
Theoretical Basis
The reward model maps a sequence to a scalar value: $r = w_v^\top h_{\text{EOS}}$, where $h_{\text{EOS}}$ is the hidden state at the end-of-sequence token and $w_v$ is the value head weight.
For the critic model in PPO, per-token values are computed: $V_t = w_v^\top h_t$ for every token position $t$.
Value head weights are initialized from a zero-mean normal distribution whose standard deviation shrinks with the hidden size, e.g., $w_v \sim \mathcal{N}\!\left(0,\ \tfrac{1}{d_{\text{model}}+1}\right)$ (InstructGPT-style scaling), for a stable training start.
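The motivation for hidden-size-dependent scaling can be checked numerically: if the head weights have variance $1/(d+1)$ and the hidden state has unit-variance entries, the initial reward $r = w_v^\top h$ has variance $\approx d/(d+1) \approx 1$, so initial rewards stay $O(1)$ regardless of model width. The sketch below assumes a hidden size of 4096 and the $1/\sqrt{d+1}$ scale discussed above; both are illustrative choices.

```python
import torch

d = 4096                          # assumed hidden size
sigma = 1.0 / (d + 1) ** 0.5      # assumed InstructGPT-style init scale
w = torch.randn(10000, d) * sigma # many independently initialized heads
h = torch.randn(d)                # a unit-variance hidden state
rewards = w @ h                   # initial rewards r = w^T h
# Std(r) = sigma * ||h|| ~ sqrt(d / (d + 1)) ~ 1: rewards start O(1).
```

Without this scaling (e.g., a default unit-variance init), the initial reward standard deviation would grow like $\sqrt{d}$, producing large, noisy value targets at the start of training.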
Related Pages
Implemented By