
Heuristic:OpenRLHF Value Head ZeRO-3 Init Tip

From Leeroopedia




Knowledge Sources
Domains Debugging, Distributed_Training, LLMs
Last Updated 2026-02-07 10:00 GMT

Overview

Manually initialize reward model value heads with `std = 1/(hidden_size+1)` because DeepSpeed ZeRO-3 skips custom layer initialization.

Description

When training reward or critic models with DeepSpeed ZeRO Stage 3, the `deepspeed.zero.Init()` context manager does not automatically initialize custom layers added on top of the base transformer (such as the linear value head used for reward/critic scoring). If left uninitialized, these layers contain garbage values, leading to training instability or NaN losses. OpenRLHF addresses this by explicitly initializing the value head using a normal distribution with `mean=0.0` and `std=1/(hidden_size+1)`, wrapped in a `GatheredParameters` context on rank 0.
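The initialization itself can be illustrated without DeepSpeed. The following is a minimal, framework-free sketch (assuming only NumPy; the function name `init_value_head_weight` is hypothetical) that mirrors the effect of `value_head.weight.data.normal_(mean=0.0, std=1/(config.hidden_size+1))` on a `(1, hidden_size)` weight:

```python
import numpy as np

def init_value_head_weight(hidden_size: int, seed: int = 0) -> np.ndarray:
    """Draw a (1, hidden_size) weight from N(0, std^2) with
    std = 1/(hidden_size + 1), mirroring OpenRLHF's manual init."""
    rng = np.random.default_rng(seed)
    std = 1.0 / (hidden_size + 1)
    return rng.normal(loc=0.0, scale=std, size=(1, hidden_size))

# For hidden_size=4096, std = 1/4097 ~ 2.44e-4, so every weight is tiny
# and the initial reward/critic output is near zero for normalized inputs.
w = init_value_head_weight(hidden_size=4096)
```

Under ZeRO-3 the same draw happens inside a `GatheredParameters` context on rank 0, as shown in the code evidence below; the statistical effect is identical.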

Usage

Apply this heuristic when training reward models or critic models with ZeRO Stage 3. The `init_value_head=True` flag must be set during model construction. This is handled automatically by the `get_llm_for_sequence_regression` factory function.

The Insight (Rule of Thumb)

  • Action: Set `init_value_head=True` when constructing reward/critic models with ZeRO-3.
  • Value: Initialize with `weight.data.normal_(mean=0.0, std=1/(config.hidden_size+1))`.
  • Trade-off: None. This is a correctness fix, not an optimization trade-off.
  • Note: The value head prefix (default "score") must be saved to the model config for proper loading during inference.
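To make the rule of thumb concrete, the std shrinks inversely with model width. A quick arithmetic check (assuming nothing beyond the formula above) for a few common hidden sizes:

```python
# std = 1/(hidden_size + 1) for a few common transformer widths
for hidden_size in (768, 2048, 4096):
    std = 1 / (hidden_size + 1)
    print(f"hidden_size={hidden_size}: std={std:.6f}")
# hidden_size=768: std=0.001300
# hidden_size=2048: std=0.000488
# hidden_size=4096: std=0.000244
```

The small scale keeps initial reward predictions close to zero, avoiding large early gradients through the value head.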

Reasoning

DeepSpeed ZeRO-3 partitions parameters across GPUs during initialization. Custom layers (like `nn.Linear(hidden_size, 1, bias=False)`) added after model construction via `setattr` are not visible to ZeRO's automatic initialization. Without explicit initialization, the value head contains random memory, causing reward model training to produce incorrect gradients. The `GatheredParameters` context ensures all shards of the parameter are available on rank 0 for initialization, then re-partitions them.

Code evidence from `openrlhf/models/model.py:154-165`:

# NOTE: For reward model training only, initialize value_head manually
# because deepspeed.zero.Init() will not initialize them.
# TODO: Find a better way to clarify reward model training.
if init_value_head:
    value_head = getattr(model, value_head_prefix)
    if dschf is not None:
        logger.info("initialize value_head for ZeRO-3 reward model training.")
        with deepspeed.zero.GatheredParameters([value_head.weight], modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                value_head.weight.data.normal_(mean=0.0, std=1 / (config.hidden_size + 1))
    else:
        value_head.weight.data.normal_(mean=0.0, std=1 / (config.hidden_size + 1))

Value head creation from `openrlhf/models/model.py:178-179`:

self.value_head_prefix = value_head_prefix
setattr(self, value_head_prefix, nn.Linear(config.hidden_size, 1, bias=False))
