Heuristic: OpenRLHF Value Head ZeRO-3 Init Tip
| Knowledge Sources | |
|---|---|
| Domains | Debugging, Distributed_Training, LLMs |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
Manually initialize reward model value heads with `std = 1/(hidden_size+1)` because DeepSpeed ZeRO-3 skips custom layer initialization.
Description
When training reward or critic models with DeepSpeed ZeRO Stage 3, the `deepspeed.zero.Init()` context manager does not automatically initialize custom layers added on top of the base transformer (such as the linear value head used for reward/critic scoring). If left uninitialized, these layers contain garbage values, leading to training instability or NaN losses. OpenRLHF addresses this by explicitly initializing the value head using a normal distribution with `mean=0.0` and `std=1/(hidden_size+1)`, wrapped in a `GatheredParameters` context on rank 0.
Usage
Apply this heuristic when training reward models or critic models with ZeRO Stage 3. Pass `init_value_head=True` during model construction; the `get_llm_for_sequence_regression` factory function then performs the manual initialization automatically.
The Insight (Rule of Thumb)
- Action: Set `init_value_head=True` when constructing reward/critic models with ZeRO-3.
- Value: Initialize with `weight.data.normal_(mean=0.0, std=1/(config.hidden_size+1))`.
- Trade-off: None. This is a correctness fix, not an optimization trade-off.
- Note: The value head prefix (default "score") must be saved to the model config for proper loading during inference.
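As a sanity check on the rule above, the initialization can be simulated with the Python standard library alone: drawing weights from a normal distribution with `std = 1/(hidden_size+1)` yields an empirical standard deviation close to the target. This is a minimal stdlib sketch, not OpenRLHF code; `hidden_size=4096` is an illustrative value, not tied to any particular model.

```python
import random
import statistics

def init_value_head_weights(hidden_size: int, n: int, seed: int = 0) -> list[float]:
    """Draw n weights from N(0, 1/(hidden_size+1)), mirroring the init rule."""
    rng = random.Random(seed)
    std = 1 / (hidden_size + 1)
    return [rng.gauss(0.0, std) for _ in range(n)]

hidden_size = 4096
weights = init_value_head_weights(hidden_size, n=hidden_size)
target_std = 1 / (hidden_size + 1)        # ~0.000244 for hidden_size=4096
sample_std = statistics.pstdev(weights)   # empirical std of the drawn weights

print(f"target std: {target_std:.6f}, sample std: {sample_std:.6f}")
```

With `hidden_size` samples the empirical std lands within a few percent of the target, which is the behavior `weight.data.normal_` provides in the real code path.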
Reasoning
DeepSpeed ZeRO-3 partitions parameters across GPUs during initialization. Custom layers (like `nn.Linear(hidden_size, 1, bias=False)`) added after model construction via `setattr` are not visible to ZeRO's automatic initialization. Without explicit initialization, the value head contains random memory, causing reward model training to produce incorrect gradients. The `GatheredParameters` context ensures all shards of the parameter are available on rank 0 for initialization, then re-partitions them.
Code evidence from `openrlhf/models/model.py:154-165`:
```python
# NOTE: For reward model training only, intialize value_head manually
# because deepspeed.zero.Init() will not intialize them.
# TODO: Find a better way to clarify reward model training.
if init_value_head:
    value_head = getattr(model, value_head_prefix)
    if dschf is not None:
        logger.info("initialize value_head for ZeRO-3 reward model training.")
        with deepspeed.zero.GatheredParameters([value_head.weight], modifier_rank=0):
            if torch.distributed.get_rank() == 0:
                value_head.weight.data.normal_(mean=0.0, std=1 / (config.hidden_size + 1))
    else:
        value_head.weight.data.normal_(mean=0.0, std=1 / (config.hidden_size + 1))
```
Value head creation from `openrlhf/models/model.py:178-179`:
```python
self.value_head_prefix = value_head_prefix
setattr(self, value_head_prefix, nn.Linear(config.hidden_size, 1, bias=False))
```
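For intuition, this value head is just a bias-free linear map from the final hidden state to a single scalar score, so its forward pass reduces to one dot product. The sketch below is pure Python with toy numbers (hidden size 4), not OpenRLHF code:

```python
def value_head_forward(hidden_state: list[float], weight: list[float]) -> float:
    """nn.Linear(hidden_size, 1, bias=False) applied to one hidden state:
    a single dot product producing the scalar reward/value score."""
    assert len(hidden_state) == len(weight)
    return sum(h * w for h, w in zip(hidden_state, weight))

# Toy example with hidden_size = 4.
hidden_state = [0.5, -1.0, 0.25, 2.0]
weight = [0.1, 0.2, -0.4, 0.05]   # the value head's single weight row
score = value_head_forward(hidden_state, weight)
print(score)  # ≈ -0.15
```

If this weight row is left uninitialized garbage, every reward score is garbage too, which is exactly why the manual `normal_` initialization above matters.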