Heuristic:Allenai Open instruct Polyak Reference Update
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Use Polyak averaging with alpha=0.6 to update the reference policy, blending 60% of the current policy with 40% of the old reference.
Description
Rather than keeping a fixed reference policy throughout GRPO training, Open Instruct supports periodic Polyak (exponential moving average) updates to the reference policy. The update formula is: `ref_param = alpha * param + (1 - alpha) * ref_param`. With alpha=0.6, the reference tracks the current policy more closely than typical EMA settings, allowing the KL penalty to remain informative even as the policy improves significantly.
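As a minimal sketch of the update formula above, the snippet below applies it to plain Python lists standing in for model weights. The function name `polyak_update` and the dictionary layout are illustrative assumptions, not Open Instruct's actual implementation, which operates on model parameters.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical stand-in for named model weights


def polyak_update(ref_params: Params, params: Params, alpha: float = 0.6) -> Params:
    """Blend each value: ref_param = alpha * param + (1 - alpha) * ref_param."""
    return {
        name: [alpha * p + (1.0 - alpha) * r for p, r in zip(params[name], ref_values)]
        for name, ref_values in ref_params.items()
    }


policy = {"w": [1.0, 2.0]}
reference = {"w": [0.0, 0.0]}

# One update moves the reference 60% of the way toward the current policy.
reference = polyak_update(reference, policy)
print(reference)
```

In a real training setup the same blend would be applied tensor by tensor, typically in place and without gradient tracking.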
Usage
Apply this heuristic when reference policy updates are enabled in GRPO (that is, when `ref_policy_update_freq` is set). It does not apply when the reference is fixed, which is the default. The technique is borrowed from TR-DPO and from TD3's target network updates.
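The gating on `ref_policy_update_freq` might look like the following sketch. Only the name `ref_policy_update_freq` comes from the text. The helpers `should_update_reference` and `update_steps`, the 1-based step counter, and the modulo condition are assumptions for illustration and may differ from the actual training loop.

```python
from typing import List, Optional


def should_update_reference(step: int, ref_policy_update_freq: Optional[int]) -> bool:
    """Hypothetical gate: update every `ref_policy_update_freq` steps, never if unset."""
    if ref_policy_update_freq is None:  # fixed reference (the default)
        return False
    return step % ref_policy_update_freq == 0


def update_steps(num_steps: int, ref_policy_update_freq: Optional[int]) -> List[int]:
    """List the 1-based training steps at which the reference would be refreshed."""
    return [
        step
        for step in range(1, num_steps + 1)
        if should_update_reference(step, ref_policy_update_freq)
    ]


fixed_reference_steps = update_steps(10, None)
periodic_steps = update_steps(10, 3)
print(fixed_reference_steps, periodic_steps)
```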
The Insight (Rule of Thumb)
- Action: Set `alpha = 0.6` for reference policy Polyak updates.
- Value: 0.6 (60% current policy, 40% old reference).
- Trade-off: Higher alpha = faster reference tracking = weaker KL constraint. Lower alpha = slower tracking = more conservative training.
Reasoning
A fixed reference policy becomes increasingly irrelevant as the policy improves: the KL penalty ends up either too strong (preventing useful learning) or meaningless (the reference is too far from the current policy to provide signal). Polyak averaging maintains a "moving reference" that stays relevant. The alpha=0.6 default is aggressive compared to TD3's typical 0.005, reflecting the different dynamics of language model RL, where policies change more slowly per step and the reference is refreshed only every `ref_policy_update_freq` steps.
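To make the comparison with TD3 concrete: each update keeps a fraction `(1 - alpha)` of the previous reference, so after `k` updates the original reference contributes a weight of `(1 - alpha) ** k`. The sketch below counts how many updates it takes for that weight to fall below a threshold. The helper name and the 1% threshold are arbitrary choices for illustration.

```python
def updates_until_forgotten(alpha: float, threshold: float = 0.01) -> int:
    """Count Polyak updates until the original reference's weight drops below `threshold`."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    weight = 1.0  # share of the original reference still present
    updates = 0
    while weight >= threshold:
        weight *= 1.0 - alpha  # each update keeps (1 - alpha) of the old reference
        updates += 1
    return updates


open_instruct_default = updates_until_forgotten(0.6)
td3_typical = updates_until_forgotten(0.005)
print(open_instruct_default, td3_typical)
```

Under these assumptions, alpha=0.6 needs 6 updates to push the original reference below 1% weight, while 0.005 needs 919. This is the same trade-off as in the rule of thumb: a higher alpha forgets the old reference faster and so constrains the policy less.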
Code Evidence
Configuration from `open_instruct/grpo_utils.py:88-92`:
    alpha: float = 0.6
    """The alpha value for doing polyak updates (ref_param = alpha * param + (1 - alpha) * ref_param)
    reference: [TR-DPO](https://huggingface.co/papers/2404.09656), but it's actually pretty commonly
    used. E.g., [TD3](https://arxiv.org/abs/1802.09477)"""