
Heuristic:Allenai Open instruct Polyak Reference Update

From Leeroopedia




Knowledge Sources
Domains Reinforcement_Learning, Optimization
Last Updated 2026-02-07 00:00 GMT

Overview

Use Polyak averaging with alpha=0.6 to update the reference policy, blending 60% of the current policy with 40% of the old reference.

Description

Rather than keeping a fixed reference policy throughout GRPO training, Open Instruct supports periodic Polyak (exponential moving average) updates to the reference policy. The update formula is: `ref_param = alpha * param + (1 - alpha) * ref_param`. With alpha=0.6, each update moves the reference 60% of the way toward the current policy, so it tracks the policy far more closely than EMA settings that give the new value only a small weight. This keeps the KL penalty informative even as the policy improves significantly.
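The update formula can be illustrated with a minimal, dependency-free sketch. The function name `polyak_update` and the dict-of-lists parameter layout are hypothetical stand-ins chosen for illustration; they are not taken from Open Instruct, which operates on model tensors.

```python
from typing import Dict, List


def polyak_update(
    ref_params: Dict[str, List[float]],
    policy_params: Dict[str, List[float]],
    alpha: float = 0.6,
) -> None:
    """Update ref_params in place: ref = alpha * param + (1 - alpha) * ref."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    for name, ref_values in ref_params.items():
        policy_values = policy_params[name]
        for i, (p, r) in enumerate(zip(policy_values, ref_values)):
            ref_values[i] = alpha * p + (1.0 - alpha) * r


# Toy parameters (hypothetical): one "layer" with two weights.
policy = {"w": [1.0, 2.0]}
reference = {"w": [0.0, 0.0]}

polyak_update(reference, policy, alpha=0.6)
print(reference)  # each reference weight moved 60% of the way to the policy
```

Setting `alpha=1.0` in this sketch copies the policy into the reference, and `alpha=0.0` leaves the reference unchanged, which brackets the two extremes of the trade-off.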

Usage

Apply this heuristic when reference policy updates are enabled in GRPO (that is, when `ref_policy_update_freq` is set). It does not apply when the reference is fixed, which is the default. The technique is borrowed from TR-DPO and from TD3's target network updates.
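As a rough illustration of how a periodic update differs from a fixed reference, the sketch below uses a toy scalar "policy" that drifts by 1.0 per step. The function `run_training`, the modulo-based trigger, and the use of `None` to mean "fixed reference" are assumptions made for this example; the actual scheduling logic in Open Instruct is not shown in the source.

```python
from typing import List, Optional


def run_training(
    num_steps: int,
    alpha: float = 0.6,
    ref_policy_update_freq: Optional[int] = None,
) -> List[float]:
    """Return the reference value after each step of a toy training loop."""
    policy, reference = 0.0, 0.0
    history: List[float] = []
    for step in range(1, num_steps + 1):
        policy += 1.0  # stand-in for one optimizer step on the policy
        # Assumed trigger: blend every `ref_policy_update_freq` steps, if set.
        if ref_policy_update_freq is not None and step % ref_policy_update_freq == 0:
            reference = alpha * policy + (1.0 - alpha) * reference
        history.append(reference)
    return history


fixed = run_training(4)                              # reference never moves
moving = run_training(4, ref_policy_update_freq=2)   # blended at steps 2 and 4
print(fixed)
print(moving)
```

In the `moving` run the reference stays put between updates and then jumps toward the policy, so the gap that the KL penalty measures is periodically reduced rather than growing without bound.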

The Insight (Rule of Thumb)

  • Action: Set `alpha = 0.6` for reference policy Polyak updates.
  • Value: 0.6 (60% current policy, 40% old reference).
  • Trade-off: Higher alpha = faster reference tracking = weaker KL constraint. Lower alpha = slower tracking = more conservative training.
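The trade-off can be made concrete by unrolling the update. In the notation below (introduced here for illustration), \phi_n is the reference after the n-th update and \theta_k is the policy at the k-th update:

```latex
\phi_n = \alpha\,\theta_n + (1-\alpha)\,\phi_{n-1}
\quad\Longrightarrow\quad
\phi_n = (1-\alpha)^n\,\phi_0 + \alpha \sum_{k=1}^{n} (1-\alpha)^{\,n-k}\,\theta_k
```

With \alpha = 0.6, the weight remaining on the initial reference \phi_0 is 0.4 after one update, 0.16 after two, and 0.064 after three, so the original reference is largely forgotten within a few updates.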

Reasoning

A fixed reference policy becomes increasingly irrelevant as the policy improves, making the KL penalty either too strong (preventing useful learning) or meaningless (too far from the current policy to provide signal). Polyak averaging maintains a "moving reference" that stays relevant. The alpha=0.6 default is aggressive compared to TD3's typical 0.005. The two values are not directly comparable, however: TD3 applies its small soft update continually throughout training, whereas here the blend happens only once every `ref_policy_update_freq` steps, and language model policies tend to change more slowly per step.
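To give a feel for how different 0.6 and 0.005 are, the sketch below counts how many updates are needed to close half of the gap between reference and policy, under the simplifying assumption that the policy is frozen while the reference catches up. The helper `updates_to_close_gap` is hypothetical and counts updates, not training steps.

```python
import math


def updates_to_close_gap(alpha: float, fraction: float = 0.5) -> int:
    """Smallest n such that (1 - alpha)**n <= 1 - fraction (frozen policy)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    remaining = 1.0 - fraction
    return math.ceil(math.log(remaining) / math.log(1.0 - alpha))


for alpha in (0.6, 0.005):
    print(alpha, updates_to_close_gap(alpha))
```

A single update at alpha=0.6 already closes more than half the gap, while at 0.005 it takes well over a hundred updates; how that translates into training steps depends on how often each method applies its update.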

Code Evidence

Configuration from `open_instruct/grpo_utils.py:88-92`:

```python
alpha: float = 0.6
"""The alpha value for doing polyak updates (ref_param = alpha * param + (1 - alpha) * ref_param)
reference: [TR-DPO](https://huggingface.co/papers/2404.09656), but it's actually pretty commonly
used. E.g., [TD3](https://arxiv.org/abs/1802.09477)"""
```
