Heuristic: Disable Dropout in RL (allenai/open-instruct)
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Disable all dropout layers (set p=0) during reinforcement learning training to prevent stochastic forward-pass noise from corrupting the reward signal and advantage estimates.
Description
In standard supervised learning, dropout acts as a regularizer. In on-policy RL (e.g., GRPO), however, dropout injects stochastic noise into the forward pass that interferes with the reward signal and advantage estimation. The policy gradient is already high-variance because it is estimated from sampled rewards, so adding dropout noise on top makes training less stable. The codebase therefore explicitly sets every dropout module to p=0 for both the policy and the reference model.
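A minimal, self-contained sketch (illustrative, not from the open-instruct codebase) of the noise in question: with dropout active, two forward passes over the same input disagree, so the log-probabilities feeding the policy gradient fluctuate even when the weights do not change.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Dropout(p=0.5))
x = torch.randn(1, 16)

net.train()                  # dropout active, as during RL updates
a, b = net(x), net(x)        # same input, two different dropout masks
print(torch.allclose(a, b))  # False (almost surely): the forward pass is noisy

for m in net.modules():      # the heuristic: zero out every dropout layer
    if isinstance(m, torch.nn.Dropout):
        m.p = 0
c, d = net(x), net(x)
print(torch.allclose(c, d))  # True: the forward pass is deterministic
```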
Usage
Apply this heuristic to all on-policy RL training (GRPO, PPO). It is not typically applied to SFT or DPO, where dropout regularization can still be beneficial.
The Insight (Rule of Thumb)
- Action: Call `disable_dropout_in_model(model)` on both the policy and reference models before training (see the usage sketch after this list).
- Value: Set `module.p = 0` for all `torch.nn.Dropout` instances.
- Trade-off: Loss of dropout regularization; mitigated by the KL penalty serving as implicit regularization in GRPO.
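A minimal usage sketch of the action above, assuming Hugging Face `AutoModelForCausalLM` checkpoints and that the utility is importable from `open_instruct.model_utils` (as the file path under Code Evidence suggests); the checkpoint name is hypothetical.

```python
from transformers import AutoModelForCausalLM

from open_instruct.model_utils import disable_dropout_in_model

# Hypothetical checkpoint; in GRPO the reference model is typically
# initialized from the same weights as the policy.
policy = AutoModelForCausalLM.from_pretrained("my-org/my-sft-checkpoint")
ref = AutoModelForCausalLM.from_pretrained("my-org/my-sft-checkpoint")

# Zero out every torch.nn.Dropout in both models before any RL updates.
disable_dropout_in_model(policy)
disable_dropout_in_model(ref)
```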
Reasoning
Policy-gradient methods estimate the gradient from sampled rewards, which are already high-variance. Dropout adds further noise to the forward pass, making those gradient estimates noisier still; it also breaks the on-policy assumption, since log-probabilities recomputed at update time differ from those at rollout even with unchanged weights. In GRPO specifically, the KL penalty against the reference policy already discourages drift and overfitting (serving a role similar to dropout), so dropout is redundant as a regularizer and harmful as a noise source.
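A schematic (not open-instruct's actual loss code) of where the KL penalty enters: the per-token loss combines a policy-gradient term with a penalty toward the reference model. The function name, the `kl_coef` value, and the simple k1 KL estimator are all illustrative assumptions.

```python
import torch

def grpo_style_loss(
    policy_logprobs: torch.Tensor,  # log pi(a|s) for sampled tokens
    ref_logprobs: torch.Tensor,     # log pi_ref(a|s) for the same tokens
    advantages: torch.Tensor,
    kl_coef: float = 0.05,          # illustrative value
) -> torch.Tensor:
    # REINFORCE-style policy-gradient term driven by the advantage signal.
    pg_loss = -(advantages * policy_logprobs)
    # Per-token k1 estimator of KL(policy || reference): regularizes toward
    # the reference model, covering the role dropout would otherwise play.
    kl_penalty = kl_coef * (policy_logprobs - ref_logprobs)
    return (pg_loss + kl_penalty).mean()
```

With dropout disabled, every term here is a deterministic function of the weights and the sampled tokens, so the only remaining variance comes from the reward signal itself.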
Code Evidence
Dropout-disabling utility from `open_instruct/model_utils.py:181-184`:
```python
import torch


def disable_dropout_in_model(model: torch.nn.Module) -> None:
    # Set p = 0 on every dropout layer so forward passes are deterministic.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = 0
```
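A quick sanity check after applying the utility (an illustrative assumption, not part of the codebase); `model` stands for whichever policy or reference model it was called on.

```python
disable_dropout_in_model(model)
assert all(
    m.p == 0 for m in model.modules() if isinstance(m, torch.nn.Dropout)
), "some dropout layers are still active"
```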