Heuristic: Disable Dropout in RL (allenai/open-instruct)
| Knowledge Sources | |
|---|---|
| Domains | Reinforcement_Learning, Optimization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Disable all dropout layers (set p=0) during reinforcement learning training to prevent stochastic forward-pass noise from corrupting the reward signal and advantage estimates.
Description
In standard supervised learning, dropout acts as a regularizer. In on-policy RL (e.g., GRPO), however, dropout injects stochastic noise into the forward pass that interferes with the reward signal and advantage estimation. The policy gradient is already high-variance because it is estimated from sampled rewards, so adding dropout noise on top makes training less stable. The codebase therefore explicitly sets every dropout module to p=0 for both the policy and the reference model.
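A minimal, self-contained sketch (illustrative, not from the open-instruct codebase) of the noise in question: with dropout active, two forward passes over the same input disagree, so the log-probabilities feeding the policy gradient fluctuate even when the weights do not change.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.Dropout(p=0.5))
x = torch.randn(1, 16)

net.train()                  # dropout active, as during RL updates
a, b = net(x), net(x)        # same input, two different dropout masks
print(torch.allclose(a, b))  # False (almost surely): the forward pass is noisy

for m in net.modules():      # the heuristic: zero out every dropout layer
    if isinstance(m, torch.nn.Dropout):
        m.p = 0
c, d = net(x), net(x)
print(torch.allclose(c, d))  # True: the forward pass is deterministic
```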
Usage
Apply this heuristic to all on-policy RL training (GRPO, PPO). It is not typically applied to SFT or DPO, where dropout regularization can still be beneficial.
The Insight (Rule of Thumb)
- Action: Call `disable_dropout_in_model(model)` on both the policy and reference models before training (see the usage sketch after this list).
- Value: Set `module.p = 0` for all `torch.nn.Dropout` instances.
- Trade-off: Loss of dropout regularization; mitigated by the KL penalty serving as implicit regularization in GRPO.
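A minimal usage sketch of the action above, assuming Hugging Face `AutoModelForCausalLM` checkpoints and that the utility is importable from `open_instruct.model_utils` (as the file path under Code Evidence suggests); the checkpoint name is hypothetical.

```python
from transformers import AutoModelForCausalLM

from open_instruct.model_utils import disable_dropout_in_model

# Hypothetical checkpoint; in GRPO the reference model is typically
# initialized from the same weights as the policy.
policy = AutoModelForCausalLM.from_pretrained("my-org/my-sft-checkpoint")
ref = AutoModelForCausalLM.from_pretrained("my-org/my-sft-checkpoint")

# Zero out every torch.nn.Dropout in both models before any RL updates.
disable_dropout_in_model(policy)
disable_dropout_in_model(ref)
```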
Reasoning
Policy-gradient methods estimate the gradient from sampled rewards, which are already high-variance. Dropout adds further noise to the forward pass, making those gradient estimates noisier still; it also breaks the on-policy assumption, since log-probabilities recomputed at update time differ from those at rollout even with unchanged weights. In GRPO specifically, the KL penalty against the reference policy already discourages drift and overfitting (serving a role similar to dropout), so dropout is redundant as a regularizer and harmful as a noise source.
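A schematic (not open-instruct's actual loss code) of where the KL penalty enters: the per-token loss combines a policy-gradient term with a penalty toward the reference model. The function name, the `kl_coef` value, and the simple k1 KL estimator are all illustrative assumptions.

```python
import torch

def grpo_style_loss(
    policy_logprobs: torch.Tensor,  # log pi(a|s) for sampled tokens
    ref_logprobs: torch.Tensor,     # log pi_ref(a|s) for the same tokens
    advantages: torch.Tensor,
    kl_coef: float = 0.05,          # illustrative value
) -> torch.Tensor:
    # REINFORCE-style policy-gradient term driven by the advantage signal.
    pg_loss = -(advantages * policy_logprobs)
    # Per-token k1 estimator of KL(policy || reference): regularizes toward
    # the reference model, covering the role dropout would otherwise play.
    kl_penalty = kl_coef * (policy_logprobs - ref_logprobs)
    return (pg_loss + kl_penalty).mean()
```

With dropout disabled, every term here is a deterministic function of the weights and the sampled tokens, so the only remaining variance comes from the reward signal itself.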
Code Evidence
Dropout-disabling utility from `open_instruct/model_utils.py:181-184`:
```python
import torch


def disable_dropout_in_model(model: torch.nn.Module) -> None:
    # Set p = 0 on every dropout layer so forward passes are deterministic.
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.p = 0
```
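A quick sanity check after applying the utility (an illustrative assumption, not part of the codebase); `model` stands for whichever policy or reference model it was called on.

```python
disable_dropout_in_model(model)
assert all(
    m.p == 0 for m in model.modules() if isinstance(m, torch.nn.Dropout)
), "some dropout layers are still active"
```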