Heuristic:Allenai Open instruct Gradient Clipping Norm

From Leeroopedia




Knowledge Sources
Domains Optimization, Reinforcement_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

Clip the gradient norm to a default maximum of 1.0 to prevent exploding gradients during training.

Description

Gradient clipping is applied via DeepSpeed's built-in gradient clipping mechanism with a maximum norm of 1.0. This conservative default prevents training instability from large gradient spikes, which are common in reinforcement learning (where reward signals can produce highly variable gradients) and in early training stages.
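To make the mechanism concrete, here is a minimal pure-Python sketch of clipping by global L2 norm. It assumes DeepSpeed follows the usual global-norm scheme (the same idea as PyTorch's `torch.nn.utils.clip_grad_norm_`); the source does not show DeepSpeed's internals, and the function name `clip_by_global_norm` and the `eps` value are illustrative choices, not part of open-instruct.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients so their combined L2 norm is at most max_norm.

    grads: list of lists of floats, one inner list per parameter tensor.
    Returns (clipped_grads, total_norm_before_clipping).
    Hypothetical helper for illustration only.
    """
    # One norm over every gradient element, not one norm per tensor.
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    # Never scale up: the factor is capped at 1.0.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm


# A gradient spike: the global norm is 5.0, well above the 1.0 threshold.
grads = [[3.0, 0.0], [0.0, 4.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))

# A small gradient passes through unchanged.
small, small_norm = clip_by_global_norm([[0.1, 0.2]], max_norm=1.0)
```

Because every element is multiplied by the same factor, clipping shrinks the update's magnitude but preserves its direction.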

Usage

Apply this heuristic to all GRPO training runs. The default of 1.0 suits most configurations. Consider raising it to 2.0-5.0 if the gradient norm logs show that gradients are being clipped too aggressively. Lower it below 1.0 if training remains unstable at the default.
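One way to judge whether clipping is too aggressive is to summarize the logged pre-clip gradient norms. The sketch below is illustrative only: `clip_stats` is a hypothetical helper, the sample values are made up, and it assumes the logged norms are recorded before clipping is applied.

```python
def clip_stats(grad_norms, max_norm=1.0):
    """Summarize how often pre-clip gradient norms exceed max_norm.

    Hypothetical helper; grad_norms is a list of per-step global norms.
    """
    if not grad_norms:
        raise ValueError("grad_norms must be non-empty")
    ordered = sorted(grad_norms)
    exceeded = sum(1 for n in grad_norms if n > max_norm)
    return {
        # Fraction of steps on which clipping actually rescaled the update.
        "clip_fraction": exceeded / len(grad_norms),
        # Upper median for even-length inputs.
        "median_norm": ordered[len(ordered) // 2],
        "largest_norm": ordered[-1],
    }


# Made-up example norms, one per optimizer step.
logged = [0.4, 0.9, 1.3, 0.7, 6.2, 0.8, 1.1, 0.6]
stats = clip_stats(logged, max_norm=1.0)
```

A high clip fraction together with a median well above the threshold suggests the threshold is binding on most steps. A low fraction with occasional large outliers is the spike pattern that clipping is meant to absorb. Where to draw the line is a judgment call for each run.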

The Insight (Rule of Thumb)

  • Action: Set `max_grad_norm = 1.0` in the ExperimentConfig and let DeepSpeed handle clipping.
  • Value: 1.0 (L2 norm).
  • Trade-off: Conservative; convergence may slow if gradient norms consistently exceed the threshold, because every such update is scaled down. Monitor gradient norms during training.

Reasoning

In GRPO, the loss combines a policy gradient term with a KL penalty. When the policy diverges significantly from the reference (e.g., after a batch with high-reward outliers), the gradient can spike. Without clipping, these spikes produce parameter updates that are too large, leading to training instability or divergence. The value 1.0 is a widely used default in both supervised and RL training.

Code Evidence

Default configuration from `open_instruct/grpo_utils.py:45`:

max_grad_norm: float = 1.0
"""Maximum gradient norm for gradient clipping."""

DeepSpeed config integration from `open_instruct/utils.py:1428`:

"gradient_clipping": max_norm
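The two fragments above can be connected as in the following sketch. `"gradient_clipping"` is a top-level DeepSpeed config key, but the class `ExperimentConfigSketch`, the helper `build_ds_config`, and the ZeRO stage are assumptions made for illustration and are not copied from `open_instruct`.

```python
from dataclasses import dataclass


@dataclass
class ExperimentConfigSketch:
    # Stand-in for the real ExperimentConfig; only the relevant field is shown.
    max_grad_norm: float = 1.0
    """Maximum gradient norm for gradient clipping."""


def build_ds_config(max_norm: float, zero_stage: int = 2) -> dict:
    """Return a minimal DeepSpeed-style config dict (hypothetical helper)."""
    return {
        # The ZeRO stage is an assumed value, included only for context.
        "zero_optimization": {"stage": zero_stage},
        # DeepSpeed reads the clipping threshold from this key.
        "gradient_clipping": max_norm,
    }


cfg = ExperimentConfigSketch()
ds_config = build_ds_config(cfg.max_grad_norm)
```

With the threshold carried in the DeepSpeed config, the training loop does not need its own clipping call, which matches the instruction to let DeepSpeed handle clipping.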
