Heuristic: allenai/open-instruct Gradient Clipping Norm
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Reinforcement_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Clip gradient norms to 1.0 as the default maximum to prevent exploding gradients during training.
Description
Gradient clipping is applied via DeepSpeed's built-in gradient clipping mechanism with a maximum norm of 1.0. This conservative default prevents training instability from large gradient spikes, which are common in reinforcement learning (where reward signals can produce highly variable gradients) and in early training stages.
Usage
Apply this heuristic to all GRPO training runs. The default of 1.0 is suitable for most configurations. Increase to 2.0-5.0 if gradients are being clipped too aggressively (visible in the gradient norm logs); decrease below 1.0 if training is unstable.
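As a sketch of the monitoring advice above (the helper and the example norms are illustrative assumptions, not open-instruct code), one way to decide whether 1.0 is too aggressive is to check how often clipping fires in the logged pre-clip gradient norms:

```python
def clip_fraction(grad_norms, max_grad_norm=1.0):
    """Fraction of steps whose pre-clip gradient norm exceeded the threshold."""
    clipped = sum(1 for norm in grad_norms if norm > max_grad_norm)
    return clipped / len(grad_norms)

# Hypothetical per-step norms read from training logs.
norms = [0.4, 0.9, 1.7, 3.2, 0.8, 1.1]
frac = clip_fraction(norms)
# If clipping fires on a large fraction of steps, consider raising
# max_grad_norm toward 2.0-5.0; if training is unstable, lower it.
print(f"clipped on {frac:.0%} of steps")
```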
The Insight (Rule of Thumb)
- Action: Set `max_grad_norm = 1.0` in the ExperimentConfig and let DeepSpeed handle clipping.
- Value: 1.0 (L2 norm).
- Trade-off: Very conservative; may slow convergence if gradients are consistently clipped. Monitor gradient norms during training.
Reasoning
In GRPO, the loss combines a policy gradient term with a KL penalty. When the policy diverges significantly from the reference (e.g., after a batch with high-reward outliers), the gradient can spike. Without clipping, these spikes cause parameter updates that are too large, leading to training instability or divergence. The value 1.0 is a widely-used default in both supervised and RL training.
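The mechanism described above can be sketched as a plain re-implementation of global-norm clipping (conceptually what DeepSpeed does internally; this is an illustration, not the library code):

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global L2 norm is at most max_norm.

    When the norm exceeds the threshold, every component is multiplied by
    max_norm / total_norm, so the update direction is preserved while the
    magnitude of the spike is capped.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)

# A gradient spike with norm 5.0 is rescaled to norm 1.0.
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```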
Code Evidence
Default configuration from `open_instruct/grpo_utils.py:45`:

```python
max_grad_norm: float = 1.0
"""Maximum gradient norm for gradient clipping."""
```
DeepSpeed config integration from `open_instruct/utils.py:1428`:

```python
"gradient_clipping": max_norm
```
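A minimal sketch of how the threshold lands in a DeepSpeed config dict. `gradient_clipping` is a real DeepSpeed config key; the helper function and the other key shown are hypothetical illustrations, not the open-instruct code:

```python
def build_ds_config(max_grad_norm: float = 1.0) -> dict:
    """Hypothetical helper: embed the clipping threshold in a DeepSpeed config."""
    return {
        "train_micro_batch_size_per_gpu": 1,   # placeholder value for illustration
        "gradient_clipping": max_grad_norm,    # DeepSpeed caps the global grad norm here
    }

config = build_ds_config(1.0)
```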