Principle: Allenai Open-Instruct GRPO Experiment Configuration
| Knowledge Sources | |
|---|---|
| Domains | Configuration Management, Reinforcement Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
GRPO experiment configuration is the practice of specifying and validating the complete set of hyperparameters that control a GRPO reinforcement learning training run, including optimization, algorithm, distributed training, and experiment tracking settings.
Description
A GRPO training run is governed by dozens of interrelated hyperparameters spanning multiple subsystems. Proper configuration management ensures:
- Correctness: Invalid parameter combinations are caught early through validation (e.g., `use_vllm_logprobs` is incompatible with truncated importance sampling).
- Reproducibility: All parameters are logged and saved, enabling exact reproduction of experiments.
- Convenience: Sensible defaults reduce the number of parameters that need explicit specification.
- Modularity: Configuration is split into logical groups: optimization, algorithm, distributed training, experiment tracking, and infrastructure.
The configuration covers these parameter groups:
- Optimization: Learning rate, scheduler type, warmup steps, weight decay, gradient clipping, fused optimizer.
- Algorithm: KL coefficient (beta), clipping bounds (clip_lower, clip_higher), loss function variant (DAPO vs CISPO), KL estimator selection, loss denominator strategy, reference policy update frequency and Polyak averaging coefficient.
- Batch sizing: Per-device batch size, total episodes, number of mini-batches, number of epochs per rollout.
- Distributed training: Number of learners per node, DeepSpeed stage, ZeRO partition group size, sequence parallelism, parameter/optimizer offloading, model gathering strategy.
- Checkpointing: Save frequency, checkpoint state frequency, output directory, HuggingFace Hub upload settings, Google Cloud Storage backup.
- Experiment tracking: W&B project, entity, verbose logging, evaluation intervals.
- Infrastructure: Backend timeout, single-GPU debug mode, queue dashboard settings.
Usage
The experiment configuration is typically populated from command-line arguments via a dataclass parser and passed as the central configuration object to all components of the GRPO pipeline. It is the single source of truth for all training hyperparameters.
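As a rough illustration of this pattern, the sketch below shows a trimmed-down config dataclass populated from `key=value` style command-line overrides. All field names, defaults, and the parser itself are hypothetical, not the actual open-instruct API:

```python
from dataclasses import dataclass


@dataclass
class GrpoConfig:
    # Illustrative subset of the hyperparameters described above.
    learning_rate: float = 1e-6
    beta: float = 0.05                      # KL penalty coefficient
    clip_lower: float = 0.2
    clip_higher: float = 0.2
    num_samples_per_prompt: int = 8
    loss_denominator: str = "token"


def parse_config(argv: list[str]) -> GrpoConfig:
    """Populate the config from --key=value command-line overrides."""
    cfg = GrpoConfig()
    for arg in argv:
        key, _, value = arg.partition("=")
        key = key.lstrip("-")
        current = getattr(cfg, key)         # raises if the field is unknown
        setattr(cfg, key, type(current)(value))
    return cfg
```

In practice a dataclass argument parser (such as `HfArgumentParser`-style tooling) generates the full CLI automatically, so unspecified fields fall back to their defaults.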
Theoretical Basis
Key Hyperparameter Relationships
Effective batch size:

    effective_batch_size = num_unique_prompts * num_samples_per_prompt
    gradient_accumulation = effective_batch_size / (per_device_batch_size * world_size * num_mini_batches)
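A worked numeric example of these two relationships, using illustrative values rather than open-instruct defaults:

```python
# Illustrative values, not library defaults.
num_unique_prompts = 64
num_samples_per_prompt = 8
per_device_batch_size = 4
world_size = 16
num_mini_batches = 2

# Each unique prompt is expanded into several sampled completions.
effective_batch_size = num_unique_prompts * num_samples_per_prompt

# Gradient accumulation bridges the gap between the effective batch and
# what the devices can process per optimizer step.
gradient_accumulation = effective_batch_size // (
    per_device_batch_size * world_size * num_mini_batches
)
```

With these numbers the effective batch is 512 sequences and each mini-batch accumulates over 4 micro-steps.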
KL penalty strength:

    total_loss = policy_loss + beta * kl_divergence

- `beta = 0`: No KL penalty (pure reward maximization, risk of reward hacking)
- `beta > 0`: Regularization toward the reference model
- `beta` too large: Training stalls (policy stuck near the reference)
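A per-token sketch of how the penalty can be assembled, using the low-variance "k3" KL estimator (one common choice where the configuration offers KL estimator selection; function names here are illustrative, and a real implementation works on tensors rather than scalars):

```python
import math


def kl_k3(logprob: float, ref_logprob: float) -> float:
    """Nonnegative single-sample estimator of KL(policy || reference):
    r - log(r) - 1, with r = exp(ref_logprob - logprob)."""
    log_ratio = ref_logprob - logprob
    return math.exp(log_ratio) - log_ratio - 1.0


def token_loss(policy_loss: float, logprob: float,
               ref_logprob: float, beta: float) -> float:
    # beta = 0 drops the penalty; larger beta pulls toward the reference.
    return policy_loss + beta * kl_k3(logprob, ref_logprob)
```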
Clipping parameters:

- `clip_lower = 0.2, clip_higher = 0.2`: Standard PPO clipping (symmetric)
- `clip_lower = 0.2, clip_higher = 0.28`: DAPO-style asymmetric (more exploration)
- `clip_lower = 0.0, clip_higher = 0.28`: Extreme asymmetry (only clips upward)
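The asymmetric bounds generalize PPO's clipped surrogate in a straightforward way; a minimal scalar sketch (a real implementation is tensorized):

```python
def clipped_objective(ratio: float, advantage: float,
                      clip_lower: float, clip_higher: float) -> float:
    """PPO-style clipped surrogate with independent lower/upper bounds.

    The importance ratio is confined to [1 - clip_lower, 1 + clip_higher];
    the pessimistic (minimum) branch is the objective to maximize.
    """
    clipped = min(max(ratio, 1.0 - clip_lower), 1.0 + clip_higher)
    return min(ratio * advantage, clipped * advantage)
```

Raising `clip_higher` widens the upward band, letting positive-advantage tokens gain more probability mass per update, which is the DAPO-style exploration effect noted above.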
Reference policy update:

    ref_param = alpha * param + (1 - alpha) * ref_param

- `ref_policy_update_freq = None`: Static reference model (never updated)
- `ref_policy_update_freq = K`: Polyak update every K steps
- `alpha = 0.6` (default): Fast tracking of the current policy
- `alpha = 0.01`: Slow tracking (closer to a fixed reference)
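The Polyak update is applied parameter-wise; a minimal sketch with plain floats standing in for parameter tensors:

```python
def polyak_update(params: list[float], ref_params: list[float],
                  alpha: float) -> list[float]:
    """ref_param <- alpha * param + (1 - alpha) * ref_param, element-wise.

    alpha = 1.0 copies the current policy outright; small alpha keeps the
    reference close to its previous state.
    """
    return [alpha * p + (1.0 - alpha) * r
            for p, r in zip(params, ref_params)]
```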
Loss denominator:

- `loss_denominator = "token"`: Standard per-token averaging (most common)
- `loss_denominator = "1024"`: Fixed denominator (Dr GRPO style)
  - Decouples loss magnitude from batch composition
  - Useful for highly variable sequence lengths
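The two strategies differ only in what the summed per-token losses are divided by; a minimal sketch (function name illustrative):

```python
def reduce_loss(token_losses: list[float], loss_denominator) -> float:
    """Reduce a batch's per-token losses to a scalar.

    "token": divide by the actual token count (average varies with batch
    composition). A number: divide by that fixed constant, so the loss
    magnitude is independent of how many tokens the batch happens to hold.
    """
    if loss_denominator == "token":
        return sum(token_losses) / len(token_losses)
    return sum(token_losses) / float(loss_denominator)
```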
Validation Rules
The configuration enforces several invariants:
- Cannot use `use_vllm_logprobs` with truncated importance sampling (contradiction).
- `loss_denominator` must be `"token"` or a positive float.
- `checkpoint_state_dir` requires `checkpoint_state_freq > 0`, and vice versa.
- `sequence_parallel_size > 1` requires DeepSpeed stage 3.
- `load_ref_policy=False` requires `beta=0.0`.
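Invariants like these are naturally enforced at construction time, e.g. in a dataclass `__post_init__`. A sketch covering a few of the rules above (field names mirror the rules but are illustrative, not the actual open-instruct argument names):

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class GrpoArgs:
    use_vllm_logprobs: bool = False
    truncated_importance_sampling: bool = False
    loss_denominator: Union[str, float] = "token"
    sequence_parallel_size: int = 1
    deepspeed_stage: int = 2
    load_ref_policy: bool = True
    beta: float = 0.05

    def __post_init__(self):
        # Fail fast on contradictory settings, before any training starts.
        if self.use_vllm_logprobs and self.truncated_importance_sampling:
            raise ValueError("use_vllm_logprobs is incompatible with "
                             "truncated importance sampling")
        if self.loss_denominator != "token" and float(self.loss_denominator) <= 0:
            raise ValueError('loss_denominator must be "token" or a positive float')
        if self.sequence_parallel_size > 1 and self.deepspeed_stage != 3:
            raise ValueError("sequence_parallel_size > 1 requires DeepSpeed stage 3")
        if not self.load_ref_policy and self.beta != 0.0:
            raise ValueError("load_ref_policy=False requires beta=0.0")
```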