
Principle:Allenai Open instruct GRPO Experiment Configuration

From Leeroopedia


Domains: Configuration Management, Reinforcement Learning
Last Updated: 2026-02-07 00:00 GMT

Overview

GRPO experiment configuration is the practice of specifying and validating the complete set of hyperparameters that control a GRPO reinforcement learning training run, including optimization, algorithm, distributed training, and experiment tracking settings.

Description

A GRPO training run is governed by dozens of interrelated hyperparameters spanning multiple subsystems. Proper configuration management ensures:

  • Correctness: Invalid parameter combinations are caught early through validation (e.g., use_vllm_logprobs is incompatible with truncated importance sampling).
  • Reproducibility: All parameters are logged and saved, enabling exact reproduction of experiments.
  • Convenience: Sensible defaults reduce the number of parameters that need explicit specification.
  • Modularity: Configuration is split into logical groups: optimization, algorithm, distributed training, experiment tracking, and infrastructure.

The configuration covers these parameter groups:

  1. Optimization: Learning rate, scheduler type, warmup steps, weight decay, gradient clipping, fused optimizer.
  2. Algorithm: KL coefficient (beta), clipping bounds (clip_lower, clip_higher), loss function variant (DAPO vs CISPO), KL estimator selection, loss denominator strategy, reference policy update frequency and Polyak averaging coefficient.
  3. Batch sizing: Per-device batch size, total episodes, number of mini-batches, number of epochs per rollout.
  4. Distributed training: Number of learners per node, DeepSpeed stage, ZeRO partition group size, sequence parallelism, parameter/optimizer offloading, model gathering strategy.
  5. Checkpointing: Save frequency, checkpoint state frequency, output directory, HuggingFace Hub upload settings, Google Cloud Storage backup.
  6. Experiment tracking: W&B project, entity, verbose logging, evaluation intervals.
  7. Infrastructure: Backend timeout, single-GPU debug mode, queue dashboard settings.
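As a concrete sketch, these groups are often collected into a single flat dataclass with commented sections. The field names below are illustrative stand-ins inspired by the list above, not the actual open-instruct argument names.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch only: fields approximate the parameter groups above
# and are not the actual open-instruct argument names.
@dataclass
class GRPOExperimentConfig:
    # 1. Optimization
    learning_rate: float = 1e-6
    lr_scheduler_type: str = "linear"
    warmup_steps: int = 0
    max_grad_norm: float = 1.0
    # 2. Algorithm
    beta: float = 0.0                       # KL coefficient
    clip_lower: float = 0.2
    clip_higher: float = 0.2
    loss_denominator: str = "token"
    ref_policy_update_freq: Optional[int] = None
    # 3. Batch sizing
    per_device_train_batch_size: int = 1
    num_mini_batches: int = 1
    # 5. Checkpointing
    output_dir: str = "output"
    # 6. Experiment tracking
    wandb_project: str = "grpo"
```

Grouping by comment rather than nested objects keeps every parameter addressable as a flat CLI flag.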

Usage

The experiment configuration is typically populated from command-line arguments via a dataclass parser and passed as the central configuration object to all components of the GRPO pipeline. It is the single source of truth for all training hyperparameters.
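A minimal stdlib sketch of that pattern is shown below; the real pipeline uses its own dataclass-based parser, so this argparse version is only illustrative, and the field names are hypothetical.

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class Args:
    learning_rate: float = 1e-6
    beta: float = 0.05
    num_mini_batches: int = 1

def parse_args(argv=None) -> Args:
    # Build one CLI flag per dataclass field, typed from its annotation,
    # so defaults live in exactly one place: the dataclass.
    parser = argparse.ArgumentParser()
    for f in fields(Args):
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return Args(**vars(parser.parse_args(argv)))

args = parse_args(["--beta", "0.1"])
```

From here the single `args` object is threaded through every component of the pipeline.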

Theoretical Basis

Key Hyperparameter Relationships

Effective batch size:

effective_batch_size = num_unique_prompts * num_samples_per_prompt
gradient_accumulation = effective_batch_size / (per_device_batch_size * world_size * num_mini_batches)
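The same arithmetic as a checked helper; the argument names mirror the formulas above rather than specific open-instruct flags.

```python
def gradient_accumulation_steps(num_unique_prompts: int,
                                num_samples_per_prompt: int,
                                per_device_batch_size: int,
                                world_size: int,
                                num_mini_batches: int) -> int:
    # Effective batch size is the total number of rollouts per training step.
    effective_batch_size = num_unique_prompts * num_samples_per_prompt
    denom = per_device_batch_size * world_size * num_mini_batches
    if effective_batch_size % denom != 0:
        raise ValueError("effective batch size must divide evenly across "
                         "devices and mini-batches")
    return effective_batch_size // denom
```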

KL penalty strength:

total_loss = policy_loss + beta * kl_divergence

beta = 0: No KL penalty (pure reward maximization, risk of reward hacking)
beta > 0: Regularization toward reference model
beta too large: Training progress is too slow (policy stuck near reference)
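As a toy illustration of how beta scales the penalty, with scalars standing in for what are per-token tensors in practice:

```python
def total_loss(policy_loss: float, kl_divergence: float, beta: float) -> float:
    # beta = 0 disables the KL term entirely; larger beta pulls the
    # policy harder toward the reference model.
    return policy_loss + beta * kl_divergence
```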

Clipping parameters:

clip_lower = 0.2, clip_higher = 0.2: Standard PPO clipping (symmetric)
clip_lower = 0.2, clip_higher = 0.28: DAPO-style asymmetric (more exploration)
clip_lower = 0.0, clip_higher = 0.28: Extreme asymmetry (only clips upward)
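A sketch of the clipped surrogate objective with independent lower and upper bounds (asymmetric when clip_higher > clip_lower, as in DAPO); scalars again stand in for per-token tensors.

```python
def clipped_objective(ratio: float, advantage: float,
                      clip_lower: float, clip_higher: float) -> float:
    # Clamp the importance ratio into [1 - clip_lower, 1 + clip_higher].
    clipped_ratio = min(max(ratio, 1.0 - clip_lower), 1.0 + clip_higher)
    # PPO-style pessimistic objective: take the minimum of the
    # unclipped and clipped surrogate terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Raising clip_higher lets large positive-advantage updates through (more exploration) while clip_lower still bounds downward moves.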

Reference policy update:

ref_policy_update_freq = None: Static reference model (never updated)
ref_policy_update_freq = K: Polyak update every K steps
    ref_param = alpha * param + (1 - alpha) * ref_param
    alpha = 0.6 (default): Fast tracking of the current policy
    alpha = 0.01: Slow tracking (closer to fixed reference)
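The Polyak update above, sketched with plain floats in place of model parameter tensors:

```python
def polyak_update(param: float, ref_param: float, alpha: float) -> float:
    # alpha close to 1 makes the reference track the current policy
    # quickly; alpha close to 0 keeps it near a fixed reference.
    return alpha * param + (1.0 - alpha) * ref_param
```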

Loss denominator:

loss_denominator = "token": Standard per-token averaging (most common)
loss_denominator = "1024": Fixed denominator (Dr GRPO style)
    Decouples loss magnitude from batch composition
    Useful for very variable sequence lengths
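The two denominator strategies can be sketched as a single reduction over a flat list of per-token losses; in practice this operates on masked tensors.

```python
def reduce_loss(token_losses, loss_denominator="token"):
    total = sum(token_losses)
    if loss_denominator == "token":
        # Standard per-token averaging: denominator varies with batch
        # composition (number of valid tokens).
        return total / len(token_losses)
    # Dr GRPO style: fixed constant denominator, so loss magnitude is
    # independent of how many tokens the batch happens to contain.
    return total / float(loss_denominator)
```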

Validation Rules

The configuration enforces several invariants:

  • Cannot use use_vllm_logprobs with truncated importance sampling (contradiction).
  • loss_denominator must be "token" or a positive float.
  • checkpoint_state_dir requires checkpoint_state_freq > 0 and vice versa.
  • sequence_parallel_size > 1 requires DeepSpeed stage 3.
  • load_ref_policy=False requires beta=0.0.
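The invariants above can be sketched as a single fail-fast validator; attribute names follow the page's parameter names and are not necessarily the real flags.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Cfg:
    use_vllm_logprobs: bool = False
    truncated_importance_sampling: bool = False
    loss_denominator: str = "token"
    checkpoint_state_dir: Optional[str] = None
    checkpoint_state_freq: int = 0
    sequence_parallel_size: int = 1
    deepspeed_stage: int = 2
    load_ref_policy: bool = True
    beta: float = 0.0

def validate(cfg: Cfg) -> None:
    # Raise early, before any GPU work, if parameters contradict.
    if cfg.use_vllm_logprobs and cfg.truncated_importance_sampling:
        raise ValueError("use_vllm_logprobs is incompatible with "
                         "truncated importance sampling")
    if cfg.loss_denominator != "token" and float(cfg.loss_denominator) <= 0:
        raise ValueError("loss_denominator must be 'token' or a positive float")
    if (cfg.checkpoint_state_dir is None) != (cfg.checkpoint_state_freq <= 0):
        raise ValueError("checkpoint_state_dir and checkpoint_state_freq > 0 "
                         "must be set together")
    if cfg.sequence_parallel_size > 1 and cfg.deepspeed_stage != 3:
        raise ValueError("sequence_parallel_size > 1 requires DeepSpeed stage 3")
    if not cfg.load_ref_policy and cfg.beta != 0.0:
        raise ValueError("load_ref_policy=False requires beta=0.0")
```

Validating at construction time keeps invalid combinations from surfacing as obscure failures mid-run.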
