Principle:Huggingface Open r1 GRPO Training

From Leeroopedia


Metadata

Field          Value
Sources        Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models; Paper: DeepSeek-R1; Doc: TRL GRPOTrainer
Domains        NLP, Reinforcement_Learning, Training
Last Updated   2026-02-08 00:00 GMT

Overview

A reinforcement learning training algorithm that improves language model reasoning by sampling multiple completions per prompt and using group-relative reward normalization to compute policy gradients.

Description

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm designed for training language models to reason. Unlike PPO, which uses a separate critic/value model to estimate a baseline, GRPO generates a group of completions for each prompt and normalizes rewards within the group. This eliminates the need for a value network while still providing a stable training signal.

The algorithm operates as follows:

  1. Sample completions. For each prompt, generate G completions from the current policy, where G is the group size.
  2. Score completions. Evaluate each completion using one or more reward functions (e.g., correctness, format compliance).
  3. Normalize rewards within group. Compute per-prompt (group-relative) advantages by subtracting the group mean reward and dividing by the group standard deviation.
  4. Compute policy gradients. Use the normalized advantages to weight the log-probability of each completion under the current policy.
  5. Apply KL penalty. Add a KL divergence penalty to prevent the policy from diverging too far from the reference model.
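
Step 3 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation used by any particular trainer. The function name `group_relative_advantages` and the `eps` guard against a zero standard deviation are assumptions introduced for this sketch.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group.

    `eps` is an assumed small constant that avoids division by zero when
    every completion in the group receives the same reward.
    """
    mean_r = fmean(rewards)
    std_r = pstdev(rewards)  # population std: divides by G, not G - 1
    return [(r - mean_r) / (std_r + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every completion scores the same carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct completions end up with a positive advantage and incorrect ones with a negative advantage, while the group with identical rewards yields all-zero advantages.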

GRPO is particularly effective for mathematical reasoning and code generation tasks where reward functions (correctness verification, format compliance, code execution results) can be defined programmatically. By removing the critic network, GRPO avoids storing and training a second model that is typically comparable in size to the policy, which substantially reduces memory and compute requirements relative to PPO. Group-relative normalization supplies the baseline that the critic would otherwise provide.

Usage

Use GRPO when training models to improve reasoning capabilities (math, code) where you can define programmatic reward functions that evaluate the correctness or quality of model outputs. GRPO is preferred over supervised fine-tuning (SFT) when you want the model to discover novel reasoning patterns rather than just imitating teacher outputs.

Typical scenarios include:

  • Mathematical reasoning — training on datasets with verifiable numerical answers.
  • Code generation — training on programming problems with executable test cases.
  • Multi-objective optimization — combining multiple reward signals (accuracy, format, length, repetition) with configurable weights.
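
The multi-objective case amounts to a weighted sum of individual reward functions. The sketch below is illustrative only: `accuracy_reward`, `format_reward`, `combined_reward`, the extra `answer` argument, the `<think>` tag convention, and the weights are all hypothetical choices made for this example.

```python
import re


def accuracy_reward(prompt, completion, answer):
    """Hypothetical reward: 1.0 if the last number in the completion equals `answer`."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and float(numbers[-1]) == float(answer) else 0.0


def format_reward(prompt, completion, answer):
    """Hypothetical reward: 1.0 if the completion contains a <think>...</think> span."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0


def combined_reward(prompt, completion, answer, reward_funcs, weights):
    """Weighted sum of the individual reward signals for one completion."""
    return sum(w * f(prompt, completion, answer) for f, w in zip(reward_funcs, weights))


reward = combined_reward(
    "What is 6 * 7?",
    "<think>6 * 7 = 42</think> The answer is 42",
    "42",
    reward_funcs=[accuracy_reward, format_reward],
    weights=[1.0, 0.5],  # assumed weights: correctness counts twice as much as format
)
```

Here the completion is both correct and well formatted, so it receives the full weighted total of 1.5. A correct answer without the tags would score 1.0.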

Theoretical Basis

The GRPO algorithm replaces the learned value function of PPO with a group-relative normalization scheme. The core formulation is:

GRPO Algorithm
==============

Input:
  - policy pi (current language model)
  - reference policy pi_ref (frozen copy of original model)
  - reward functions R_1, R_2, ..., R_K
  - group size G (number of completions per prompt)
  - KL penalty coefficient beta

For each training step:
  1. Sample a batch of prompts {x_1, x_2, ..., x_B}

  2. For each prompt x_i:
     a. Sample G completions from current policy:
        {o_1, o_2, ..., o_G} ~ pi(. | x_i)

     b. Compute rewards for each completion:
        r_j = sum_k( w_k * R_k(x_i, o_j) )   for j = 1..G

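     c. Normalize rewards within the group:
        mean_r = (1/G) * sum_j(r_j)
        std_r  = sqrt( (1/G) * sum_j( (r_j - mean_r)^2 ) )
        advantage_j = (r_j - mean_r) / (std_r + eps)   for j = 1..G
        (eps is a small constant, e.g. 1e-4, that avoids division by zero
         when all rewards in the group are equal)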

  3. Compute policy gradient loss:
     L = -E[ advantage_j * log pi(o_j | x_i) ]

  4. Add KL divergence penalty:
     L_total = L + beta * KL( pi || pi_ref )

  5. Update policy pi using L_total
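
Steps 3 and 4 of the pseudocode above can be written out for a single group as follows. This is a sketch under stated assumptions: it only evaluates the scalar loss (a real training loop would differentiate it with an autograd framework), it works with one summed log-probability per completion rather than per-token values, and it approximates the KL term with the nonnegative estimator exp(d) - d - 1, where d = log pi_ref - log pi. The function name `grpo_loss` and the default `beta` are illustrative choices, not values taken from this page.

```python
import math


def grpo_loss(advantages, logp_policy, logp_ref, beta=0.04):
    """Scalar value of the simplified loss for one group of completions.

    logp_policy[j] and logp_ref[j] are the summed token log-probabilities of
    completion j under the current policy and the frozen reference policy.
    `beta` is an assumed illustrative KL coefficient.
    """
    group_size = len(advantages)

    # Policy-gradient term: -mean_j( advantage_j * log pi(o_j | x_i) )
    pg_term = -sum(a * lp for a, lp in zip(advantages, logp_policy)) / group_size

    # KL term, estimated per completion as exp(d) - d - 1 with
    # d = log pi_ref - log pi; this quantity is zero when the two policies agree.
    kl_term = sum(
        math.exp(lr - lp) - (lr - lp) - 1.0
        for lp, lr in zip(logp_policy, logp_ref)
    ) / group_size

    return pg_term + beta * kl_term


# Two completions; the policy still matches the reference, so the KL term is zero.
loss = grpo_loss(
    advantages=[1.0, -1.0],
    logp_policy=[-2.0, -3.0],
    logp_ref=[-2.0, -3.0],
)
```

With identical policy and reference log-probabilities the penalty vanishes and the loss reduces to the policy-gradient term, here -0.5. As the policy moves away from the reference, the KL estimate grows and adds to the loss.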

The key insight is that normalizing rewards within each per-prompt group yields a baseline without a separate value network. Completions that score above the group average receive a positive advantage, so the policy increases their probability; those below the average receive a negative advantage, so the policy decreases it. If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no policy-gradient signal. The KL penalty term keeps the policy close to the reference model, which guards against drifting into a narrow, degenerate output distribution.
