Principle:Hpcaitech ColossalAI GRPO Consumer Setup

Knowledge Sources	ColossalAI; DeepSeekMath: Pushing the Limits of Mathematical Reasoning
Domains	Reinforcement_Learning, Distributed_Computing
Last Updated	2026-02-09 00:00 GMT

Overview

A training-worker pattern in which a consumer receives experience batches from producers and updates the policy model with the Group Relative Policy Optimization (GRPO) objective, using ColossalAI for distributed training.

Description

The GRPO Consumer is a training worker that receives experience batches (generated responses, log probabilities, rewards, advantages) from producer actors and updates the policy model. It uses ColossalAI's Booster for distributed training, supporting ZeRO and hybrid parallelism. After each update, it broadcasts updated weights back to producers.
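The receive, update, broadcast cycle described above can be sketched with standard-library queues. This is a minimal single-process illustration, not ColossalAI's implementation: `run_consumer`, `toy_update`, the queue names, and the batch layout are all hypothetical, and the toy update stands in for a real GRPO step performed through the Booster.

```python
import queue


def toy_update(weights, batch, lr=0.1):
    """Hypothetical stand-in for one policy update (not a real GRPO step)."""
    # Nudge a single scalar parameter by the batch's mean advantage.
    mean_adv = sum(batch["advantages"]) / len(batch["advantages"])
    return {"theta": weights["theta"] + lr * mean_adv}


def run_consumer(experience_queue, weight_queue, update_fn, weights, num_steps):
    """Consume `num_steps` batches, update the weights, publish each new version."""
    for step in range(num_steps):
        # Block until a producer delivers an experience batch.
        batch = experience_queue.get()
        # Apply one update using the received batch.
        weights = update_fn(weights, batch)
        # Publish a versioned snapshot so producers can refresh their copy.
        weight_queue.put((step + 1, dict(weights)))
    return weights


# Usage: two pre-filled batches play the role of producer output.
experiences = queue.Queue()
published = queue.Queue()
experiences.put({"advantages": [1.0, -0.5, 0.5]})
experiences.put({"advantages": [0.0, 2.0]})

final_weights = run_consumer(
    experiences, published, toy_update, {"theta": 0.0}, num_steps=2
)
```

In a real deployment the queues would be replaced by whatever transport connects producer and consumer processes, and the snapshot would be the model's state rather than a one-entry dictionary.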

Usage

Consumers are automatically created by launch_distributed(). Configure the number of consumer GPUs based on model size and available memory.
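As a rough aid for choosing the consumer GPU count, the following back-of-envelope sketch estimates how many GPUs are needed to hold the training state. It assumes mixed-precision Adam at roughly 16 bytes per parameter, with weights, gradients, and optimizer states fully sharded across GPUs (as in ZeRO stage 3). Activation memory is not modeled and is only loosely covered by the hypothetical `usable_fraction` headroom. The function name and defaults are illustrative and not part of ColossalAI.

```python
import math

# Assumed cost per parameter for mixed-precision Adam:
# fp16 weights (2) + fp16 gradients (2) + fp32 master weights and Adam moments (12).
BYTES_PER_PARAM = 16


def min_consumer_gpus(num_params, gpu_mem_gb, usable_fraction=0.6):
    """Estimate the fewest GPUs whose combined usable memory holds the training state."""
    total_bytes = num_params * BYTES_PER_PARAM
    # Reserve part of each GPU for activations, buffers, and fragmentation.
    usable_bytes_per_gpu = gpu_mem_gb * 1e9 * usable_fraction
    return max(1, math.ceil(total_bytes / usable_bytes_per_gpu))


# Example: a 7B-parameter policy on 80 GB devices.
gpus_for_7b = min_consumer_gpus(7e9, gpu_mem_gb=80)
```

Treat the result as a lower bound to start from; long sequences and large micro-batches can push the real requirement higher.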

Theoretical Basis

The consumer minimizes the GRPO policy loss with importance sampling:

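$\mathcal{L} = -\mathbb{E}\left[\min\left(\frac{\pi_{\theta}}{\pi_{\text{old}}} A,\ \operatorname{clip}\left(\frac{\pi_{\theta}}{\pi_{\text{old}}},\ 1 - \epsilon,\ 1 + \epsilon\right) A\right)\right] + \beta \cdot \mathrm{KL}\left(\pi_{\theta} \,\|\, \pi_{\text{ref}}\right)$

Here $\pi_{\theta}$ is the policy being trained, $\pi_{\text{old}}$ is the policy that generated the experiences, $\pi_{\text{ref}}$ is the frozen reference policy, $A$ is the advantage, $\epsilon$ is the clipping range, and $\beta$ weights the KL penalty.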
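The loss above can be evaluated per token as in the following framework-free sketch. It is an illustration under stated assumptions, not the consumer's actual code: the function name, argument layout, and default `clip_eps` and `beta` values are hypothetical, and the KL term uses the non-negative estimator $\frac{\pi_{\text{ref}}}{\pi_{\theta}} - \log\frac{\pi_{\text{ref}}}{\pi_{\theta}} - 1$ described in the DeepSeekMath paper, which may differ from what a given implementation uses.

```python
import math


def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.01):
    """Mean clipped-surrogate loss plus KL penalty over a flat list of tokens."""
    total = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        # Importance ratio pi_theta / pi_old, computed from log probabilities.
        ratio = math.exp(lp_new - lp_old)
        # Clip the ratio to [1 - eps, 1 + eps].
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic (minimum) surrogate objective.
        surrogate = min(ratio * adv, clipped * adv)
        # Assumed KL estimator: r - log(r) - 1 with r = pi_ref / pi_theta.
        log_r = lp_ref - lp_new
        kl = math.exp(log_r) - log_r - 1.0
        # Negate the surrogate because the loss is minimized.
        total += -surrogate + beta * kl
    return total / len(logp_new)


# Usage: one token whose probability doubled under the new policy.
loss_clipped = grpo_loss(
    logp_new=[math.log(2.0)], logp_old=[0.0], logp_ref=[math.log(2.0)],
    advantages=[1.0], clip_eps=0.2, beta=0.0,
)
```

With a positive advantage, the clip caps the ratio at $1 + \epsilon$, so increasing the token's probability further yields no additional reduction in loss.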

Related Pages

Implemented By

Implementation:Hpcaitech_ColossalAI_GRPOConsumer

Heuristic Links

Heuristic:Hpcaitech_ColossalAI_Empty_Cache_Between_Phases
