
Workflow:Hpcaitech ColossalAI Distributed GRPO Training

From Leeroopedia


Knowledge Sources
Domains LLMs, RLHF, Distributed_Training, Reinforcement_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

An end-to-end process for distributed Group Relative Policy Optimization (GRPO) training, built on a Ray-based producer-consumer architecture for scalable reinforcement learning from verifiable rewards.

Description

This workflow implements distributed GRPO training using ColossalAI's Ray-based producer-consumer framework. Producer processes generate multiple responses per prompt using the current policy, while consumer processes train the policy model using group-relative advantages computed from reward scores. The architecture supports verifiable reward functions (math, code verification) and learned reward models. It also supports algorithm variants including DAPO, REINFORCE++, and RLOO. The zero-bubble pipeline variant further optimizes GPU utilization by overlapping inference and training phases.

Usage

Execute this workflow when you need to train a language model using reinforcement learning with verifiable rewards at scale across multiple nodes. This is particularly suitable for math reasoning (GSM8K, competition math) and code generation tasks where reward signals can be computed automatically. GRPO is preferred over PPO when you want to eliminate the critic model and use group-relative advantages instead.

Execution Steps

Step 1: Dataset and Reward Configuration

Prepare a prompt dataset in JSONL format and configure the reward computation strategy. Rewards can come from verifiable functions (math answer checking, code execution) or from a learned reward model.

Key considerations:

  • Prompt dataset contains instruction prompts without completions
  • For math tasks: configure response format tags (think/answer tags for chain-of-thought)
  • For code tasks: configure code verification server for execution-based rewards
  • Reward function types: "think_answer_tags", "boxed", or "code"
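As a sketch of the "think_answer_tags" style, a verifiable reward can parse the response format and compare the extracted answer against a reference. The function name and partial-credit values below are illustrative assumptions, not ColossalAI's implementation:

```python
import re

def think_answer_reward(response: str, ground_truth: str) -> float:
    """Hypothetical verifiable reward for the "think_answer_tags" format:
    full credit only when the response contains the expected tags AND the
    extracted answer matches the reference; a small format-only credit
    when the tags are present but the answer is wrong."""
    match = re.search(
        r"<think>.*?</think>\s*<answer>(.*?)</answer>", response, re.DOTALL
    )
    if match is None:
        return 0.0  # missing chain-of-thought / answer tags: no reward
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.1
```

The "boxed" variant works the same way but extracts the final answer from a LaTeX `\boxed{...}` expression instead of answer tags.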

Step 2: Ray Cluster Initialization

Initialize the Ray cluster for distributed execution. The launcher queries available nodes and GPUs, then allocates producer and consumer processes across the cluster.

What happens:

  • Initialize Ray with local or multi-node addressing
  • Query node resources and GPU availability
  • Allocate producer processes to inference-optimized nodes
  • Allocate consumer processes to training-optimized nodes
  • Configure communication between producer and consumer actors
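The allocation step above can be sketched as a greedy split of the cluster's GPUs into producer and consumer slots. This helper is illustrative, not the actual ColossalAI launcher API:

```python
def plan_placement(node_gpus: dict, num_producer_gpus: int):
    """Greedily split cluster GPUs into producer (inference) slots and
    consumer (training) slots. Each slot is a (node, local_rank) pair."""
    producers, consumers = [], []
    for node, gpus in node_gpus.items():
        for local_rank in range(gpus):
            slot = (node, local_rank)
            if len(producers) < num_producer_gpus:
                producers.append(slot)
            else:
                consumers.append(slot)
    return producers, consumers
```

In a real launch you would first call `ray.init(address="auto")` (or `ray.init()` for a single node) and derive `node_gpus` from the resources reported by `ray.nodes()`, so that producers land on inference-optimized nodes and consumers on training-optimized ones.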

Step 3: Producer Setup (Inference Workers)

Create producer processes that handle response generation. Each producer loads the inference model with a configurable backend (HuggingFace Transformers or vLLM) and generates multiple responses per prompt.

What happens:

  • Load inference model with selected backend
  • Configure generation parameters (temperature, top-k, top-p, max tokens)
  • Support temperature annealing for curriculum-style training
  • Generate num_generations responses per prompt in batch
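Temperature annealing gives early training high-entropy exploration and later training more focused sampling. A minimal sketch, assuming a linear schedule (the schedule shape and parameter names are assumptions, not the framework's exact API):

```python
def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 1.0, t_end: float = 0.7) -> float:
    """Linearly anneal sampling temperature from t_start to t_end
    over the course of training (illustrative schedule)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# Example generation parameters for a producer at the start of training.
generation_config = dict(
    temperature=annealed_temperature(step=0, total_steps=1000),
    top_k=50,
    top_p=0.9,
    max_new_tokens=1024,
    num_return_sequences=8,  # num_generations responses per prompt
)
```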

Step 4: Consumer Setup (Training Workers)

Create consumer processes that train the actor model using policy gradient updates. Consumers receive generated experiences from producers and apply GRPO policy loss with optional pretraining auxiliary loss.

What happens:

  • Initialize actor model with ColossalAI booster and selected parallelism plugin
  • Load reference model for KL divergence computation
  • Configure GRPOConsumer with loss parameters (KL coefficient, clip epsilon)
  • Support multiple algorithm variants (GRPO, DAPO, REINFORCE++, RLOO)
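The consumer's loss configuration can be summarized as a small set of knobs. Parameter names and default values below are illustrative; check the `GRPOConsumer` signature in the repository for the exact names:

```python
# Hypothetical GRPO consumer configuration (names/values are assumptions).
grpo_consumer_config = {
    "algorithm": "GRPO",  # or "DAPO", "REINFORCE++", "RLOO"
    "kl_coef": 0.01,      # weight of the KL penalty vs. the frozen reference model
    "clip_eps": 0.2,      # clipping range for the importance-sampling ratio
    "ptx_coef": 0.0,      # optional pretraining auxiliary (PTX) loss weight
}
```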

Step 5: Experience Collection

Producers generate responses, compute rewards, and package experiences. Each experience contains the prompt, generated sequences, log probabilities, reward scores, and computed advantages.

What happens:

  • Producers sample prompts from the dataset
  • Generate multiple completions per prompt
  • Score completions using reward functions or reward model
  • Compute group-relative advantages (normalize rewards within each prompt group)
  • Calculate KL divergence between current policy and reference model
  • Package experiences and send to consumers via Ray
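The group-relative advantage is the core of GRPO: within each group of `num_generations` responses to the same prompt, rewards are normalized as A_i = (r_i - mean(r)) / (std(r) + eps), so no learned critic is needed. A minimal sketch:

```python
import statistics

def group_relative_advantages(rewards: list) -> list:
    """Normalize rewards within one prompt group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    eps = 1e-6                        # avoid division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that if every response in a group receives the same reward, all advantages collapse to zero and the group contributes no policy gradient; variants such as DAPO add filtering of such uninformative groups.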

Step 6: Policy Training

Consumers receive experiences and perform policy gradient updates. The GRPO loss encourages the model to increase probability of high-reward responses and decrease probability of low-reward responses within each group.

What happens per update:

  • Compute current action log probabilities from the actor model
  • Calculate policy loss using clipped importance sampling with group-relative advantages
  • Optionally compute pretraining auxiliary loss (PTX) to prevent catastrophic forgetting
  • Backward pass and gradient accumulation
  • Optimizer and scheduler step
  • Synchronize updated weights back to producers
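The clipped importance-sampling objective can be sketched for a single token/action as follows (a PPO-style surrogate with a group-relative advantage; illustrative, not ColossalAI's exact code):

```python
import math

def grpo_policy_loss(log_prob_new: float, log_prob_old: float,
                     advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate loss for one action: the ratio between the current
    and behavior policy is clipped to [1 - eps, 1 + eps], and the more
    pessimistic of the clipped/unclipped objectives is maximized."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)  # negate: we minimize
```

Taking the pessimistic minimum means large policy steps gain no extra credit for high-advantage actions and are penalized conservatively for low-advantage ones, which keeps updates close to the behavior policy that generated the rollouts.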

Step 7: Weight Synchronization and Iteration

After each training update, synchronize the updated model weights from consumers back to producers for the next round of experience collection. This producer-consumer cycle repeats for the configured number of episodes.

Key considerations:

  • Weight synchronization uses Ray object store for efficient transfer
  • The n_behind parameter controls how many updates producers can lag behind consumers
  • Evaluation can be triggered at configurable intervals during training
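The n_behind staleness bound can be expressed as a simple gate on version counters (a sketch of the semantics, not the actual synchronization code):

```python
def producer_may_generate(producer_version: int, consumer_version: int,
                          n_behind: int) -> bool:
    """Producers keep generating only while their weight snapshot lags the
    trainer by at most n_behind updates; otherwise they block and wait for
    a weight sync. n_behind = 0 enforces fully on-policy rollouts."""
    return consumer_version - producer_version <= n_behind
```

Allowing a small lag (n_behind > 0) lets generation and training overlap, trading a slightly off-policy rollout distribution for higher GPU utilization, which is the idea behind the zero-bubble pipeline variant mentioned above.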

Step 8: Checkpointing and Evaluation

Save model checkpoints at configured intervals and optionally run evaluation on a held-out dataset. Rollout samples can be logged for qualitative analysis.

Key considerations:

  • Checkpoints saved with configurable interval
  • Evaluation dataset can differ from training prompts
  • Rollout logs capture sample responses for inspection
  • Metrics logged to Weights & Biases including rewards, KL divergence, and loss

Execution Diagram

GitHub URL

Workflow Repository