
Workflow:Hpcaitech ColossalAI Distributed GRPO Training

From Leeroopedia


Knowledge Sources
Domains LLMs, RLHF, Distributed_Training, Reinforcement_Learning
Last Updated 2026-02-09 03:00 GMT

Overview

An end-to-end process for distributed Group Relative Policy Optimization (GRPO) training, built on a Ray-based producer-consumer architecture for scalable reinforcement learning from verifiable rewards.

Description

This workflow implements distributed GRPO training using ColossalAI's Ray-based producer-consumer framework. Producer processes generate multiple responses per prompt using the current policy, while consumer processes train the policy model using group-relative advantages computed from reward scores. The architecture supports verifiable reward functions (math, code verification) and learned reward models. It also supports algorithm variants including DAPO, REINFORCE++, and RLOO. The zero-bubble pipeline variant further optimizes GPU utilization by overlapping inference and training phases.

Usage

Execute this workflow when you need to train a language model using reinforcement learning with verifiable rewards at scale across multiple nodes. This is particularly suitable for math reasoning (GSM8K, competition math) and code generation tasks where reward signals can be computed automatically. GRPO is preferred over PPO when you want to eliminate the critic model and use group-relative advantages instead.

Execution Steps

Step 1: Dataset and Reward Configuration

Prepare a prompt dataset in JSONL format and configure the reward computation strategy. Rewards can come from verifiable functions (math answer checking, code execution) or from a learned reward model.

Key considerations:

  • Prompt dataset contains instruction prompts without completions
  • For math tasks: configure response format tags (think/answer tags for chain-of-thought)
  • For code tasks: configure code verification server for execution-based rewards
  • Reward function types: "think_answer_tags", "boxed", or "code"
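As a sketch of the "think_answer_tags" style, a verifiable reward can parse the response format and compare the extracted answer against a reference. The function name and partial-credit values below are illustrative assumptions, not ColossalAI's implementation:

```python
import re

def think_answer_reward(response: str, ground_truth: str) -> float:
    """Hypothetical verifiable reward for the "think_answer_tags" format:
    full credit only when the response contains the expected tags AND the
    extracted answer matches the reference; a small format-only credit
    when the tags are present but the answer is wrong."""
    match = re.search(
        r"<think>.*?</think>\s*<answer>(.*?)</answer>", response, re.DOTALL
    )
    if match is None:
        return 0.0  # missing chain-of-thought / answer tags: no reward
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.1
```

The "boxed" variant works the same way but extracts the final answer from a LaTeX `\boxed{...}` expression instead of answer tags.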

Step 2: Ray Cluster Initialization

Initialize the Ray cluster for distributed execution. The launcher queries available nodes and GPUs, then allocates producer and consumer processes across the cluster.

What happens:

  • Initialize Ray with local or multi-node addressing
  • Query node resources and GPU availability
  • Allocate producer processes to inference-optimized nodes
  • Allocate consumer processes to training-optimized nodes
  • Configure communication between producer and consumer actors
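The allocation step above can be sketched as a greedy split of the cluster's GPUs into producer and consumer slots. This helper is illustrative, not the actual ColossalAI launcher API:

```python
def plan_placement(node_gpus: dict, num_producer_gpus: int):
    """Greedily split cluster GPUs into producer (inference) slots and
    consumer (training) slots. Each slot is a (node, local_rank) pair."""
    producers, consumers = [], []
    for node, gpus in node_gpus.items():
        for local_rank in range(gpus):
            slot = (node, local_rank)
            if len(producers) < num_producer_gpus:
                producers.append(slot)
            else:
                consumers.append(slot)
    return producers, consumers
```

In a real launch you would first call `ray.init(address="auto")` (or `ray.init()` for a single node) and derive `node_gpus` from the resources reported by `ray.nodes()`, so that producers land on inference-optimized nodes and consumers on training-optimized ones.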

Step 3: Producer Setup (Inference Workers)

Create producer processes that handle response generation. Each producer loads the inference model with a configurable backend (HuggingFace Transformers or vLLM) and generates multiple responses per prompt.

What happens:

  • Load inference model with selected backend
  • Configure generation parameters (temperature, top-k, top-p, max tokens)
  • Support temperature annealing for curriculum-style training
  • Generate num_generations responses per prompt in batch
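Temperature annealing gives early training high-entropy exploration and later training more focused sampling. A minimal sketch, assuming a linear schedule (the schedule shape and parameter names are assumptions, not the framework's exact API):

```python
def annealed_temperature(step: int, total_steps: int,
                         t_start: float = 1.0, t_end: float = 0.7) -> float:
    """Linearly anneal sampling temperature from t_start to t_end
    over the course of training (illustrative schedule)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return t_start + frac * (t_end - t_start)

# Example generation parameters for a producer at the start of training.
generation_config = dict(
    temperature=annealed_temperature(step=0, total_steps=1000),
    top_k=50,
    top_p=0.9,
    max_new_tokens=1024,
    num_return_sequences=8,  # num_generations responses per prompt
)
```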

Step 4: Consumer Setup (Training Workers)

Create consumer processes that train the actor model using policy gradient updates. Consumers receive generated experiences from producers and apply GRPO policy loss with optional pretraining auxiliary loss.

What happens:

  • Initialize actor model with ColossalAI booster and selected parallelism plugin
  • Load reference model for KL divergence computation
  • Configure GRPOConsumer with loss parameters (KL coefficient, clip epsilon)
  • Support multiple algorithm variants (GRPO, DAPO, REINFORCE++, RLOO)
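The consumer's loss configuration can be summarized as a small set of knobs. Parameter names and default values below are illustrative; check the `GRPOConsumer` signature in the repository for the exact names:

```python
# Hypothetical GRPO consumer configuration (names/values are assumptions).
grpo_consumer_config = {
    "algorithm": "GRPO",  # or "DAPO", "REINFORCE++", "RLOO"
    "kl_coef": 0.01,      # weight of the KL penalty vs. the frozen reference model
    "clip_eps": 0.2,      # clipping range for the importance-sampling ratio
    "ptx_coef": 0.0,      # optional pretraining auxiliary (PTX) loss weight
}
```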

Step 5: Experience Collection

Producers generate responses, compute rewards, and package experiences. Each experience contains the prompt, generated sequences, log probabilities, reward scores, and computed advantages.

What happens:

  • Producers sample prompts from the dataset
  • Generate multiple completions per prompt
  • Score completions using reward functions or reward model
  • Compute group-relative advantages (normalize rewards within each prompt group)
  • Calculate KL divergence between current policy and reference model
  • Package experiences and send to consumers via Ray
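The group-relative advantage is the core of GRPO: within each group of `num_generations` responses to the same prompt, rewards are normalized as A_i = (r_i - mean(r)) / (std(r) + eps), so no learned critic is needed. A minimal sketch:

```python
import statistics

def group_relative_advantages(rewards: list) -> list:
    """Normalize rewards within one prompt group:
    A_i = (r_i - mean(r)) / (std(r) + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    eps = 1e-6                        # avoid division by zero when all rewards tie
    return [(r - mean) / (std + eps) for r in rewards]
```

Note that if every response in a group receives the same reward, all advantages collapse to zero and the group contributes no policy gradient; variants such as DAPO add filtering of such uninformative groups.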

Step 6: Policy Training

Consumers receive experiences and perform policy gradient updates. The GRPO loss encourages the model to increase probability of high-reward responses and decrease probability of low-reward responses within each group.

What happens per update:

  • Compute current action log probabilities from the actor model
  • Calculate policy loss using clipped importance sampling with group-relative advantages
  • Optionally compute pretraining auxiliary loss (PTX) to prevent catastrophic forgetting
  • Backward pass and gradient accumulation
  • Optimizer and scheduler step
  • Synchronize updated weights back to producers
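The clipped importance-sampling objective can be sketched for a single token/action as follows (a PPO-style surrogate with a group-relative advantage; illustrative, not ColossalAI's exact code):

```python
import math

def grpo_policy_loss(log_prob_new: float, log_prob_old: float,
                     advantage: float, clip_eps: float = 0.2) -> float:
    """Clipped surrogate loss for one action: the ratio between the current
    and behavior policy is clipped to [1 - eps, 1 + eps], and the more
    pessimistic of the clipped/unclipped objectives is maximized."""
    ratio = math.exp(log_prob_new - log_prob_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)  # negate: we minimize
```

Taking the pessimistic minimum means large policy steps gain no extra credit for high-advantage actions and are penalized conservatively for low-advantage ones, which keeps updates close to the behavior policy that generated the rollouts.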

Step 7: Weight Synchronization and Iteration

After each training update, synchronize the updated model weights from consumers back to producers for the next round of experience collection. This producer-consumer cycle repeats for the configured number of episodes.

Key considerations:

  • Weight synchronization uses Ray object store for efficient transfer
  • The n_behind parameter controls how many updates producers can lag behind consumers
  • Evaluation can be triggered at configurable intervals during training
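The n_behind staleness bound can be expressed as a simple gate on version counters (a sketch of the semantics, not the actual synchronization code):

```python
def producer_may_generate(producer_version: int, consumer_version: int,
                          n_behind: int) -> bool:
    """Producers keep generating only while their weight snapshot lags the
    trainer by at most n_behind updates; otherwise they block and wait for
    a weight sync. n_behind = 0 enforces fully on-policy rollouts."""
    return consumer_version - producer_version <= n_behind
```

Allowing a small lag (n_behind > 0) lets generation and training overlap, trading a slightly off-policy rollout distribution for higher GPU utilization, which is the idea behind the zero-bubble pipeline variant mentioned above.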

Step 8: Checkpointing and Evaluation

Save model checkpoints at configured intervals and optionally run evaluation on a held-out dataset. Rollout samples can be logged for qualitative analysis.

Key considerations:

  • Checkpoints saved with configurable interval
  • Evaluation dataset can differ from training prompts
  • Rollout logs capture sample responses for inspection
  • Metrics logged to Weights & Biases including rewards, KL divergence, and loss

Execution Diagram

GitHub URL

Workflow Repository