Principle: Allenai open-instruct Actor Coordination
| Knowledge Sources | |
|---|---|
| Domains | Distributed Computing, Reinforcement Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Actor coordination is the practice of managing and synchronizing heterogeneous distributed actors (learners, inference engines, data preparers) in an asynchronous RL training pipeline.
Description
The GRPO training pipeline consists of multiple actor types that must work in concert:
- vLLM inference engines (LLMRayActor): Generate completions from the current policy.
- Policy learners (PolicyTrainerRayProcess): Perform forward/backward passes and update model weights.
- Data preparation actor (DataPreparationActor): Orchestrate prompt feeding, result accumulation, reward computation, and batch packing.
- Actor manager (ActorManager): Central coordination point for lifecycle management, queue monitoring, and performance tracking.
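The four roles above can be sketched as plain Python classes. In open-instruct these are Ray actors; all method names and bodies here are illustrative assumptions, not the actual API.

```python
class ActorManager:
    """Central coordinator: lifecycle state, queue stats, perf metrics."""
    def __init__(self):
        self._should_stop = False
    def set_should_stop(self, value):
        self._should_stop = value
    def should_stop(self):
        return self._should_stop

class LLMRayActor:
    """vLLM inference engine: generates completions from the current policy."""
    def __init__(self, manager):
        self.manager = manager
    def generate(self, prompt):
        return prompt + " -> completion"   # stand-in for vLLM generation

class PolicyTrainerRayProcess:
    """Learner: forward/backward passes and weight updates."""
    def train_step(self, batch):
        return {"loss": 0.0}               # stand-in for a real training step

class DataPreparationActor:
    """Orchestrates prompt feeding, accumulation, rewards, and packing."""
    def pack(self, results):
        return list(results)               # stand-in for sequence packing
```

Each worker holds a reference to the manager, which is how lifecycle signals reach every actor.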
The coordination challenges include:
- Weight synchronization: After each training step, updated weights must be broadcast from the learners to the inference engines. This requires synchronizing across different Ray actor types that may be on different machines.
- Queue management: Prompts flow from the data preparation actor to inference engines via a prompt queue, and results flow back via a results queue. Queue sizes must be monitored to detect deadlocks or backpressure.
- Lifecycle management: When training completes (or fails), all actors must be notified to shut down gracefully. The should_stop signal propagates from the main loop through the actor manager to all engine actors.
- Performance monitoring: The actor manager tracks token throughput (prefill and decode), training step durations, generation batch durations, and KV-cache utilization. It exposes a web dashboard for real-time monitoring.
- Evaluation coordination: Evaluation prompts are interleaved with training prompts, with results routed to a separate evaluation queue.
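The queue-management point above can be illustrated with a minimal occupancy check on a bounded stdlib queue. The helper name, threshold, and status strings are assumptions for the sketch, not open-instruct's monitoring code.

```python
import queue

def queue_health(q: queue.Queue, capacity: int, high_water: float = 0.9) -> str:
    """Classify queue occupancy to surface backpressure or stalls."""
    fill = q.qsize() / capacity
    if fill >= high_water:
        return "backpressure"   # consumers are falling behind producers
    if fill == 0.0:
        return "idle"           # possible producer stall / deadlock hint
    return "ok"

# A bounded prompt queue filled close to capacity trips the check.
prompt_queue = queue.Queue(maxsize=10)
for i in range(9):
    prompt_queue.put(f"prompt-{i}")
```

In a real pipeline this kind of check would run periodically in the actor manager's monitoring loop.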
Usage
Actor coordination is implicit in every GRPO training run. The ActorManager is created on the head node and passed to all other actors as a Ray actor handle. It serves as the single source of truth for training state (running vs. stopping) and performance metrics.
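The "single source of truth" idea can be shown with a stdlib sketch: one manager instance whose handle (here, a plain object reference) is shared by every worker. In open-instruct the handle is a Ray actor handle instead; the class and method names below are illustrative.

```python
class ActorManager:
    """Holds shared training state that all workers consult."""
    def __init__(self):
        self.state = "running"
    def stop(self):
        self.state = "stopping"

class Worker:
    def __init__(self, manager):
        self.manager = manager          # shared handle, not a copy
    def running(self):
        return self.manager.state == "running"

manager = ActorManager()
workers = [Worker(manager) for _ in range(3)]
manager.stop()                           # one call flips state for all workers
```

Because every worker queries the same handle, a single state change is immediately visible everywhere, which is what makes the manager a reliable coordination point.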
Theoretical Basis
The coordination pattern follows a centralized controller architecture:
                         ActorManager
                        /     |     \
        PromptQueue    /      |      \    ResultsQueue
                      /       |       \
      vLLM_Engine_1   vLLM_Engine_2  ...  vLLM_Engine_N
                      \       |       /
                       \      |      /
                    DataPreparationActor
                              |
              PolicyTrainer_1  ...  PolicyTrainer_M
The producer-consumer pattern between the data preparation actor and the inference engines is managed via Ray queues:
DataPreparationActor:
    for each step:
        push prompts to prompt_queue
        pull results from results_queue
        compute rewards and advantages
        pack sequences
        distribute to learner ranks

Each vLLM engine:
    while not should_stop:
        pull prompt from prompt_queue
        generate completion
        compute reward (optionally)
        push result to results_queue

Each PolicyTrainer:
    while has_data:
        pull collated batch from DataPreparationActor
        forward/backward pass
        broadcast weights to vLLM engines
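The producer-consumer loops above can be mirrored with stdlib threads and queues. Open-instruct uses Ray queues and actors across machines, so this single-process version is only a sketch of the pattern; the None sentinel stands in for the should_stop signal.

```python
import queue
import threading

prompt_queue: "queue.Queue" = queue.Queue()
results_queue: "queue.Queue" = queue.Queue()

def engine_loop():
    """vLLM-engine stand-in: pull prompts, 'generate', push results."""
    while True:
        prompt = prompt_queue.get()
        if prompt is None:               # sentinel doubles as should_stop
            break
        results_queue.put(prompt + ":completion")

engine = threading.Thread(target=engine_loop)
engine.start()

# DataPreparationActor stand-in: push prompts, then collect results.
prompts = [f"p{i}" for i in range(4)]
for p in prompts:
    prompt_queue.put(p)
prompt_queue.put(None)                   # signal shutdown
engine.join()

results = [results_queue.get() for _ in prompts]
```

With multiple engine threads, the same queues fan prompts out and fan results back in, which is exactly the topology the diagram above describes.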
Deadlock prevention: The system avoids deadlocks through several mechanisms:
- Queue timeouts with retry logic in the accumulation function.
- The should_stop signal allows engines to break out of their processing loops.
- The data preparation actor sends a ShutdownSentinel to signal completion.
- Background thread monitoring in the actor manager detects stuck actors.
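The first three mechanisms can be combined in one accumulation sketch: pull with a timeout, retry a bounded number of times, and stop early on a sentinel. The timeout, retry count, and function name are illustrative assumptions, not open-instruct's actual values.

```python
import queue

class ShutdownSentinel:
    """Marker object signaling that no more results will arrive."""

def accumulate(results_queue, expected, timeout=0.01, max_retries=3):
    """Collect up to `expected` results without blocking forever."""
    out = []
    retries = 0
    while len(out) < expected:
        try:
            item = results_queue.get(timeout=timeout)
        except queue.Empty:
            retries += 1
            if retries >= max_retries:   # give up instead of deadlocking
                break
            continue
        if isinstance(item, ShutdownSentinel):
            break                        # producer finished early
        out.append(item)
    return out

q = queue.Queue()
q.put("r1")
q.put(ShutdownSentinel())
```

Calling accumulate(q, expected=5) on this queue returns only ["r1"]: the sentinel ends accumulation early, and an empty queue exhausts the retries, so neither case can hang the pipeline.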