
Principle:Allenai Open instruct Actor Coordination

From Leeroopedia


Knowledge Sources
Domains: Distributed Computing, Reinforcement Learning
Last Updated: 2026-02-07 00:00 GMT

Overview

Actor coordination is the practice of managing and synchronizing heterogeneous distributed actors (learners, inference engines, data preparers) in an asynchronous RL training pipeline.

Description

The GRPO training pipeline consists of multiple actor types that must work in concert:

  • vLLM inference engines (LLMRayActor): Generate completions from the current policy.
  • Policy learners (PolicyTrainerRayProcess): Perform forward/backward passes and update model weights.
  • Data preparation actor (DataPreparationActor): Orchestrates prompt feeding, result accumulation, reward computation, and batch packing.
  • Actor manager (ActorManager): Serves as the central coordination point for lifecycle management, queue monitoring, and performance tracking.

The coordination challenges include:

  1. Weight synchronization: After each training step, updated weights must be broadcast from the learners to the inference engines. This requires synchronizing across different Ray actor types that may be on different machines.
  2. Queue management: Prompts flow from the data preparation actor to inference engines via a prompt queue, and results flow back via a results queue. Queue sizes must be monitored to detect deadlocks or backpressure.
  3. Lifecycle management: When training completes (or fails), all actors must be notified to shut down gracefully. The should_stop signal propagates from the main loop through the actor manager to all engine actors.
  4. Performance monitoring: The actor manager tracks token throughput (prefill and decode), training step durations, generation batch durations, and KV-cache utilization. It exposes a web dashboard for real-time monitoring.
  5. Evaluation coordination: Evaluation prompts are interleaved with training prompts, with results routed to a separate evaluation queue.
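
The routing described in challenge 5 can be sketched in a few lines. The following is a minimal single-process illustration using Python's stdlib queues in place of Ray queues; the result fields and function name are illustrative, not open-instruct's actual API.

```python
import queue

# Results carry an "is_eval" flag set when their prompt was enqueued;
# a router sends each finished generation to the matching queue.
train_results: "queue.Queue" = queue.Queue()
eval_results: "queue.Queue" = queue.Queue()

def route_result(result: dict) -> None:
    """Send a finished generation to the train or eval results queue."""
    target = eval_results if result.get("is_eval") else train_results
    target.put(result)

route_result({"prompt_id": 1, "is_eval": False, "completion": "..."})
route_result({"prompt_id": 2, "is_eval": True, "completion": "..."})
```

Keeping evaluation results out of the training results queue means the accumulation logic never has to filter them out when assembling a training batch.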

Usage

Actor coordination is implicit in every GRPO training run. The ActorManager is created on the head node and passed to all other actors as a Ray actor handle. It serves as the single source of truth for training state (running vs. stopping) and performance metrics.
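
A "single source of truth" coordinator can be sketched as follows. In open-instruct the ActorManager is a Ray actor; here a thread-safe plain class stands in so the example is self-contained, and all names are illustrative rather than the library's own.

```python
import threading

class MiniActorManager:
    """Toy stand-in for a central coordinator: one stop flag, one metrics dict."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._should_stop = False
        self._metrics: dict = {"tokens_decoded": 0, "train_steps": 0}

    def should_stop(self) -> bool:
        # Engines poll this between generations to know when to exit.
        with self._lock:
            return self._should_stop

    def set_should_stop(self, value: bool) -> None:
        # Set by the main training loop on completion or failure.
        with self._lock:
            self._should_stop = value

    def report(self, key: str, amount: int = 1) -> None:
        # Actors push performance counters here (throughput, step counts, ...).
        with self._lock:
            self._metrics[key] = self._metrics.get(key, 0) + amount

    def snapshot(self) -> dict:
        with self._lock:
            return dict(self._metrics)

manager = MiniActorManager()
manager.report("train_steps")
manager.set_should_stop(True)
```

Because every actor holds a handle to the same manager, a single `set_should_stop(True)` is visible to all of them on their next poll, which is what makes centralized lifecycle management work.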

Theoretical Basis

The coordination pattern follows a centralized controller architecture:

                    ActorManager
                   /     |      \
                  /      |       \
    PromptQueue  /       |        \  ResultsQueue
                /        |         \
    vLLM_Engine_1  vLLM_Engine_2  ...  vLLM_Engine_N
                \        |         /
                 \       |        /
              DataPreparationActor
                        |
                PolicyTrainer_1 ... PolicyTrainer_M

The producer-consumer pattern between the data preparation actor and the inference engines is managed via Ray queues:

DataPreparationActor:
    for each step:
        push prompts to prompt_queue
        pull results from results_queue
        compute rewards and advantages
        pack sequences
        distribute to learner ranks

Each vLLM engine:
    while not should_stop:
        pull prompt from prompt_queue
        generate completion
        compute reward (optionally)
        push result to results_queue

Each PolicyTrainer:
    while has_data:
        pull collated batch from DataPreparationActor
        forward/backward pass
        broadcast weights to vLLM engines
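
The three loops above can be wired together in a runnable miniature. This sketch uses stdlib threads and queues in place of Ray actors and Ray queues, stubs generation as string concatenation, and stubs the reward as completion length; only the producer-consumer shape is faithful to the real pipeline.

```python
import queue
import threading

SENTINEL = object()  # plays the role of the ShutdownSentinel

prompt_queue: "queue.Queue" = queue.Queue()
results_queue: "queue.Queue" = queue.Queue()

def engine_loop() -> None:
    # One "vLLM engine": pull a prompt, generate, push the result,
    # until the shutdown sentinel arrives.
    while True:
        prompt = prompt_queue.get()
        if prompt is SENTINEL:
            break
        completion = prompt + " -> completion"  # stand-in for generation
        results_queue.put({"prompt": prompt, "completion": completion})

engine = threading.Thread(target=engine_loop)
engine.start()

# "DataPreparationActor" side: push one step's prompts, collect results,
# compute (stubbed) rewards, then signal completion.
prompts = ["p0", "p1", "p2"]
for p in prompts:
    prompt_queue.put(p)
results = [results_queue.get() for _ in prompts]
rewards = [len(r["completion"]) for r in results]

prompt_queue.put(SENTINEL)
engine.join()
```

Scaling this shape up to N engines is a matter of starting N consumer threads on the same queues and pushing one sentinel per engine at shutdown.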

Deadlock prevention: The system avoids deadlocks through several mechanisms:

  • Queue timeouts with retry logic in the accumulation function.
  • The should_stop signal allows engines to break out of their processing loops.
  • The data preparation actor sends a ShutdownSentinel to signal completion.
  • Background thread monitoring in the actor manager detects stuck actors.

Related Pages

Implemented By
