
Implementation:Allenai Open instruct LLMRayActor

From Leeroopedia


Type Class (Ray Actor)
Source open_instruct/vllm_utils.py:L560-1240
Dependencies vllm, ray, torch, openai, asyncio, threading
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete Ray actor for high-throughput asynchronous text generation with optional tool support. It wraps a vLLM engine for use in the GRPO training pipeline and is provided by the Open Instruct library.

Description

LLMRayActor is a Ray actor class that encapsulates a full vLLM async inference engine. Each actor instance:

  1. Initializes a vLLM AsyncLLMEngine with the policy model weights.
  2. Starts an internal OpenAI-compatible API server (uvicorn) on a local port.
  3. Runs a background prefetch thread that pulls prompts from a shared Ray queue.
  4. Runs a background processing thread that handles completion outputs.
  5. Supports optional tool use: when tool actors are configured, the engine can parse tool calls from model output, execute them via Ray remote calls, append tool results, and continue generation.
  6. Supports inflight weight updates: model weights can be updated while requests are in flight.
  7. Supports reward computation: when a RewardConfig is provided, rewards are computed inline on the engine actor, reducing data transfer.
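The tool-use flow in item 5 can be sketched as a plain-Python loop. This is a hedged illustration, not the actor's real code: `generate`, `parse_tool_call`, and `execute_tool` are hypothetical stand-ins for the engine call, the configured tool parser, and the Ray remote call to a tool actor.

```python
def generate_with_tools(generate, parse_tool_call, execute_tool, prompt, max_tool_calls=5):
    """Agentic loop: generate, detect a tool call, append its result, continue."""
    transcript = prompt
    for _ in range(max_tool_calls):
        output = generate(transcript)
        transcript += output
        call = parse_tool_call(output)      # e.g. text between <tool> ... </tool>
        if call is None:
            break                           # no tool call: generation is finished
        transcript += execute_tool(call)    # real actor: a Ray remote call to a tool actor
    return transcript

# Toy engine, parser, and tool to exercise the loop.
def toy_generate(text):
    return "<tool>add 2 3</tool>" if "<tool_result>" not in text else " the answer is 5"

def toy_parse(output):
    return output.split("<tool>")[1].split("</tool>")[0] if "<tool>" in output else None

def toy_execute(call):
    _, a, b = call.split()
    return f"<tool_result>{int(a) + int(b)}</tool_result>"

final = generate_with_tools(toy_generate, toy_parse, toy_execute, "compute 2+3:")
print(final)  # compute 2+3:<tool>add 2 3</tool><tool_result>5</tool_result> the answer is 5
```

The `mask_tool_use` flag in the signature suggests that tool-produced tokens are additionally tracked so they can be masked out of the training loss.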

The actor communicates via three Ray queues:

  • prompt_queue: Receives PromptRequest objects from the data preparation actor.
  • results_queue: Sends GenerationResult objects back for training data.
  • eval_results_queue: Sends evaluation generation results.
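The prefetch/processing threads and the three-queue flow above can be mimicked with the standard library. A minimal sketch, assuming nothing about the real classes: `PromptRequest` and `GenerationResult` here are simplified stand-ins, and `queue.Queue` plays the role of the Ray queues.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class PromptRequest:            # simplified stand-in for open_instruct's PromptRequest
    prompt: str
    is_eval: bool = False

@dataclass
class GenerationResult:         # simplified stand-in for open_instruct's GenerationResult
    response: str

def actor_loop(prompt_queue, results_queue, eval_results_queue, generate, stop):
    """One thread standing in for the prefetch + processing pair."""
    while not stop.is_set():
        try:
            request = prompt_queue.get(timeout=0.1)   # prefetch: pull the next prompt
        except queue.Empty:
            continue
        result = GenerationResult(response=generate(request.prompt))  # engine call
        sink = eval_results_queue if request.is_eval else results_queue
        sink.put(result)                              # processing: route the output

prompt_q, results_q, eval_q = queue.Queue(), queue.Queue(), queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=actor_loop,
                          args=(prompt_q, results_q, eval_q, str.upper, stop))
worker.start()
prompt_q.put(PromptRequest("hello"))
prompt_q.put(PromptRequest("eval me", is_eval=True))
train_result = results_q.get(timeout=2)   # routed to the training queue
eval_result = eval_q.get(timeout=2)       # routed to the evaluation queue
stop.set()
worker.join()
```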

Usage

LLMRayActor instances are created via the create_vllm_engines() factory function, which handles Ray placement groups, resource allocation, and engine initialization. Users do not typically instantiate this class directly.

Code Reference

Source Location

open_instruct/vllm_utils.py, lines 560–1240
Signature

class LLMRayActor:
    def __init__(
        self,
        *args,
        tool_actors: list[ray.actor.ActorHandle] | None = None,
        tool_parser_type: str = "legacy",
        max_tool_calls: int = 5,
        mask_tool_use: bool = True,
        bundle_indices: list[int] | None = None,
        prompt_queue: ray_queue.Queue,
        results_queue: ray_queue.Queue,
        eval_results_queue: ray_queue.Queue,
        actor_manager: ray.actor.ActorHandle,
        inflight_updates: bool,
        reward_config: RewardConfig | None = None,
        train_dataset=None,
        eval_dataset=None,
        **kwargs,  # vLLM engine args: model, tensor_parallel_size, etc.
    ):

Import

from open_instruct.vllm_utils import LLMRayActor, create_vllm_engines

I/O Contract

Inputs

Name Type Description
model (via kwargs) str Model name or path for the vLLM engine.
tensor_parallel_size (via kwargs) int Number of GPUs for tensor parallelism within this engine.
gpu_memory_utilization (via kwargs) float Fraction of each GPU's memory vLLM may use for weights, activations, and KV-cache (default 0.9).
enforce_eager (via kwargs) bool Whether to disable CUDA graph optimization.
max_model_len (via kwargs) int Maximum sequence length the engine can handle.
prompt_queue ray_queue.Queue Queue providing PromptRequest objects with tokenized prompts and generation config.
tool_actors list[ray.actor.ActorHandle] | None Optional list of tool Ray actors for agentic generation.

Outputs

Name Type Description
results_queue ray_queue.Queue Queue receiving GenerationResult objects containing response token IDs, log-probabilities, finish reasons, tool interaction metadata, and reward scores.
eval_results_queue ray_queue.Queue Queue receiving evaluation generation results (identical schema).

Key Methods

Method Description
init_process_group(...) Initialize a torch distributed process group for weight synchronization with learner GPUs.
update_weight(name, dtype, shape) Receive updated model weights via NCCL broadcast from the learner.
update_weight_cuda_ipc(name, dtype, shape, ipc_handles) Receive updated weights via CUDA IPC (same-node optimization).
reset_prefix_cache() Clear the vLLM prefix cache after weight updates.
get_kv_cache_info() Return the maximum number of concurrent sequences the KV-cache can support.
get_model_dims() Return model dimensions (hidden size, num layers, vocab size).
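These methods are meant to be called in a fixed order: join the process group once, stream each tensor, then clear the prefix cache so KV entries computed under the old weights are dropped. A toy engine illustrating that call order (the method bodies are placeholders, not the real NCCL/IPC logic):

```python
class ToyEngine:
    """Records the weight-sync call order; the real methods move tensors over NCCL/IPC."""
    def __init__(self):
        self.weights = {}
        self.cache_resets = 0
        self.group_ready = False

    def init_process_group(self):
        self.group_ready = True          # real actor joins a torch distributed group

    def update_weight(self, name, dtype, shape):
        assert self.group_ready, "init_process_group must run first"
        self.weights[name] = (dtype, shape)   # real actor receives an NCCL broadcast

    def reset_prefix_cache(self):
        self.cache_resets += 1           # cached prefixes refer to the old weights

engine = ToyEngine()
engine.init_process_group()
engine.update_weight("lm_head.weight", "bfloat16", (4096, 50304))
engine.reset_prefix_cache()              # always clear after an update round
```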

Usage Examples

from open_instruct.vllm_utils import create_vllm_engines

# Create 2 vLLM engines, each with 1 GPU
engines = create_vllm_engines(
    num_engines=2,
    tensor_parallel_size=1,
    enforce_eager=False,
    tokenizer_name_or_path="allenai/OLMo-2-7B",
    pretrain="allenai/OLMo-2-7B",
    revision=None,
    seed=42,
    enable_prefix_caching=True,
    max_model_len=4096,
    vllm_gpu_memory_utilization=0.9,
    prompt_queue=prompt_queue,
    results_queue=results_queue,
    eval_results_queue=eval_results_queue,
    actor_manager=actor_manager,
)

# Engines are now running and pulling from prompt_queue automatically
# Results appear in results_queue as GenerationResult objects

Related Pages

Implements Principle
