
Implementation:Allenai Open instruct LLMRayActor

From Leeroopedia


Type Class (Ray Actor)
Source open_instruct/vllm_utils.py:L560-1240
Dependencies vllm, ray, torch, openai, asyncio, threading
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete Ray actor for high-throughput asynchronous text generation with optional tool support. It wraps a vLLM engine for use in the GRPO training pipeline and is provided by the Open Instruct library.

Description

LLMRayActor is a Ray actor class that encapsulates a full vLLM async inference engine. Each actor instance:

  1. Initializes a vLLM AsyncLLMEngine with the policy model weights.
  2. Starts an internal OpenAI-compatible API server (uvicorn) on a local port.
  3. Runs a background prefetch thread that pulls prompts from a shared Ray queue.
  4. Runs a background processing thread that handles completion outputs.
  5. Supports optional tool use: when tool actors are configured, the engine can parse tool calls from model output, execute them via Ray remote calls, append tool results, and continue generation.
  6. Supports inflight weight updates: model weights can be updated while requests are in flight.
  7. Supports reward computation: when a RewardConfig is provided, rewards are computed inline on the engine actor, reducing data transfer.
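The tool-use flow in item 5 can be sketched as a plain-Python loop. This is a hedged illustration, not the actor's real code: `generate`, `parse_tool_call`, and `execute_tool` are hypothetical stand-ins for the engine call, the configured tool parser, and the Ray remote call to a tool actor.

```python
def generate_with_tools(generate, parse_tool_call, execute_tool, prompt, max_tool_calls=5):
    """Agentic loop: generate, detect a tool call, append its result, continue."""
    transcript = prompt
    for _ in range(max_tool_calls):
        output = generate(transcript)
        transcript += output
        call = parse_tool_call(output)      # e.g. text between <tool> ... </tool>
        if call is None:
            break                           # no tool call: generation is finished
        transcript += execute_tool(call)    # real actor: a Ray remote call to a tool actor
    return transcript

# Toy engine, parser, and tool to exercise the loop.
def toy_generate(text):
    return "<tool>add 2 3</tool>" if "<tool_result>" not in text else " the answer is 5"

def toy_parse(output):
    return output.split("<tool>")[1].split("</tool>")[0] if "<tool>" in output else None

def toy_execute(call):
    _, a, b = call.split()
    return f"<tool_result>{int(a) + int(b)}</tool_result>"

final = generate_with_tools(toy_generate, toy_parse, toy_execute, "compute 2+3:")
print(final)  # compute 2+3:<tool>add 2 3</tool><tool_result>5</tool_result> the answer is 5
```

The `mask_tool_use` flag in the signature suggests that tool-produced tokens are additionally tracked so they can be masked out of the training loss.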

The actor communicates via three Ray queues:

  • prompt_queue: Receives PromptRequest objects from the data preparation actor.
  • results_queue: Sends GenerationResult objects back for training data.
  • eval_results_queue: Sends evaluation generation results.
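The prefetch/processing threads and the three-queue flow above can be mimicked with the standard library. A minimal sketch, assuming nothing about the real classes: `PromptRequest` and `GenerationResult` here are simplified stand-ins, and `queue.Queue` plays the role of the Ray queues.

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class PromptRequest:            # simplified stand-in for open_instruct's PromptRequest
    prompt: str
    is_eval: bool = False

@dataclass
class GenerationResult:         # simplified stand-in for open_instruct's GenerationResult
    response: str

def actor_loop(prompt_queue, results_queue, eval_results_queue, generate, stop):
    """One thread standing in for the prefetch + processing pair."""
    while not stop.is_set():
        try:
            request = prompt_queue.get(timeout=0.1)   # prefetch: pull the next prompt
        except queue.Empty:
            continue
        result = GenerationResult(response=generate(request.prompt))  # engine call
        sink = eval_results_queue if request.is_eval else results_queue
        sink.put(result)                              # processing: route the output

prompt_q, results_q, eval_q = queue.Queue(), queue.Queue(), queue.Queue()
stop = threading.Event()
worker = threading.Thread(target=actor_loop,
                          args=(prompt_q, results_q, eval_q, str.upper, stop))
worker.start()
prompt_q.put(PromptRequest("hello"))
prompt_q.put(PromptRequest("eval me", is_eval=True))
train_result = results_q.get(timeout=2)   # routed to the training queue
eval_result = eval_q.get(timeout=2)       # routed to the evaluation queue
stop.set()
worker.join()
```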

Usage

LLMRayActor instances are created via the create_vllm_engines() factory function, which handles Ray placement groups, resource allocation, and engine initialization. Users do not typically instantiate this class directly.

Code Reference

Source Location

open_instruct/vllm_utils.py, lines 560–1240
Signature

class LLMRayActor:
    def __init__(
        self,
        *args,
        tool_actors: list[ray.actor.ActorHandle] | None = None,
        tool_parser_type: str = "legacy",
        max_tool_calls: int = 5,
        mask_tool_use: bool = True,
        bundle_indices: list[int] | None = None,
        prompt_queue: ray_queue.Queue,
        results_queue: ray_queue.Queue,
        eval_results_queue: ray_queue.Queue,
        actor_manager: ray.actor.ActorHandle,
        inflight_updates: bool,
        reward_config: RewardConfig | None = None,
        train_dataset=None,
        eval_dataset=None,
        **kwargs,  # vLLM engine args: model, tensor_parallel_size, etc.
    ):

Import

from open_instruct.vllm_utils import LLMRayActor, create_vllm_engines

I/O Contract

Inputs

Name Type Description
model (via kwargs) str Model name or path for the vLLM engine.
tensor_parallel_size (via kwargs) int Number of GPUs for tensor parallelism within this engine.
gpu_memory_utilization (via kwargs) float Fraction of each GPU's memory vLLM may use for weights, activations, and KV-cache (default 0.9).
enforce_eager (via kwargs) bool Whether to disable CUDA graph optimization.
max_model_len (via kwargs) int Maximum sequence length the engine can handle.
prompt_queue ray_queue.Queue Queue providing PromptRequest objects with tokenized prompts and generation config.
tool_actors list[ray.actor.ActorHandle] | None Optional list of tool Ray actors for agentic generation.

Outputs

Name Type Description
results_queue ray_queue.Queue Queue receiving GenerationResult objects containing response token IDs, log-probabilities, finish reasons, tool interaction metadata, and reward scores.
eval_results_queue ray_queue.Queue Queue receiving evaluation generation results (identical schema).

Key Methods

Method Description
init_process_group(...) Initialize a torch distributed process group for weight synchronization with learner GPUs.
update_weight(name, dtype, shape) Receive updated model weights via NCCL broadcast from the learner.
update_weight_cuda_ipc(name, dtype, shape, ipc_handles) Receive updated weights via CUDA IPC (same-node optimization).
reset_prefix_cache() Clear the vLLM prefix cache after weight updates.
get_kv_cache_info() Return the maximum number of concurrent sequences the KV-cache can support.
get_model_dims() Return model dimensions (hidden size, num layers, vocab size).
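These methods are meant to be called in a fixed order: join the process group once, stream each tensor, then clear the prefix cache so KV entries computed under the old weights are dropped. A toy engine illustrating that call order (the method bodies are placeholders, not the real NCCL/IPC logic):

```python
class ToyEngine:
    """Records the weight-sync call order; the real methods move tensors over NCCL/IPC."""
    def __init__(self):
        self.weights = {}
        self.cache_resets = 0
        self.group_ready = False

    def init_process_group(self):
        self.group_ready = True          # real actor joins a torch distributed group

    def update_weight(self, name, dtype, shape):
        assert self.group_ready, "init_process_group must run first"
        self.weights[name] = (dtype, shape)   # real actor receives an NCCL broadcast

    def reset_prefix_cache(self):
        self.cache_resets += 1           # cached prefixes refer to the old weights

engine = ToyEngine()
engine.init_process_group()
engine.update_weight("lm_head.weight", "bfloat16", (4096, 50304))
engine.reset_prefix_cache()              # always clear after an update round
```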

Usage Examples

from open_instruct.vllm_utils import create_vllm_engines

# Create 2 vLLM engines, each with 1 GPU
engines = create_vllm_engines(
    num_engines=2,
    tensor_parallel_size=1,
    enforce_eager=False,
    tokenizer_name_or_path="allenai/OLMo-2-7B",
    pretrain="allenai/OLMo-2-7B",
    revision=None,
    seed=42,
    enable_prefix_caching=True,
    max_model_len=4096,
    vllm_gpu_memory_utilization=0.9,
    prompt_queue=prompt_queue,
    results_queue=results_queue,
    eval_results_queue=eval_results_queue,
    actor_manager=actor_manager,
)

# Engines are now running and pulling from prompt_queue automatically
# Results appear in results_queue as GenerationResult objects

Related Pages

Implements Principle
