Implementation: AllenAI Open Instruct LLMRayActor
| Type | Class (Ray Actor) |
|---|---|
| Source | `open_instruct/vllm_utils.py:L560-1240` |
| Dependencies | vllm, ray, torch, openai, asyncio, threading |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete Ray actor from the Open Instruct library that wraps a vLLM engine for high-throughput asynchronous text generation, with optional tool support, for use in the GRPO training pipeline.
Description
LLMRayActor is a Ray actor class that encapsulates a full vLLM async inference engine. Each actor instance:
- Initializes a vLLM `AsyncLLMEngine` with the policy model weights.
- Starts an internal OpenAI-compatible API server (uvicorn) on a local port.
- Runs a background prefetch thread that pulls prompts from a shared Ray queue.
- Runs a background processing thread that handles completion outputs.
- Supports optional tool use: when tool actors are configured, the engine can parse tool calls from model output, execute them via Ray remote calls, append tool results, and continue generation (see the sketch after this list).
- Supports in-flight weight updates: model weights can be updated while requests are in flight.
- Supports reward computation: when a `RewardConfig` is provided, rewards are computed inline on the engine actor, reducing data transfer.

The actor communicates via three Ray queues:
- `prompt_queue`: receives `PromptRequest` objects from the data preparation actor.
- `results_queue`: sends `GenerationResult` objects back for training data.
- `eval_results_queue`: sends evaluation generation results.
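The tool loop can be pictured with the following hedged sketch. This is not the module's actual code: `parse_tool_call`, `tokenize`, and the tool actors' `run` method are illustrative names, and the engine call is schematic.

```python
# Hypothetical tool-use loop; helper names are illustrative, not real APIs.
async def generate_with_tools(engine, prompt_ids, tool_actors, max_tool_calls=5):
    history = list(prompt_ids)
    output = None
    for _ in range(max_tool_calls):
        output = await engine.generate(history)      # schematic: one model turn
        call = parse_tool_call(output.text)          # look for tool-call markup
        if call is None:
            break                                    # no tool requested: done
        # Execute the tool on its Ray actor, then splice the result back into
        # the token history and let the model continue generating.
        tool_result = await tool_actors[call.name].run.remote(call.args)
        history += list(output.token_ids) + tokenize(tool_result)
    return output
```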
Usage
LLMRayActor instances are created via the `create_vllm_engines()` factory function, which handles Ray placement groups, resource allocation, and engine initialization. Users do not typically instantiate this class directly.
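For orientation, a minimal sketch of what the factory roughly does for a single one-GPU engine, assuming the shared queues and `actor_manager` handle already exist. The real factory also handles multi-GPU tensor parallelism, bundle bookkeeping, and engine-argument plumbing.

```python
# Hypothetical sketch of the factory's core steps; not the actual implementation.
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

pg = placement_group([{"GPU": 1, "CPU": 1}])  # one bundle for this engine
ray.get(pg.ready())

engine = (
    ray.remote(LLMRayActor)
    .options(
        num_gpus=1,
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg),
    )
    .remote(
        prompt_queue=prompt_queue,
        results_queue=results_queue,
        eval_results_queue=eval_results_queue,
        actor_manager=actor_manager,
        inflight_updates=False,
        model="allenai/OLMo-2-7B",       # forwarded to vLLM via **kwargs
        tensor_parallel_size=1,
    )
)
```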
Code Reference
Source Location
- Repository: Open Instruct
- File: `open_instruct/vllm_utils.py`
Signature
```python
class LLMRayActor:
    def __init__(
        self,
        *args,
        tool_actors: list[ray.actor.ActorHandle] | None = None,
        tool_parser_type: str = "legacy",
        max_tool_calls: int = 5,
        mask_tool_use: bool = True,
        bundle_indices: list[int] | None = None,
        prompt_queue: ray_queue.Queue,
        results_queue: ray_queue.Queue,
        eval_results_queue: ray_queue.Queue,
        actor_manager: ray.actor.ActorHandle,
        inflight_updates: bool,
        reward_config: RewardConfig | None = None,
        train_dataset=None,
        eval_dataset=None,
        **kwargs,  # vLLM engine args: model, tensor_parallel_size, etc.
    ):
        ...
```
Import
```python
from open_instruct.vllm_utils import LLMRayActor, create_vllm_engines
```
I/O Contract
Inputs
| Name | Type | Description |
|---|---|---|
| `model` (via kwargs) | `str` | Model name or path for the vLLM engine. |
| `tensor_parallel_size` (via kwargs) | `int` | Number of GPUs for tensor parallelism within this engine. |
| `gpu_memory_utilization` (via kwargs) | `float` | Fraction of GPU memory to use for the KV cache (default 0.9). |
| `enforce_eager` (via kwargs) | `bool` | Whether to disable CUDA graph optimization. |
| `max_model_len` (via kwargs) | `int` | Maximum sequence length the engine can handle. |
| `prompt_queue` | `ray_queue.Queue` | Queue providing `PromptRequest` objects with tokenized prompts and generation config. |
| `tool_actors` | `list[ray.actor.ActorHandle] \| None` | Optional list of tool Ray actors for agentic generation. |
Outputs
| Name | Type | Description |
|---|---|---|
| `results_queue` | `ray_queue.Queue` | Queue receiving `GenerationResult` objects containing response token IDs, log-probabilities, finish reasons, tool interaction metadata, and reward scores. |
| `eval_results_queue` | `ray_queue.Queue` | Queue receiving evaluation generation results (identical schema). |
Key Methods
| Method | Description |
|---|---|
| `init_process_group(...)` | Initialize a torch distributed process group for weight synchronization with learner GPUs. |
| `update_weight(name, dtype, shape)` | Receive updated model weights via NCCL broadcast from the learner. |
| `update_weight_cuda_ipc(name, dtype, shape, ipc_handles)` | Receive updated weights via CUDA IPC (same-node optimization). |
| `reset_prefix_cache()` | Clear the vLLM prefix cache after weight updates. |
| `get_kv_cache_info()` | Return the maximum number of concurrent sequences the KV cache can support. |
| `get_model_dims()` | Return model dimensions (hidden size, number of layers, vocabulary size). |
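As a hedged illustration of how a learner might drive these methods, assuming the learner is rank 0 of the sync group created by `init_process_group` (the exact group setup is not shown and is an assumption):

```python
# Hypothetical learner-side weight sync round; group details are assumptions.
import ray
import torch.distributed as dist

def broadcast_weights(engines, model):
    for name, param in model.named_parameters():
        # Ask each engine to post a receive for this tensor via NCCL...
        futs = [
            e.update_weight.remote(name, dtype=param.dtype, shape=tuple(param.shape))
            for e in engines
        ]
        # ...while the learner (rank 0 of the sync group) broadcasts the values.
        dist.broadcast(param.data, src=0)
        ray.get(futs)
    # Cached prefixes were computed with the old weights; drop them.
    ray.get([e.reset_prefix_cache.remote() for e in engines])
```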
Usage Examples
```python
import ray
from ray.util.queue import Queue
from open_instruct.vllm_utils import create_vllm_engines

# Shared queues that connect data preparation, generation, and training.
prompt_queue = Queue()
results_queue = Queue()
eval_results_queue = Queue()
# actor_manager is an ActorManager handle created elsewhere in the pipeline.

# Create 2 vLLM engines, each with 1 GPU
engines = create_vllm_engines(
    num_engines=2,
    tensor_parallel_size=1,
    enforce_eager=False,
    tokenizer_name_or_path="allenai/OLMo-2-7B",
    pretrain="allenai/OLMo-2-7B",
    revision=None,
    seed=42,
    enable_prefix_caching=True,
    max_model_len=4096,
    vllm_gpu_memory_utilization=0.9,
    prompt_queue=prompt_queue,
    results_queue=results_queue,
    eval_results_queue=eval_results_queue,
    actor_manager=actor_manager,
)

# Engines are now running and pulling from prompt_queue automatically;
# results appear in results_queue as GenerationResult objects.
```
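On the consuming side, a minimal sketch, assuming the trainer drains `results_queue` directly; `process_training_batch` is a hypothetical hook, not part of the library.

```python
# Hypothetical consumer loop over generation results.
num_training_batches = 8
for _ in range(num_training_batches):
    result = results_queue.get()        # blocks until a GenerationResult arrives
    process_training_batch(result)      # hypothetical trainer hook
```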