Implementation:Volcengine Verl Multimodal Rollout Request
| Field | Value |
|---|---|
| Knowledge Sources | verl source code, rollout schemas module |
| Domains | Multimodal Rollout, VLM Inference, Request Management |
| Last Updated | 2026-02-07 |
Overview
Description
AsyncRolloutRequest is a Pydantic BaseModel that encapsulates the complete state of a single rollout request during multi-turn generation. For VLM (Vision-Language Model) scenarios, it carries multimodal data alongside the standard text tokens.
The class manages three distinct multimodal data representations:
- multi_modal_data (dict[str, Any]): Raw multimodal data organized by modality key (e.g., {"image": [PIL.Image, ...], "video": [...]}). This is the user-facing data format used when constructing messages and applying chat templates.
- multi_modal_inputs (dict[str, torch.Tensor]): Processed tensor representations of multimodal data (e.g., pixel_values, image_grid_thw) produced by the processor's tokenization. These are the model-ready inputs.
- multi_modal_keys (list[str]): The modality types present in this request. Defaults to ["image", "video"] if not specified.
Message construction follows the _build_messages pattern (via RLHFDataset._build_messages), which replaces <image> and <video> placeholder strings in message content with structured content dicts, such as {"type": "image"}, that the processor understands.
During multi-turn rollout, when tools return images, the add_tool_response_messages method updates both multi_modal_data and multi_modal_inputs incrementally, maintaining consistency across the growing conversation.
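The incremental update of the raw data can be pictured with a small sketch. The helper name merge_multi_modal_data is hypothetical and is not part of verl; it only illustrates appending tool-returned items to the per-modality lists while keeping every key in multi_modal_keys present. How verl refreshes multi_modal_inputs (the processed tensors) is not covered here.

```python
from typing import Any


def merge_multi_modal_data(
    current: dict[str, list[Any]],
    new_items: dict[str, list[Any]],
    keys: tuple[str, ...] = ("image", "video"),
) -> dict[str, list[Any]]:
    """Return a new dict with new_items appended per modality key.

    Hypothetical helper for illustration only; not a verl API.
    """
    # Copy the existing lists so the caller's dict is left untouched.
    merged = {key: list(current.get(key, [])) for key in keys}
    for key in keys:
        merged[key].extend(new_items.get(key, []))
    return merged


# Strings stand in for PIL images to keep the sketch dependency-free.
data = {"image": ["prompt_image"], "video": []}
data_after_tool = merge_multi_modal_data(data, {"image": ["tool_image"]})
print(data_after_tool)
```

Returning a new dict rather than mutating in place is a choice made for this sketch so that earlier turns can still be inspected; the actual method may update the request's fields directly.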
Usage
AsyncRolloutRequest instances are created by the rollout manager for each prompt in a batch. The model validator (initialize_request) handles tokenization, position ID computation, and multimodal input processing automatically at construction time.
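The idea of an argument that is consumed at construction time but not stored, while derived fields are filled in automatically, can be illustrated with a standard-library analogue. This sketch uses dataclasses.InitVar and __post_init__ rather than a Pydantic model validator, and the names SketchRequest, tokenize, and toy_tokenize are assumptions made for illustration; it does not reproduce initialize_request.

```python
from dataclasses import InitVar, dataclass, field
from typing import Callable


@dataclass
class SketchRequest:
    """Stdlib analogue of a request whose derived fields are set at construction."""

    request_id: str
    text: str
    # InitVar: accepted by __init__, forwarded to __post_init__, never stored.
    tokenize: InitVar[Callable[[str], list[int]]]
    # Derived fields: not passed by the caller, populated in __post_init__.
    input_ids: list[int] = field(init=False)
    attention_mask: list[int] = field(init=False)

    def __post_init__(self, tokenize: Callable[[str], list[int]]) -> None:
        self.input_ids = tokenize(self.text)
        self.attention_mask = [1] * len(self.input_ids)


def toy_tokenize(text: str) -> list[int]:
    """Toy tokenizer (one ID per character), standing in for a real one."""
    return [ord(ch) for ch in text]


req = SketchRequest(request_id="req_001", text="hi", tokenize=toy_tokenize)
print(req.input_ids, req.attention_mask)
print(hasattr(req, "tokenize"))  # the tokenizer itself is not kept on the instance
```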
Code Reference
| Field | Value |
|---|---|
| Source Location | verl/workers/rollout/schemas.py, Lines 81-673 |
| Class | AsyncRolloutRequest(BaseModel) |
| Import | from verl.workers.rollout.schemas import AsyncRolloutRequest |
| Related | verl/utils/dataset/rl_dataset.py, method RLHFDataset._build_messages (Lines 285-340) |
I/O Contract
Inputs (Construction Fields)
| Field | Type | Default | Description |
|---|---|---|---|
| request_id | str | required | Unique identifier for this rollout request. |
| state | AsyncRolloutRequestStateEnum | required | Current request state (PENDING, RUNNING, COMPLETED, FAILED, TOOL_CALLING, INTERACTING). |
| messages | list[Message] | required | Chat messages in OpenAI format. |
| multi_modal_keys | Optional[list[str]] | ["image", "video"] | Modality types present in this request. |
| multi_modal_data | Optional[dict[str, Any]] | {} | Raw multimodal data by modality key. |
| multi_modal_inputs | Optional[dict[str, torch.Tensor]] | {} | Processed tensor inputs from the processor. |
| tool_schemas | Optional[list[OpenAIFunctionToolSchema]] | None | Tool schemas for tool-calling rollouts. |
| max_prompt_len | int | required | Maximum prompt length for truncation. |
| max_response_len | int | 8192 | Maximum response length. |
| max_model_len | int | 32768 | Maximum total sequence length. |
| processing_class | PreTrainedTokenizer or ProcessorMixin | required | Tokenizer or processor for chat template application (consumed during validation, not stored). |
Outputs (Populated by Validator)
| Field | Type | Description |
|---|---|---|
| input_ids | torch.Tensor | Tokenized input IDs with generation prompt appended. |
| attention_mask | torch.Tensor | Attention mask for the input sequence. |
| position_ids | torch.Tensor | Position IDs (special handling for Qwen2-VL with 3D rope positions). |
| loss_mask | torch.Tensor | Binary mask indicating which tokens contribute to the loss. |
| multi_modal_inputs | dict[str, torch.Tensor] | Processed multimodal tensors (e.g., pixel_values, image_grid_thw). |
| prompt_ids | torch.Tensor | Copy of input_ids for the prompt portion. |
| generation_prompt_ids | torch.Tensor | Token IDs for the generation prompt suffix. |
Usage Examples
Creating a multimodal rollout request:
from verl.workers.rollout.schemas import AsyncRolloutRequest, AsyncRolloutRequestStateEnum
from PIL import Image
# `processor` is assumed to be a VLM processor loaded beforehand (e.g., Qwen2VLProcessor)
# Load an image for a VLM rollout
image = Image.open("geometry_diagram.png")
request = AsyncRolloutRequest(
    request_id="req_001",
    state=AsyncRolloutRequestStateEnum.PENDING,
    messages=[
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the area of this triangle?"},
        ]},
    ],
    multi_modal_data={"image": [image], "video": []},
    max_prompt_len=2048,
    max_response_len=4096,
    processing_class=processor,  # consumed during validation, not stored
    use_inference_chat_template=False,
    tokenization_sanity_check_mode="strict",
)
# After validation, multi_modal_inputs contains processed tensors
print(f"Input shape: {request.input_ids.shape}")
print(f"Multi-modal keys: {list(request.multi_modal_inputs.keys())}")
# e.g., ['pixel_values', 'image_grid_thw']
Adding tool responses with images during multi-turn rollout:
from verl.tools.schemas import ToolResponse
# `request` and `processor` come from the previous example;
# `rendered_image` is assumed to be a PIL Image produced by tool execution
# Tool returns an image as part of its response
tool_response = ToolResponse(
    text="Here is the rendered diagram:",
    image=[rendered_image],
)
# This updates multi_modal_data, multi_modal_inputs, input_ids, etc.
request.add_tool_response_messages(
    processing_class=processor,
    contents=[tool_response],
)
RLHFDataset._build_messages replacing placeholders:
# In RLHFDataset._build_messages(example):
# Input message: {"role": "user", "content": "<image>\nDescribe this image."}
# After processing: {"role": "user", "content": [
# {"type": "image"},
# {"type": "text", "text": "\nDescribe this image."}
# ]}
# The <image> placeholder is replaced with a structured image content dict
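The transformation described in the comments above can be sketched as a small standalone function. The name split_placeholders and the regex-based approach are assumptions made for illustration; the actual logic lives in RLHFDataset._build_messages and may differ in detail.

```python
import re

# Capturing group so that re.split keeps the placeholders in its output.
_PLACEHOLDER = re.compile(r"(<image>|<video>)")


def split_placeholders(content: str) -> list[dict]:
    """Turn a string with <image>/<video> placeholders into content dicts.

    Hypothetical helper for illustration only; not a verl API.
    """
    parts = []
    for segment in _PLACEHOLDER.split(content):
        if segment == "<image>":
            parts.append({"type": "image"})
        elif segment == "<video>":
            parts.append({"type": "video"})
        elif segment:  # skip empty strings produced by re.split
            parts.append({"type": "text", "text": segment})
    return parts


message = {"role": "user", "content": "<image>\nDescribe this image."}
message["content"] = split_placeholders(message["content"])
print(message)
```

Text segments are preserved verbatim, including the leading newline, which matches the before/after example in the comments.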