Implementation:Volcengine Verl Multimodal Rollout Request

Field	Value
Knowledge Sources	verl source code, rollout schemas module
Domains	Multimodal Rollout, VLM Inference, Request Management
Last Updated	2026-02-07

Overview

Description

AsyncRolloutRequest is a Pydantic BaseModel that encapsulates the complete state of a single rollout request during multi-turn generation. For VLM (Vision-Language Model) scenarios, it carries multimodal data alongside the standard text tokens.

The class manages three distinct multimodal data representations:

multi_modal_data (dict[str, Any]) -- Raw multimodal data organized by modality key (e.g., {"image": [PIL.Image, ...], "video": [...]}). This is the user-facing data format used when constructing messages and applying chat templates.

multi_modal_inputs (dict[str, torch.Tensor]) -- Processed tensor representations of multimodal data (e.g., pixel_values, image_grid_thw) produced by the processor's tokenization. These are the model-ready inputs.

multi_modal_keys (list[str]) -- The modality types present in this request. Defaults to ["image", "video"] if not specified.

The request consumes messages in the format produced by RLHFDataset._build_messages, which replaces <image> and <video> placeholder strings in message content with structured content dicts such as {"type": "image"} that the processor understands.
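
The placeholder replacement described above can be sketched in a few lines of standalone Python. This is a minimal illustration, not the verl implementation: the helper name build_structured_content is hypothetical, and it assumes the placeholders are exactly the literal strings <image> and <video>.

```python
import re


def build_structured_content(content: str) -> list[dict]:
    """Split a message string on <image>/<video> placeholders.

    Hypothetical helper for illustration; returns a list of structured
    content parts in the order they appear in the input string.
    """
    parts = []
    # The capturing group makes re.split keep the placeholders in the result.
    for segment in re.split(r"(<image>|<video>)", content):
        if segment == "":
            continue  # re.split yields empty strings at the boundaries
        if segment == "<image>":
            parts.append({"type": "image"})
        elif segment == "<video>":
            parts.append({"type": "video"})
        else:
            parts.append({"type": "text", "text": segment})
    return parts


message = {"role": "user", "content": "<image>\nDescribe this image."}
message["content"] = build_structured_content(message["content"])
print(message)
```

Splitting with a capturing group preserves the relative order of text and media, so the processor can line up each {"type": "image"} entry with the corresponding item in multi_modal_data["image"].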

During multi-turn rollout, when tools return images, the add_tool_response_messages method updates both multi_modal_data and multi_modal_inputs incrementally, maintaining consistency across the growing conversation.
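
The incremental bookkeeping for the raw data can be pictured as appending each tool's new items to the per-modality lists. The sketch below covers only multi_modal_data and uses plain strings in place of images; the function name merge_multi_modal_data is hypothetical, and how the real method updates the tensor-valued multi_modal_inputs is not shown here.

```python
from typing import Any, Optional

DEFAULT_KEYS = ["image", "video"]  # mirrors the documented default for multi_modal_keys


def merge_multi_modal_data(
    existing: dict[str, list[Any]],
    new: dict[str, list[Any]],
    keys: Optional[list[str]] = None,
) -> dict[str, list[Any]]:
    """Return a new dict with the items in `new` appended per modality.

    Hypothetical helper for illustration; the inputs are left unmodified.
    """
    keys = keys or DEFAULT_KEYS
    # Copy the existing lists so earlier turns are never mutated in place.
    merged = {key: list(existing.get(key, [])) for key in keys}
    for key in keys:
        merged[key].extend(new.get(key, []))
    return merged


turn_0 = {"image": ["prompt_image"], "video": []}
tool_output = {"image": ["rendered_diagram"]}
turn_1 = merge_multi_modal_data(turn_0, tool_output)
print(turn_1)
```

Appending rather than replacing keeps the item order aligned with the order of image placeholders in the growing conversation.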

Usage

AsyncRolloutRequest instances are created by the rollout manager for each prompt in a batch. The model validator (initialize_request) handles tokenization, position ID computation, and multimodal input processing automatically at construction time.

Code Reference

Field	Value
Source Location	`verl/workers/rollout/schemas.py`, Lines 81-673
Class	`AsyncRolloutRequest(BaseModel)`
Import	`from verl.workers.rollout.schemas import AsyncRolloutRequest`
Related	`verl/utils/dataset/rl_dataset.py`, method `RLHFDataset._build_messages` (Lines 285-340)

I/O Contract

Inputs (Construction Fields)

Field	Type	Default	Description
`request_id`	`str`	required	Unique identifier for this rollout request.
`state`	`AsyncRolloutRequestStateEnum`	required	Current request state (PENDING, RUNNING, COMPLETED, FAILED, TOOL_CALLING, INTERACTING).
`messages`	`list[Message]`	required	Chat messages in OpenAI format.
`multi_modal_keys`	`Optional[list[str]]`	`["image", "video"]`	Modality types present in this request.
`multi_modal_data`	`Optional[dict[str, Any]]`	`{}`	Raw multimodal data by modality key.
`multi_modal_inputs`	`Optional[dict[str, torch.Tensor]]`	`{}`	Processed tensor inputs from the processor.
`tool_schemas`	`Optional[list[OpenAIFunctionToolSchema]]`	`None`	Tool schemas for tool-calling rollouts.
`max_prompt_len`	`int`	required	Maximum prompt length for truncation.
`max_response_len`	`int`	`8192`	Maximum response length.
`max_model_len`	`int`	`32768`	Maximum total sequence length.
`processing_class`	`PreTrainedTokenizer or ProcessorMixin`	required	Tokenizer or processor for chat template application (consumed during validation, not stored).

Outputs (Populated by Validator)

Field	Type	Description
`input_ids`	`torch.Tensor`	Tokenized input IDs with generation prompt appended.
`attention_mask`	`torch.Tensor`	Attention mask for the input sequence.
`position_ids`	`torch.Tensor`	Position IDs (with special handling for Qwen2-VL, which uses 3D RoPE positions).
`loss_mask`	`torch.Tensor`	Binary mask indicating which tokens contribute to the loss.
`multi_modal_inputs`	`dict[str, torch.Tensor]`	Processed multimodal tensors (e.g., `pixel_values`, `image_grid_thw`).
`prompt_ids`	`torch.Tensor`	Copy of input_ids for the prompt portion.
`generation_prompt_ids`	`torch.Tensor`	Token IDs for the generation prompt suffix.
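
To make the loss_mask field concrete, here is a small sketch of a binary mask over a multi-turn token sequence. It assumes the common convention that only assistant-generated tokens contribute to the loss while prompt and tool-response tokens are masked out; the source does not spell out the exact rule, and the helper build_loss_mask and the plain-list representation are hypothetical.

```python
def build_loss_mask(segments: list[tuple[str, int]]) -> list[int]:
    """Build a 0/1 mask from (role, token_count) segments.

    Assumption for illustration: tokens from the "assistant" role get 1,
    every other role (user, system, tool) gets 0.
    """
    mask: list[int] = []
    for role, length in segments:
        mask.extend([1 if role == "assistant" else 0] * length)
    return mask


# A prompt of 3 tokens, a 2-token model reply, a 2-token tool response,
# and a final 1-token model reply.
segments = [("user", 3), ("assistant", 2), ("tool", 2), ("assistant", 1)]
print(build_loss_mask(segments))
```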

Usage Examples

Creating a multimodal rollout request:

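from PIL import Image
from transformers import AutoProcessor

from verl.workers.rollout.schemas import AsyncRolloutRequest, AsyncRolloutRequestStateEnum

# Load a VLM processor; the checkpoint name is only an example.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Load an image for a VLM rollout
image = Image.open("geometry_diagram.png")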

request = AsyncRolloutRequest(
    request_id="req_001",
    state=AsyncRolloutRequestStateEnum.PENDING,
    messages=[
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the area of this triangle?"},
        ]},
    ],
    multi_modal_data={"image": [image], "video": []},
    max_prompt_len=2048,
    max_response_len=4096,
    processing_class=processor,  # VLM processor (e.g., Qwen2VLProcessor)
    use_inference_chat_template=False,
    tokenization_sanity_check_mode="strict",
)

# After validation, multi_modal_inputs contains processed tensors
print(f"Input shape: {request.input_ids.shape}")
print(f"Multi-modal keys: {list(request.multi_modal_inputs.keys())}")
# e.g., ['pixel_values', 'image_grid_thw']

Adding tool responses with images during multi-turn rollout:

from verl.tools.schemas import ToolResponse

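# Continues from the previous example: `request` and `processor` are already defined.
# `rendered_image` is assumed to be a PIL.Image produced by the tool call.
# Tool returns an image as part of its response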
tool_response = ToolResponse(
    text="Here is the rendered diagram:",
    image=[rendered_image],  # PIL Image from tool execution
)

# This updates multi_modal_data, multi_modal_inputs, input_ids, etc.
request.add_tool_response_messages(
    processing_class=processor,
    contents=[tool_response],
)

RLHFDataset._build_messages replacing placeholders:

# In RLHFDataset._build_messages(example):
# Input message: {"role": "user", "content": "<image>\nDescribe this image."}
# After processing: {"role": "user", "content": [
#     {"type": "image"},
#     {"type": "text", "text": "\nDescribe this image."}
# ]}
# The <image> placeholder is replaced with a structured image content dict

Related Pages

Principle:Volcengine_Verl_Multimodal_Rollout_Generation
