
Implementation:Volcengine Verl Multimodal Rollout Request

From Leeroopedia


Field | Value
Knowledge Sources | verl source code, rollout schemas module
Domains | Multimodal Rollout, VLM Inference, Request Management
Last Updated | 2026-02-07

Overview

Description

AsyncRolloutRequest is a Pydantic BaseModel that encapsulates the complete state of a single rollout request during multi-turn generation. For VLM (Vision-Language Model) scenarios, it carries multimodal data alongside the standard text tokens.

The class manages three distinct multimodal data representations:

  • multi_modal_data (dict[str, Any]) -- Raw multimodal data organized by modality key (e.g., {"image": [PIL.Image, ...], "video": [...]}). This is the user-facing data format used when constructing messages and applying chat templates.
  • multi_modal_inputs (dict[str, torch.Tensor]) -- Processed tensor representations of multimodal data (e.g., pixel_values, image_grid_thw) produced when the processor encodes the request. These are the model-ready inputs.
  • multi_modal_keys (list[str]) -- The modality types present in this request. Defaults to ["image", "video"] if not specified.

The request also relies on the placeholder-replacement pattern implemented in RLHFDataset._build_messages, which replaces <image> and <video> placeholder strings in message content with structured content dicts such as {"type": "image"} that the processor understands.
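The placeholder replacement can be illustrated with a minimal, standalone sketch. It is not the verl implementation: the helper name build_content and the regex-based splitting are assumptions made for illustration, and only the <image> and <video> placeholders and the output dict shapes come from this page.

```python
import re

# Placeholders named on this page; the splitting strategy itself is an assumption.
_PLACEHOLDER = re.compile(r"(<image>|<video>)")


def build_content(text: str) -> list[dict]:
    """Split a string on <image>/<video> placeholders into structured content parts."""
    parts = []
    for segment in _PLACEHOLDER.split(text):
        if segment == "<image>":
            parts.append({"type": "image"})
        elif segment == "<video>":
            parts.append({"type": "video"})
        elif segment:  # skip the empty strings that re.split can produce
            parts.append({"type": "text", "text": segment})
    return parts


message = {"role": "user", "content": "<image>\nDescribe this image."}
structured = {**message, "content": build_content(message["content"])}
print(structured["content"])
```

Keeping the capture group in the pattern makes re.split return the placeholders themselves, so the original ordering of images, videos, and text is preserved.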

During multi-turn rollout, when tools return images, the add_tool_response_messages method updates both multi_modal_data and multi_modal_inputs incrementally, maintaining consistency across the growing conversation.
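The incremental update of multi_modal_data can be pictured with a small sketch. The helper merge_multi_modal_data is hypothetical and plain strings stand in for PIL images. How verl extends the tensors in multi_modal_inputs is not shown on this page, so the sketch covers only the raw-data side.

```python
from typing import Any


def merge_multi_modal_data(
    current: dict[str, list[Any]],
    delta: dict[str, list[Any]],
    keys: tuple[str, ...] = ("image", "video"),
) -> dict[str, list[Any]]:
    """Return a new dict with the items in `delta` appended per modality key."""
    # Copy the existing lists so the caller's dict is not mutated.
    merged = {key: list(current.get(key, [])) for key in keys}
    for key in keys:
        merged[key].extend(delta.get(key, []))
    return merged


# Strings stand in for PIL images in this sketch.
before = {"image": ["img_turn_0"], "video": []}
after = merge_multi_modal_data(before, {"image": ["img_from_tool"]})
print(after)
```

Appending rather than replacing keeps the order of images aligned with the order of image placeholders in the growing conversation.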

Usage

AsyncRolloutRequest instances are created by the rollout manager for each prompt in a batch. The model validator (initialize_request) handles tokenization, position ID computation, and multimodal input processing automatically at construction time.
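The idea of a construction-time argument that fills derived fields and is then discarded can be sketched with the standard library. This swaps Pydantic's model validator for a dataclass with InitVar, so it is an analogy rather than the verl code. RequestSketch, tokenize, and toy_tokenize are hypothetical names.

```python
from dataclasses import InitVar, dataclass, field
from typing import Callable


@dataclass
class RequestSketch:
    request_id: str
    prompt: str
    # Hypothetical stand-in for processing_class: passed at construction,
    # consumed in __post_init__, and never stored on the instance.
    tokenize: InitVar[Callable[[str], list[int]]]
    input_ids: list[int] = field(init=False)
    attention_mask: list[int] = field(init=False)
    position_ids: list[int] = field(init=False)

    def __post_init__(self, tokenize: Callable[[str], list[int]]) -> None:
        self.input_ids = tokenize(self.prompt)
        self.attention_mask = [1] * len(self.input_ids)
        self.position_ids = list(range(len(self.input_ids)))


def toy_tokenize(text: str) -> list[int]:
    """Toy tokenizer: one 'token id' per word, equal to the word length."""
    return [len(word) for word in text.split()]


req = RequestSketch("req_001", "what is the area", toy_tokenize)
print(req.input_ids, req.position_ids, hasattr(req, "tokenize"))
```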

Code Reference

Field | Value
Source Location | verl/workers/rollout/schemas.py, Lines 81-673
Class | AsyncRolloutRequest(BaseModel)
Import | from verl.workers.rollout.schemas import AsyncRolloutRequest
Related | verl/utils/dataset/rl_dataset.py, method RLHFDataset._build_messages (Lines 285-340)

I/O Contract

Inputs (Construction Fields)

Field | Type | Default | Description
request_id | str | required | Unique identifier for this rollout request.
state | AsyncRolloutRequestStateEnum | required | Current request state (PENDING, RUNNING, COMPLETED, FAILED, TOOL_CALLING, INTERACTING).
messages | list[Message] | required | Chat messages in OpenAI format.
multi_modal_keys | Optional[list[str]] | ["image", "video"] | Modality types present in this request.
multi_modal_data | Optional[dict[str, Any]] | {} | Raw multimodal data by modality key.
multi_modal_inputs | Optional[dict[str, torch.Tensor]] | {} | Processed tensor inputs from the processor.
tool_schemas | Optional[list[OpenAIFunctionToolSchema]] | None | Tool schemas for tool-calling rollouts.
max_prompt_len | int | required | Maximum prompt length for truncation.
max_response_len | int | 8192 | Maximum response length.
max_model_len | int | 32768 | Maximum total sequence length.
processing_class | PreTrainedTokenizer or ProcessorMixin | required | Tokenizer or processor for chat template application (consumed during validation, not stored).

Outputs (Populated by Validator)

Field | Type | Description
input_ids | torch.Tensor | Tokenized input IDs with generation prompt appended.
attention_mask | torch.Tensor | Attention mask for the input sequence.
position_ids | torch.Tensor | Position IDs (special handling for Qwen2-VL with 3D rope positions).
loss_mask | torch.Tensor | Binary mask indicating which tokens contribute to the loss.
multi_modal_inputs | dict[str, torch.Tensor] | Processed multimodal tensors (e.g., pixel_values, image_grid_thw).
prompt_ids | torch.Tensor | Copy of input_ids for the prompt portion.
generation_prompt_ids | torch.Tensor | Token IDs for the generation prompt suffix.
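As a rough illustration of what a binary loss_mask looks like over a multi-turn sequence, the sketch below marks assistant tokens with 1 and all other tokens with 0. That masking rule is an assumption common in RL rollouts, not something this page confirms for verl, and build_loss_mask is a hypothetical helper operating on plain lists instead of tensors.

```python
def build_loss_mask(segments: list[tuple[str, int]]) -> list[int]:
    """Build a binary mask from (role, token_count) pairs in conversation order.

    Assumption: only assistant-generated tokens contribute to the loss.
    """
    mask: list[int] = []
    for role, n_tokens in segments:
        mask.extend([1 if role == "assistant" else 0] * n_tokens)
    return mask


mask = build_loss_mask([("user", 4), ("assistant", 3), ("tool", 2), ("assistant", 1)])
print(mask)
```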

Usage Examples

Creating a multimodal rollout request:

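from PIL import Image
from transformers import AutoProcessor

from verl.workers.rollout.schemas import AsyncRolloutRequest, AsyncRolloutRequestStateEnum

# Assumption: any VLM processor can be used; this checkpoint name is only an example.
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Load an image for a VLM rollout
image = Image.open("geometry_diagram.png")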

request = AsyncRolloutRequest(
    request_id="req_001",
    state=AsyncRolloutRequestStateEnum.PENDING,
    messages=[
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is the area of this triangle?"},
        ]},
    ],
    multi_modal_data={"image": [image], "video": []},
    max_prompt_len=2048,
    max_response_len=4096,
    processing_class=processor,  # VLM processor (e.g., Qwen2VLProcessor)
    use_inference_chat_template=False,
    tokenization_sanity_check_mode="strict",
)

# After validation, multi_modal_inputs contains processed tensors
print(f"Input shape: {request.input_ids.shape}")
print(f"Multi-modal keys: {list(request.multi_modal_inputs.keys())}")
# e.g., ['pixel_values', 'image_grid_thw']

Adding tool responses with images during multi-turn rollout:

from verl.tools.schemas import ToolResponse

# Continues the previous example: `request` and `processor` are already defined.
# `rendered_image` stands for a PIL Image produced by the tool execution.
tool_response = ToolResponse(
    text="Here is the rendered diagram:",
    image=[rendered_image],
)

# This updates multi_modal_data, multi_modal_inputs, input_ids, etc.
request.add_tool_response_messages(
    processing_class=processor,
    contents=[tool_response],
)

RLHFDataset._build_messages replacing placeholders:

# In RLHFDataset._build_messages(example):
# Input message: {"role": "user", "content": "<image>\nDescribe this image."}
# After processing: {"role": "user", "content": [
#     {"type": "image"},
#     {"type": "text", "text": "\nDescribe this image."}
# ]}
# The <image> placeholder is replaced with a structured image content dict
