Principle:Volcengine Verl Multimodal Rollout Generation
| Knowledge Sources | |
|---|---|
| Domains | Inference, Vision_Language_Models, Distributed_Systems |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The process of generating text completions from a vision-language model given both image and text inputs, using multimodal-capable inference engines.
Description
Multimodal Rollout Generation extends standard text-only rollout to handle vision-language inputs. The key differences from text-only rollout:
- Image preprocessing: Images are processed into pixel values and positional metadata (e.g., the image_grid_thw grid shape for Qwen2.5-VL)
- Multimodal data passing: The inference engine receives both text tokens and image tensors
- 3D position IDs: VLMs such as Qwen2.5-VL use 3D RoPE position embeddings that account for image spatial dimensions
- Memory management: Images consume significant GPU memory, requiring careful gpu_memory_utilization tuning
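The 3D position ID idea can be illustrated with a small sketch. This is a simplified, assumption-laden version rather than the Qwen2.5-VL implementation: the function name `mrope_position_ids` and the `segments` input format are hypothetical, grids are given in language-model token units, and details such as temporal scaling for video are ignored. Text tokens share one index across all three axes, while image tokens take their temporal, row, and column index offset by the position where the image starts.

```python
def mrope_position_ids(segments):
    """Build (temporal, height, width) position IDs for a mixed sequence.

    segments: list of ("text", n_tokens) or ("image", (t, h, w)) entries,
    a hypothetical input format used only for this illustration.
    """
    t_ids, h_ids, w_ids = [], [], []
    next_pos = 0  # first free position index for the upcoming segment
    for kind, spec in segments:
        if kind == "text":
            # Text tokens use the same index on all three axes.
            for i in range(spec):
                t_ids.append(next_pos + i)
                h_ids.append(next_pos + i)
                w_ids.append(next_pos + i)
            next_pos += spec
        elif kind == "image":
            t, h, w = spec
            # Image tokens are laid out frame by frame, row by row.
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        t_ids.append(next_pos + ti)
                        h_ids.append(next_pos + hi)
                        w_ids.append(next_pos + wi)
            # Following tokens resume after the largest index used.
            next_pos += max(t, h, w)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return [t_ids, h_ids, w_ids]


# Two text tokens, one 1x2x2 image, one trailing text token.
position_ids = mrope_position_ids([("text", 2), ("image", (1, 2, 2)), ("text", 1)])
```

Note that the image occupies four tokens but advances the position counter by only two, which is why a 1D position scheme cannot represent the same layout.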
The rollout request schema includes a multi_modal_data field (passed to vLLM for generation) and a multi_modal_inputs field (used in the training forward pass).
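A minimal sketch of such a request is shown below. The `RolloutRequest` dataclass and the `patch_count_matches` helper are hypothetical stand-ins, not the schema defined in verl, and plain Python lists stand in for tensors. The check assumes the Qwen2-VL-style layout in which `pixel_values` holds one row per patch, so the row count should equal the sum of `t * h * w` over `image_grid_thw`.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RolloutRequest:
    """Hypothetical request carrying both multimodal fields."""

    prompt_ids: list
    # Consumed by the inference engine during generation.
    multi_modal_data: dict = field(default_factory=dict)
    # Consumed by the training forward pass.
    multi_modal_inputs: dict = field(default_factory=dict)

    def has_images(self) -> bool:
        return bool(self.multi_modal_data.get("image"))


def patch_count_matches(request: RolloutRequest) -> bool:
    """Check that pixel_values rows agree with image_grid_thw (assumed layout)."""
    inputs: dict[str, Any] = request.multi_modal_inputs
    expected = sum(t * h * w for t, h, w in inputs["image_grid_thw"])
    return len(inputs["pixel_values"]) == expected


# Two images, each with a 1x2x2 patch grid, so 8 patch rows in total.
request = RolloutRequest(
    prompt_ids=[101, 102, 103],
    multi_modal_data={"image": ["<image 0>", "<image 1>"]},  # placeholders
    multi_modal_inputs={
        "pixel_values": [[0.0] * 4 for _ in range(8)],
        "image_grid_thw": [[1, 2, 2], [1, 2, 2]],
    },
)
```

Running a consistency check like this before dispatch can surface mismatched preprocessing early, instead of as a shape error inside the model.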
Usage
Use multimodal rollout generation when running RL training with vision-language models. It requires:
- A multimodal-capable inference engine (vLLM or SGLang with VLM support)
- Data prepared with image columns
- Sufficient GPU memory for image processing
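To get a rough sense of how much sequence length an image consumes, the sketch below estimates the patch grid and the resulting visual token count. It assumes a 14-pixel patch and a 2x2 spatial merge, values commonly associated with Qwen2.5-VL but treated here as assumptions. The helper names are hypothetical, and the rounding is a simplification: a real processor may also clamp the total pixel count.

```python
PATCH_SIZE = 14  # assumed patch edge length in pixels
MERGE_SIZE = 2   # assumed spatial merge: 2x2 patches become one LLM token


def estimate_image_grid(height, width, temporal=1):
    """Approximate (t, h, w) patch grid after rounding to the merge unit."""
    factor = PATCH_SIZE * MERGE_SIZE
    # Round each side to a multiple of factor, keeping at least one merged cell.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    return (temporal, h // PATCH_SIZE, w // PATCH_SIZE)


def visual_token_count(grid_thw):
    """Number of LLM tokens an image contributes after spatial merging."""
    t, h, w = grid_thw
    return t * h * w // (MERGE_SIZE ** 2)


grid = estimate_image_grid(448, 672)
tokens = visual_token_count(grid)
print(grid, tokens)  # (1, 32, 48) 384
```

Summing this estimate over the images in a batch gives a first-order view of the extra prompt length, which is one input to choosing gpu_memory_utilization and batch sizes.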
Theoretical Basis
Multimodal rollout extends text rollout with vision encoding:
# Abstract multimodal rollout
for batch in data:
    # Extract images and process into model format
    images = batch["images"]
    pixel_values, image_grid_thw = process_images(images)

    # Build rollout request with multimodal data
    request = RolloutRequest(
        prompt_ids=batch["input_ids"],
        multi_modal_data={"image": pixel_values},
        multi_modal_inputs={
            "pixel_values": pixel_values,
            "image_grid_thw": image_grid_thw,
        },
    )

    # Generate responses
    responses = rollout_engine.generate(request)