Principle:Volcengine Verl Multimodal Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Vision_Language_Models, Multimodal |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The process of preparing datasets with images and text for vision-language model RL training, including image loading, chat template formatting, and parquet serialization with image columns.
Description
Multimodal Data Preparation extends standard RL data preparation to handle datasets containing images alongside text. The key additions:
- Image column: Parquet files include a column of PIL Image objects
- Image placeholders: Chat messages use <image> placeholders that are replaced with actual image data during tokenization
- VLM-specific formatting: Prompts must follow the VLM model's expected format (e.g., Qwen2.5-VL uses specific image tokens)
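The placeholder handling described in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's actual implementation: the helper name `split_image_placeholders` is hypothetical, and the structured-content layout (`{"type": "image"}` and `{"type": "text", ...}` entries) is assumed here because many VLM chat templates accept messages in that shape.

```python
import re

IMAGE_TOKEN = "<image>"  # placeholder string used inside the prompt text


def split_image_placeholders(content):
    """Split a prompt string into ordered text and image segments.

    Hypothetical helper: the returned layout is an assumption, not a
    confirmed library format.
    """
    # The capturing group keeps the placeholder itself in the split result.
    segments = re.split("(" + re.escape(IMAGE_TOKEN) + ")", content)
    parts = []
    for segment in segments:
        if segment == "":
            continue  # re.split yields empty strings at the string edges
        if segment == IMAGE_TOKEN:
            parts.append({"type": "image"})
        else:
            parts.append({"type": "text", "text": segment})
    return parts


parts = split_image_placeholders("<image>\nFind x.")
```

Keeping the segments in order matters because the position of each image entry tells the downstream processor where the image tokens belong relative to the surrounding text.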
Example datasets include:
- Geo3K: Geometry problems with diagram images
- Pokemon: Image captioning with visual question answering
Usage
Use multimodal data preparation when training vision-language models on tasks that require both visual and textual understanding. The output parquet files must include an images column holding each row's images, which are PIL Image objects in memory before they are serialized.
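A quick consistency check before writing the parquet file can catch rows whose placeholders and images disagree. The sketch below assumes the row layout used on this page (a `prompt` list of chat messages with string content, plus an `images` list). The function names `count_placeholders` and `check_row` are hypothetical, and the rule that each `<image>` placeholder corresponds to exactly one image is an assumption about the downstream processor.

```python
def count_placeholders(prompt, token="<image>"):
    """Count image placeholders across all messages in a chat prompt."""
    return sum(message["content"].count(token) for message in prompt)


def check_row(row):
    """Return True if the row is consistent, otherwise raise ValueError.

    Hypothetical check: assumes one image per <image> placeholder.
    """
    if "images" not in row:
        raise ValueError("row has no 'images' column")
    expected = count_placeholders(row["prompt"])
    actual = len(row["images"])
    if expected != actual:
        raise ValueError(
            f"{expected} placeholder(s) in prompt but {actual} image(s) in row"
        )
    return True


# A plain object stands in for a PIL Image so the sketch has no dependencies.
example_row = {
    "prompt": [{"role": "user", "content": "<image>\nFind x."}],
    "images": [object()],
}
row_ok = check_row(example_row)
```

Running a check like this over every row before serialization tends to surface mismatches at preparation time rather than in the middle of a training run.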
Theoretical Basis
Multimodal data preparation adds image handling to the standard pipeline:
# Abstract multimodal data preparation
for row in dataset:
    image = row["image"]  # PIL.Image
    prompt = [
        {"role": "user", "content": "<image>\n" + row["question"]}
    ]
    output_row = {
        "data_source": "geo3k",
        "prompt": prompt,
        "images": [image],  # PIL Image objects
        "ability": "geometry",
        "reward_model": {"style": "rule", "ground_truth": row["answer"]},
    }