Principle:Volcengine Verl Multimodal Data Preparation

Knowledge Sources	verl, verl VLM Documentation
Domains	Data_Engineering, Vision_Language_Models, Multimodal
Last Updated	2026-02-07 14:00 GMT

Overview

Multimodal data preparation is the process of preparing image-and-text datasets for reinforcement learning (RL) training of vision-language models (VLMs). It covers image loading, chat template formatting, and parquet serialization with image columns.

Description

Multimodal data preparation extends standard RL data preparation to handle datasets that contain images alongside text. The key additions are:

Image column: Parquet files include a column of PIL Image objects
Image placeholders: Chat messages use <image> placeholders that are replaced with actual image data during tokenization
VLM-specific formatting: Prompts must follow the VLM model's expected format (e.g., Qwen2.5-VL uses specific image tokens)
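
To make the placeholder idea concrete, the sketch below splits a message string at each <image> placeholder into typed segments, which is the general shape many VLM chat templates consume. The function name split_placeholders and the segment dictionaries are illustrative assumptions, not verl's actual API; the real replacement happens inside the training framework's tokenization step.

```python
import re
from typing import Dict, List

IMAGE_PLACEHOLDER = "<image>"


def split_placeholders(content: str) -> List[Dict[str, str]]:
    """Split a message string into typed segments at each <image> placeholder.

    Hypothetical helper for illustration only.
    """
    segments: List[Dict[str, str]] = []
    # The capturing group keeps the placeholder itself in the split result.
    for piece in re.split(r"(<image>)", content):
        if piece == IMAGE_PLACEHOLDER:
            segments.append({"type": "image"})
        elif piece:  # skip empty strings produced at the string boundaries
            segments.append({"type": "text", "text": piece})
    return segments


segments = split_placeholders("<image>\nFind the value of x.")
print(segments)
```

Each image segment marks a position where one entry of the images column would be injected; the text segments are passed through unchanged.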

Example datasets include:

Geo3K: Geometry problems with diagram images
Pokemon: Image captioning with visual question answering

Usage

Use multimodal data preparation when training vision-language models on tasks that require both visual and textual understanding. The output parquet files must include an images column containing PIL Image objects.
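
A quick sanity check before serialization can catch rows that would violate this requirement. The sketch below is a hypothetical helper (check_row is not part of verl) that verifies a prepared row has a non-empty images entry. It also assumes that the number of <image> placeholders in the prompt should equal the number of images, which is a plausible convention but one this page does not state explicitly.

```python
def check_row(row: dict) -> None:
    """Raise ValueError if a prepared row lacks images or placeholders disagree.

    Hypothetical validation helper; the placeholder/image count match is an
    assumption, not a documented verl rule.
    """
    images = row.get("images")
    if not images:
        raise ValueError("row has no 'images' entry")
    # Count <image> placeholders across all chat messages in the prompt.
    n_placeholders = sum(msg["content"].count("<image>") for msg in row["prompt"])
    if n_placeholders != len(images):
        raise ValueError(
            f"{n_placeholders} placeholder(s) but {len(images)} image(s)"
        )


# Any object can stand in for a PIL image here; only the count is checked.
good_row = {
    "prompt": [{"role": "user", "content": "<image>\nFind x."}],
    "images": [object()],
}
check_row(good_row)  # passes silently
```

Running such a check over every row before writing the parquet file surfaces malformed examples early, rather than during training.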

Theoretical Basis

Multimodal data preparation adds image handling to the standard pipeline:

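# Abstract multimodal data preparation (sketch with a one-row stand-in dataset)
from PIL import Image

# Stand-in for a loaded dataset; each row holds an image, a question, and an answer
dataset = [{"image": Image.new("RGB", (8, 8)), "question": "Find x.", "answer": "3"}]

output_rows = []  # collect converted rows so they can be serialized later
for row in dataset:
    image = row["image"]  # PIL.Image
    prompt = [
        # The <image> placeholder marks where the image is injected during tokenization
        {"role": "user", "content": "<image>\n" + row["question"]}
    ]
    output_rows.append({
        "data_source": "geo3k",
        "prompt": prompt,
        "images": [image],     # PIL Image objects
        "ability": "geometry",
        "reward_model": {"style": "rule", "ground_truth": row["answer"]},
    })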

Related Pages

Implemented By

Implementation:Volcengine_Verl_Geo3K_Data_Preprocessing
