Principle:Volcengine Verl Multimodal Data Preparation

From Leeroopedia


Knowledge Sources
Domains Data_Engineering, Vision_Language_Models, Multimodal
Last Updated 2026-02-07 14:00 GMT

Overview

The process of preparing datasets with images and text for vision-language model RL training, including image loading, chat template formatting, and parquet serialization with image columns.

Description

Multimodal Data Preparation extends standard RL data preparation to handle datasets containing images alongside text. The key additions:

  • Image column: Parquet files include an images column holding each example's images, which are PIL Image objects in memory and are serialized when written to disk
  • Image placeholders: Chat messages use <image> placeholders that mark where an image belongs; they are replaced with the actual image data during tokenization
  • VLM-specific formatting: Prompts must follow the target VLM's expected format (e.g., Qwen2.5-VL uses its own image tokens)
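
As a rough illustration of the placeholder mechanism in the list above, the sketch below splits a message's text on <image> and emits an ordered list of typed segments that a processor could later pair with the entries of the images column. The helper name expand_image_placeholders and the shape of the segment dictionaries are assumptions made for illustration, not a confirmed verl or Qwen2.5-VL API.

```python
import re

IMAGE_PLACEHOLDER = "<image>"


def expand_image_placeholders(content: str) -> list[dict]:
    """Split message text into ordered image and text segments (hypothetical helper)."""
    # The capturing group makes re.split keep each placeholder as its own item.
    parts = re.split(f"({re.escape(IMAGE_PLACEHOLDER)})", content)
    segments = []
    for part in parts:
        if part == "":
            continue  # re.split yields empty strings around leading/trailing matches
        if part == IMAGE_PLACEHOLDER:
            segments.append({"type": "image"})
        else:
            segments.append({"type": "text", "text": part})
    return segments


segments = expand_image_placeholders("<image>\nFind the length of side AB.")
```

Keeping the segments in order matters here: the n-th image segment is assumed to correspond to the n-th entry of the row's images list.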

Example datasets include:

  • Geo3K: Geometry problems with diagram images
  • Pokemon: Image captioning with visual question answering

Usage

Use multimodal data preparation when training vision-language models on tasks that require both visual and textual understanding. The output parquet files must include an images column containing PIL Image objects.
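
One way to catch malformed rows before serialization is a small consistency check, sketched below. It assumes that each <image> placeholder in the prompt corresponds to exactly one entry in the images list, which the example on this page suggests but does not state as a rule. The function names validate_row and count_placeholders are hypothetical, and a plain object stands in for a PIL image so the sketch has no third-party dependencies.

```python
IMAGE_PLACEHOLDER = "<image>"


def count_placeholders(prompt: list[dict]) -> int:
    """Count <image> placeholders across all chat messages."""
    return sum(msg["content"].count(IMAGE_PLACEHOLDER) for msg in prompt)


def validate_row(row: dict) -> None:
    """Raise ValueError if the images column disagrees with the prompt."""
    images = row.get("images")
    if not isinstance(images, list) or not images:
        raise ValueError("row needs a non-empty 'images' list")
    expected = count_placeholders(row["prompt"])
    if expected != len(images):
        raise ValueError(
            f"{expected} placeholder(s) but {len(images)} image(s)"
        )


# Stand-in object (assumption); a real pipeline would hold a PIL image here.
fake_image = object()
row = {
    "prompt": [{"role": "user", "content": "<image>\nWhat is angle ABC?"}],
    "images": [fake_image],
}
validate_row(row)  # a consistent row passes without raising
```

Running such a check over every row before writing the parquet file surfaces mismatches at preparation time rather than during training.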

Theoretical Basis

Multimodal data preparation adds image handling to the standard pipeline:

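# Abstract multimodal data preparation
# `dataset` is any iterable of dict-like rows with "image", "question", "answer"
output_rows = []
for row in dataset:
    image = row["image"]  # PIL.Image
    # The <image> placeholder marks where the image is inserted at tokenization
    prompt = [
        {"role": "user", "content": "<image>\n" + row["question"]}
    ]
    output_rows.append({
        "data_source": "geo3k",
        "prompt": prompt,
        "images": [image],     # list of PIL Image objects
        "ability": "geometry",
        "reward_model": {"style": "rule", "ground_truth": row["answer"]},
    })
# output_rows is then serialized to parquet, including the images column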
