
Principle:Microsoft DeepSpeedExamples Multi Dataset VQA Preparation

From Leeroopedia


Metadata

  • Page Type: Principle
  • Title: Multi_Dataset_VQA_Preparation
  • Sources: Paper: LLaVA (https://arxiv.org/abs/2304.08485); Paper: MIMIC-IT (https://arxiv.org/abs/2306.05425)
  • Domains: Multimodal, Data_Processing, Computer_Vision
  • Repository: Microsoft/DeepSpeedExamples
  • Application: DeepSpeed-VisualChat
  • Status: Active

Overview

A data preparation technique that unifies 13 visual question answering (VQA) datasets into a common format supporting multi-image, multi-round conversations.

Description

Training a multimodal chat model requires diverse visual data sources to develop broad capabilities across different visual understanding tasks. DeepSpeed-VisualChat supports 13 distinct dataset sources, each with its own format, annotation schema, and image organization. The data preparation pipeline unifies these disparate sources into a common format using a builder pattern with a shared base class (VQADataset).
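The builder pattern can be sketched as a name-to-class registry over a shared base class. Only the VQADataset base-class idea comes from the source; register(), build_dataset(), and the toy annotations below are illustrative, not the repo's API:

```python
# Minimal sketch of a dataset-builder registry; names other than
# VQADataset are hypothetical stand-ins for the repo's actual helpers.
DATASET_REGISTRY = {}

def register(name):
    """Class decorator that records a dataset class under a string key."""
    def deco(cls):
        DATASET_REGISTRY[name] = cls
        return cls
    return deco

class VQADataset:
    """Shared base class: subclasses only implement annotation loading."""
    def __init__(self, data_path):
        self.annotations = self.load_annotations(data_path)

    def load_annotations(self, data_path):
        raise NotImplementedError

    def __len__(self):
        return len(self.annotations)

@register("llava")
class LlavaDataset(VQADataset):
    def load_annotations(self, data_path):
        # The real class reads annotation JSON files under data_path;
        # a hard-coded toy annotation stands in here.
        return [{"question": "What is this?", "answer": "A cat."}]

def build_dataset(name, data_path):
    """Builder entry point: map a --dataset_names entry to its class."""
    return DATASET_REGISTRY[name](data_path)
```

Registering each dataset under its command-line name is one way to let --dataset_names entries select classes without a long if/elif chain.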

Supported Dataset Sources

Dataset Name       | Class                     | Description                                     | Multi-Image?
aokvqa             | AOKVQADataset             | Augmented OK-VQA with rationale-based visual QA | No
coco_caption       | COCOCaptionDataset        | COCO image captioning                           | No
llava              | LlavaDataset              | LLaVA instruction-following data                | No
llava_dial         | DialDataset               | LLaVA multi-turn dialogue data                  | No
llava_otter_blend  | LlavaOtterBlendDataset    | Blended LLaVA and Otter data                    | No
minigpt4           | CcSbuAlignDataset         | CC-SBU alignment data (MiniGPT-4 style)         | No
ocr_vqa            | OCRVQADataset             | OCR-based visual question answering             | No
otter_mimicit_cgd  | OtterMimicitCgdDataset    | MIMIC-IT CGD split (change, goal, difference)   | Yes
otter_mimicit_sd   | OtterMimicitSdDataset     | MIMIC-IT scene description                      | Yes
otter_mimicit_sn   | OtterMimicitSnDataset     | MIMIC-IT scene navigation                       | Yes
otter_mimicit_tvc  | OtterMimicitTvcDataset    | MIMIC-IT TV caption                             | Yes
otter_mimicit_vst  | OtterMimicitVstDataset    | MIMIC-IT visual storytelling                    | Yes
sparkles_dialogue  | SparklesDialogueDataset   | Sparkles multi-image dialogue                   | Yes

Unified Processing Pipeline

Each dataset goes through the same pipeline:

  • Image loading and preprocessing -- Images are loaded from disk, converted to RGB, and processed through the CLIP image processor (resizing, normalization, etc.)
  • Conversation formatting with DST -- Questions and answers are wrapped in the DeepSpeed Template (DST) format using special tokens
  • Multi-image grouping -- Multiple QA pairs can be randomly grouped into a single sample, with numbered image markers (### Image 1:, ### Image 2:, etc.)
  • Tokenization -- The formatted conversation is tokenized, producing input_ids, attention_mask, and labels tensors
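The pipeline steps above can be sketched end-to-end for a single-image sample. The image_processor and tokenizer here are toy callables standing in for the CLIP processor and the real tokenizer, and prepare_sample is an illustrative name, not the repo's function (multi-image grouping, step 3, is omitted):

```python
# Sketch of the unified pipeline for one QA pair; all components are stand-ins.
def prepare_sample(image, question, answer, image_processor, tokenizer):
    pixel_values = image_processor(image)            # 1. load + preprocess image
    text = ("### Image 1:\n<image>\n\n"              # 2. DST-format conversation
            f"### Question:\n{question}\n\n"
            f"### Answer:\n{answer}<endofchunk>")
    input_ids = tokenizer(text)                      # 4. tokenize
    attention_mask = [1] * len(input_ids)
    return {"pixel_values": pixel_values,
            "input_ids": input_ids,
            "attention_mask": attention_mask}

sample = prepare_sample(
    image="raw-image-bytes",
    question="What is shown in this image?",
    answer="A cat on a windowsill.",
    image_processor=lambda img: [[0.0]],        # stand-in for CLIP preprocessing
    tokenizer=lambda s: list(s.encode()),       # stand-in byte-level tokenizer
)
```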

Theoretical Basis

DeepSpeed Template (DST) Format

The DST format structures conversations with explicit role markers and special tokens:

[System Prompt]
You are a helpful language and vision assistant. You are able to understand
the visual content that the user provides, and assist the user with a
variety of tasks using natural language.

### Image 1:
<image>

### Question:
What is shown in this image?

### Answer:
The image shows a cat sitting on a windowsill.<endofchunk>

Key special tokens:

  • <image> -- Placeholder for visual features (replaced at runtime)
  • <answer> -- Marks the beginning of the model's response
  • <endofchunk> -- Marks the end of a response round
  • <im_patch>, <im_start>, <im_end> -- Image boundary markers
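One plausible use of the boundary markers, sketched under the assumption that each <image> placeholder expands at runtime into an <im_start>/<im_end> span of <im_patch> tokens (the patch count below is illustrative; the real value depends on the vision encoder's patch grid):

```python
# Hypothetical expansion of the <image> placeholder into patch-token spans.
def expand_image_tokens(text, num_patches):
    span = "<im_start>" + "<im_patch>" * num_patches + "<im_end>"
    return text.replace("<image>", span)
```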

Multi-Image Grouping

For multi-image training, multiple QA pairs are randomly grouped into a single sample:

Sample with 3 images:
  ### Image 1: <image>  ### Question: Q1  ### Answer: A1<endofchunk>
  ### Image 2: <image>  ### Question: Q2  ### Answer: A2<endofchunk>
  ### Image 3: <image>  ### Question: Q3  ### Answer: A3<endofchunk>

The grouping is controlled by:

  • max_num_image_per_sample -- Maximum images per training sample (default 8)
  • dataset_concatenate_samples -- How many annotations to concatenate per data point

The random_grouping() function in DST performs the grouping using random partition sizes up to the maximum.
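The idea of random partition sizes can be sketched as follows; this mirrors the concept of random_grouping(), not its exact code:

```python
import random

# Sketch: partition a flat list of QA samples into consecutive groups whose
# sizes are drawn uniformly from 1..max_num_image_per_sample.
def random_grouping(samples, max_num_image_per_sample, rng=None):
    rng = rng or random.Random()
    groups, i = [], 0
    while i < len(samples):
        size = rng.randint(1, max_num_image_per_sample)
        groups.append(samples[i:i + size])  # last group may be shorter
        i += size
    return groups
```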

Label Masking Strategy

Labels are constructed so that the model only learns to predict answer tokens:

Tokens:   [system] [### Image 1:] [<image>] [### Question:] [Q tokens] [### Answer:] [A tokens] [<eos>]
Labels:   [-100]   [-100]         [-100]    [-100]          [-100]     [-100]        [A tokens] [<eos>]

The constant DEFAULT_LABEL_PADDING_NUM = -100 matches PyTorch's CrossEntropyLoss ignore index, ensuring that instruction and image tokens do not contribute to the training loss.
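The masking can be sketched on plain token lists; build_labels and answer_start are illustrative names, with only the -100 ignore index taken from the source:

```python
IGNORE_INDEX = -100  # DEFAULT_LABEL_PADDING_NUM; CrossEntropyLoss ignores it

# Sketch: every token before the answer span gets the ignore index, so only
# answer tokens (and the trailing <eos>) contribute to the loss.
def build_labels(input_ids, answer_start):
    """answer_start is the index of the first answer token in input_ids."""
    return [IGNORE_INDEX] * answer_start + list(input_ids[answer_start:])
```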

Dataset Sampling and Concatenation

The builder supports flexible dataset composition:

# Use all samples from llava, 512 samples from coco_caption
--dataset_names llava coco_caption
--dataset_samples all 512

# Concatenate 3 QA pairs per data point for llava
--dataset_concatenate_samples 3

When a dataset's --dataset_samples entry is not "all", a random subset is selected using np.random.choice without replacement. Multiple datasets are then combined using PyTorch's ConcatDataset.
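The subsampling step might look like the following sketch; the function name and default seed are illustrative, while the "all" convention and np.random.choice without replacement come from the source:

```python
import numpy as np

# Sketch of per-dataset subsampling: "all" keeps every annotation; otherwise
# a fixed-size random subset is drawn without replacement.
def subsample(annotations, dataset_sample, seed=1234):
    if dataset_sample == "all":
        return annotations
    rng = np.random.RandomState(seed)
    idx = rng.choice(len(annotations), int(dataset_sample), replace=False)
    return [annotations[i] for i in idx]
```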

Key Considerations

  • Data path structure -- Each dataset class expects a specific directory structure under data_path with images and annotation JSON files.
  • Image count limits -- The system supports up to 8 images per sample (dictated by the image_mapping_dict in DST). The max_num_image_per_sample argument controls the actual limit used during training.
  • Tokenization truncation -- Individual QA pairs are truncated to 512 tokens during tokenization. The overall sequence length is controlled by max_seq_len (default 4096) during data collation.
  • Data collation -- The DataCollatorPadToMaxLen pads all samples in a batch to the maximum sequence length, handling the interleaved image and text tokens.
  • Debug support -- When data_debug_path is provided, the first 10 training samples (images and text) are saved for manual inspection.
  • Reproducibility -- The dataset is shuffled with a seeded RNG (np.random.RandomState(seed=args.seed)) and split into train/eval sets using a configurable ratio.
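The pad-to-max-length collation described above can be sketched on plain Python lists (the real DataCollatorPadToMaxLen operates on tensors; the pad token id of 0 is an assumption):

```python
# Sketch of pad-to-max-length collation: pad input_ids with a pad token,
# labels with -100 so padding stays out of the loss, and extend the
# attention mask with zeros.
def collate_pad_to_max_len(batch, max_seq_len, pad_token_id=0):
    padded = []
    for input_ids, labels in batch:
        pad = max_seq_len - len(input_ids)
        padded.append({
            "input_ids": input_ids + [pad_token_id] * pad,
            "labels": labels + [-100] * pad,
            "attention_mask": [1] * len(input_ids) + [0] * pad,
        })
    return padded
```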
