Principle:Microsoft DeepSpeedExamples Multi Dataset VQA Preparation
- Principle: Multi_Dataset_VQA_Preparation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Multi_Dataset_VQA_Preparation |
| Sources | Paper: LLaVA (https://arxiv.org/abs/2304.08485), Paper: MIMIC-IT (https://arxiv.org/abs/2306.05425) |
| Domains | Multimodal, Data_Processing, Computer_Vision |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| Status | Active |
Overview
A data preparation technique that unifies 13 visual question answering and captioning datasets into a common format supporting multi-image, multi-round conversations.
Description
Training a multimodal chat model requires diverse visual data sources to develop broad capabilities across different visual understanding tasks. DeepSpeed-VisualChat supports 13 distinct dataset sources, each with its own format, annotation schema, and image organization. The data preparation pipeline unifies these disparate sources into a common format using a builder pattern with a shared base class (VQADataset).
Supported Dataset Sources
| Dataset Name | Class | Description | Multi-Image? |
|---|---|---|---|
| `aokvqa` | `AOKVQADataset` | Augmented OK-VQA with rationale-based visual QA | No |
| `coco_caption` | `COCOCaptionDataset` | COCO image captioning | No |
| `llava` | `LlavaDataset` | LLaVA instruction-following data | No |
| `llava_dial` | `DialDataset` | LLaVA multi-turn dialogue data | No |
| `llava_otter_blend` | `LlavaOtterBlendDataset` | Blended LLaVA and Otter data | No |
| `minigpt4` | `CcSbuAlignDataset` | CC-SBU alignment data (MiniGPT-4 style) | No |
| `ocr_vqa` | `OCRVQADataset` | OCR-based visual question answering | No |
| `otter_mimicit_cgd` | `OtterMimicitCgdDataset` | MIMIC-IT CGD split (change, goal, difference) | Yes |
| `otter_mimicit_sd` | `OtterMimicitSdDataset` | MIMIC-IT scene description | Yes |
| `otter_mimicit_sn` | `OtterMimicitSnDataset` | MIMIC-IT scene navigation | Yes |
| `otter_mimicit_tvc` | `OtterMimicitTvcDataset` | MIMIC-IT TV caption | Yes |
| `otter_mimicit_vst` | `OtterMimicitVstDataset` | MIMIC-IT visual storytelling | Yes |
| `sparkles_dialogue` | `SparklesDialogueDataset` | Sparkles multi-image dialogue | Yes |
Unified Processing Pipeline
Each dataset goes through the same pipeline:
- Image loading and preprocessing -- Images are loaded from disk, converted to RGB, and processed through the CLIP image processor (resizing, normalization, etc.)
- Conversation formatting with DST -- Questions and answers are wrapped in the DeepSpeed Template (DST) format using special tokens
- Multi-image grouping -- Multiple QA pairs can be randomly grouped into a single sample, with numbered image markers (`### Image 1:`, `### Image 2:`, etc.)
- Tokenization -- The formatted conversation is tokenized, producing `input_ids`, `attention_mask`, and `labels` tensors
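The four pipeline steps can be sketched end to end. This is a minimal illustration, not the DeepSpeedExamples implementation: the image processor and tokenizer are stubbed out, and the function names (`preprocess_image`, `format_round`, `build_sample`) are invented for clarity.

```python
# Minimal sketch of the unified per-sample pipeline (illustrative names only;
# the real logic lives in the VQADataset subclasses and DST.py).

def preprocess_image(path):
    # Stand-in for: PIL.Image.open(path).convert("RGB") followed by the
    # CLIP image processor (resize + normalize to a pixel tensor).
    return {"pixel_values": f"<tensor for {path}>"}

def format_round(image_index, question, answer):
    # Wrap one QA pair in the DST conversation format.
    return (f"### Image {image_index}:\n<image>\n"
            f"### Question:\n{question}\n"
            f"### Answer:\n{answer}<endofchunk>")

def build_sample(qa_pairs, tokenizer):
    # Group several QA pairs (each with its own image) into one sample.
    text = "\n".join(
        format_round(i + 1, q, a) for i, (img, q, a) in enumerate(qa_pairs)
    )
    images = [preprocess_image(img) for img, _, _ in qa_pairs]
    input_ids = tokenizer(text)  # stand-in for real subword tokenization
    return {"input_ids": input_ids, "images": images, "text": text}

# Toy tokenizer: one "token id" per whitespace-separated piece.
toy_tokenizer = lambda s: list(range(len(s.split())))

sample = build_sample(
    [("cat.jpg", "What animal is this?", "A cat.")], toy_tokenizer
)
```

The real builder also produces `attention_mask` and `labels` alongside `input_ids`; label construction is covered below under the masking strategy.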
Theoretical Basis
DeepSpeed Template (DST) Format
The DST format structures conversations with explicit role markers and special tokens:
[System Prompt]
You are a helpful language and vision assistant. You are able to understand
the visual content that the user provides, and assist the user with a
variety of tasks using natural language.
### Image 1:
<image>
### Question:
What is shown in this image?
### Answer:
The image shows a cat sitting on a windowsill.<endofchunk>
Key special tokens:
- `<image>` -- Placeholder for visual features (replaced at runtime)
- `<answer>` -- Marks the beginning of the model's response
- `<endofchunk>` -- Marks the end of a response round
- `<im_patch>`, `<im_start>`, `<im_end>` -- Image boundary markers
Multi-Image Grouping
For multi-image training, multiple QA pairs are randomly grouped into a single sample:
Sample with 3 images:
### Image 1: <image> ### Question: Q1 ### Answer: A1<endofchunk>
### Image 2: <image> ### Question: Q2 ### Answer: A2<endofchunk>
### Image 3: <image> ### Question: Q3 ### Answer: A3<endofchunk>
The grouping is controlled by:
- `max_num_image_per_sample` -- Maximum images per training sample (default 8)
- `dataset_concatenate_samples` -- How many annotations to concatenate per data point
The `random_grouping()` function in DST performs the grouping using random partition sizes up to the maximum.
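The partitioning can be sketched as follows. This is an illustrative reimplementation, not the actual `random_grouping()` from DST.py, which may differ in how sizes are drawn:

```python
import random

def random_grouping(items, max_group_size, seed=0):
    # Partition `items` into consecutive groups of random size, each
    # between 1 and max_group_size (sketch of DST's random_grouping()).
    rng = random.Random(seed)
    groups, i = [], 0
    while i < len(items):
        size = rng.randint(1, max_group_size)
        groups.append(items[i:i + size])
        i += size
    return groups

qa_pairs = list(range(10))  # stand-ins for (image, question, answer) records
groups = random_grouping(qa_pairs, max_group_size=8)
```

Every QA pair lands in exactly one group, and no group exceeds the `max_num_image_per_sample` bound, so each group becomes one multi-image training sample.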
Label Masking Strategy
Labels are constructed so that the model only learns to predict answer tokens:
Tokens: [system] [### Image 1:] [<image>] [### Question:] [Q tokens] [### Answer:] [A tokens] [<eos>]
Labels: [-100] [-100] [-100] [-100] [-100] [-100] [A tokens] [<eos>]
The constant DEFAULT_LABEL_PADDING_NUM = -100 matches PyTorch's CrossEntropyLoss ignore index, ensuring that instruction and image tokens do not contribute to the training loss.
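The masking rule can be shown with toy token lists. This is a sketch, not the repository's code; `build_labels` and the span representation are invented for illustration, while `IGNORE_INDEX` mirrors `DEFAULT_LABEL_PADDING_NUM`:

```python
IGNORE_INDEX = -100  # matches DEFAULT_LABEL_PADDING_NUM / CrossEntropyLoss ignore_index

def build_labels(input_ids, answer_spans):
    # Mask every position with -100, then re-expose the answer (and EOS)
    # spans so only those tokens contribute to the loss. answer_spans is
    # a list of (start, end) index pairs into input_ids (end exclusive).
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in answer_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy example: 10 tokens; the answer plus EOS occupies positions 7..9.
ids = list(range(100, 110))
labels = build_labels(ids, [(7, 10)])
```

Because `CrossEntropyLoss(ignore_index=-100)` skips masked positions, the system prompt, image markers, and question tokens receive no gradient signal.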
Dataset Sampling and Concatenation
The builder supports flexible dataset composition:
# Use all samples from llava, 512 samples from coco_caption
--dataset_names llava coco_caption
--dataset_samples all 512
# Concatenate 3 QA pairs per data point for llava
--dataset_concatenate_samples 3
When `dataset_samples` is not `"all"`, a random subset is selected using `np.random.choice` without replacement. Multiple datasets are combined using PyTorch's `ConcatDataset`.
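The composition behavior can be sketched with the standard library. Note the hedges: the real builder draws subsets with `np.random.choice` and combines datasets with `torch.utils.data.ConcatDataset`; here `random.sample` and list concatenation stand in for both, and `select_samples` is an invented helper name:

```python
import random

def select_samples(dataset, how_many, seed=1234):
    # "all" keeps the dataset intact; otherwise draw a random subset
    # without replacement (the real code uses np.random.choice;
    # random.sample is the stdlib analogue).
    if how_many == "all":
        return list(dataset)
    rng = random.Random(seed)
    return rng.sample(list(dataset), int(how_many))

llava = [f"llava_{i}" for i in range(1000)]
coco = [f"coco_{i}" for i in range(600)]

# Mirrors: --dataset_names llava coco_caption --dataset_samples all 512
combined = select_samples(llava, "all") + select_samples(coco, "512")
```

Sampling without replacement guarantees no annotation appears twice within one dataset's contribution to the combined pool.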
Key Considerations
- Data path structure -- Each dataset class expects a specific directory structure under `data_path` with images and annotation JSON files.
- Image count limits -- The system supports up to 8 images per sample (dictated by the `image_mapping_dict` in DST). The `max_num_image_per_sample` argument controls the actual limit used during training.
- Tokenization truncation -- Individual QA pairs are truncated to 512 tokens during tokenization. The overall sequence length is controlled by `max_seq_len` (default 4096) during data collation.
- Data collation -- The `DataCollatorPadToMaxLen` pads all samples in a batch to the maximum sequence length, handling the interleaved image and text tokens.
- Debug support -- When `data_debug_path` is provided, the first 10 training samples (images and text) are saved for manual inspection.
- Reproducibility -- The dataset is shuffled with a seeded RNG (`np.random.RandomState(seed=args.seed)`) and split into train/eval sets using a configurable ratio.
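The seeded shuffle and split can be sketched as below. This substitutes stdlib `random` for the `np.random.RandomState` permutation the repository uses, and `eval_ratio` is an illustrative parameter name for the configurable ratio:

```python
import random

def split_train_eval(dataset, seed, eval_ratio=0.1):
    # Shuffle indices with a seeded RNG so every run (and rank) sees the
    # same order, then carve off a configurable evaluation slice.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    n_eval = int(len(dataset) * eval_ratio)
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]
    return [dataset[i] for i in train_idx], [dataset[i] for i in eval_idx]

data = list(range(100))
train, evals = split_train_eval(data, seed=1234)
```

Seeding the RNG rather than calling the global shuffle makes the split a pure function of `(dataset, seed, eval_ratio)`, which is what makes the train/eval partition reproducible across runs.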
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Build_Dataset -- The concrete dataset builder implementation
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Model_Composition -- The model that consumes the prepared data
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Distributed_Training -- The training loop that uses the prepared datasets