Principle:Microsoft DeepSpeedExamples Multi Dataset VQA Preparation
- Principle: Multi_Dataset_VQA_Preparation
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | Multi_Dataset_VQA_Preparation |
| Sources | Paper: LLaVA (https://arxiv.org/abs/2304.08485), Paper: MIMIC-IT (https://arxiv.org/abs/2306.05425) |
| Domains | Multimodal, Data_Processing, Computer_Vision |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| Status | Active |
Overview
A data preparation technique that unifies 13 visual question answering and captioning datasets into a common format supporting multi-image, multi-round conversations.
Description
Training a multimodal chat model requires diverse visual data sources to develop broad capabilities across different visual understanding tasks. DeepSpeed-VisualChat supports 13 distinct dataset sources, each with its own format, annotation schema, and image organization. The data preparation pipeline unifies these disparate sources into a common format using a builder pattern with a shared base class (VQADataset).
Supported Dataset Sources
| Dataset Name | Class | Description | Multi-Image? |
|---|---|---|---|
| `aokvqa` | `AOKVQADataset` | Augmented OK-VQA with rationale-based visual QA | No |
| `coco_caption` | `COCOCaptionDataset` | COCO image captioning | No |
| `llava` | `LlavaDataset` | LLaVA instruction-following data | No |
| `llava_dial` | `DialDataset` | LLaVA multi-turn dialogue data | No |
| `llava_otter_blend` | `LlavaOtterBlendDataset` | Blended LLaVA and Otter data | No |
| `minigpt4` | `CcSbuAlignDataset` | CC-SBU alignment data (MiniGPT-4 style) | No |
| `ocr_vqa` | `OCRVQADataset` | OCR-based visual question answering | No |
| `otter_mimicit_cgd` | `OtterMimicitCgdDataset` | MIMIC-IT CGD split (change, goal, difference) | Yes |
| `otter_mimicit_sd` | `OtterMimicitSdDataset` | MIMIC-IT scene description | Yes |
| `otter_mimicit_sn` | `OtterMimicitSnDataset` | MIMIC-IT scene navigation | Yes |
| `otter_mimicit_tvc` | `OtterMimicitTvcDataset` | MIMIC-IT TV caption | Yes |
| `otter_mimicit_vst` | `OtterMimicitVstDataset` | MIMIC-IT visual storytelling | Yes |
| `sparkles_dialogue` | `SparklesDialogueDataset` | Sparkles multi-image dialogue | Yes |
Unified Processing Pipeline
Each dataset goes through the same pipeline:
- Image loading and preprocessing -- Images are loaded from disk, converted to RGB, and processed through the CLIP image processor (resizing, normalization, etc.)
- Conversation formatting with DST -- Questions and answers are wrapped in the DeepSpeed Template (DST) format using special tokens
- Multi-image grouping -- Multiple QA pairs can be randomly grouped into a single sample, with numbered image markers (`### Image 1:`, `### Image 2:`, etc.)
- Tokenization -- The formatted conversation is tokenized, producing `input_ids`, `attention_mask`, and `labels` tensors
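The four pipeline steps can be sketched end to end. This is a minimal illustration, not the DeepSpeedExamples implementation: the image processor and tokenizer are stubbed out, and the function names (`preprocess_image`, `format_round`, `build_sample`) are invented for clarity.

```python
# Minimal sketch of the unified per-sample pipeline (illustrative names only;
# the real logic lives in the VQADataset subclasses and DST.py).

def preprocess_image(path):
    # Stand-in for: PIL.Image.open(path).convert("RGB") followed by the
    # CLIP image processor (resize + normalize to a pixel tensor).
    return {"pixel_values": f"<tensor for {path}>"}

def format_round(image_index, question, answer):
    # Wrap one QA pair in the DST conversation format.
    return (f"### Image {image_index}:\n<image>\n"
            f"### Question:\n{question}\n"
            f"### Answer:\n{answer}<endofchunk>")

def build_sample(qa_pairs, tokenizer):
    # Group several QA pairs (each with its own image) into one sample.
    text = "\n".join(
        format_round(i + 1, q, a) for i, (img, q, a) in enumerate(qa_pairs)
    )
    images = [preprocess_image(img) for img, _, _ in qa_pairs]
    input_ids = tokenizer(text)  # stand-in for real subword tokenization
    return {"input_ids": input_ids, "images": images, "text": text}

# Toy tokenizer: one "token id" per whitespace-separated piece.
toy_tokenizer = lambda s: list(range(len(s.split())))

sample = build_sample(
    [("cat.jpg", "What animal is this?", "A cat.")], toy_tokenizer
)
```

The real builder also produces `attention_mask` and `labels` alongside `input_ids`; label construction is covered below under the masking strategy.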
Theoretical Basis
DeepSpeed Template (DST) Format
The DST format structures conversations with explicit role markers and special tokens:
[System Prompt]
You are a helpful language and vision assistant. You are able to understand
the visual content that the user provides, and assist the user with a
variety of tasks using natural language.
### Image 1:
<image>
### Question:
What is shown in this image?
### Answer:
The image shows a cat sitting on a windowsill.<endofchunk>
Key special tokens:
- `<image>` -- Placeholder for visual features (replaced at runtime)
- `<answer>` -- Marks the beginning of the model's response
- `<endofchunk>` -- Marks the end of a response round
- `<im_patch>`, `<im_start>`, `<im_end>` -- Image boundary markers
Multi-Image Grouping
For multi-image training, multiple QA pairs are randomly grouped into a single sample:
Sample with 3 images:
### Image 1: <image> ### Question: Q1 ### Answer: A1<endofchunk>
### Image 2: <image> ### Question: Q2 ### Answer: A2<endofchunk>
### Image 3: <image> ### Question: Q3 ### Answer: A3<endofchunk>
The grouping is controlled by:
- `max_num_image_per_sample` -- Maximum images per training sample (default 8)
- `dataset_concatenate_samples` -- How many annotations to concatenate per data point
The `random_grouping()` function in DST performs the grouping using random partition sizes up to the maximum.
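The partitioning can be sketched as follows. This is an illustrative reimplementation, not the actual `random_grouping()` from DST.py, which may differ in how sizes are drawn:

```python
import random

def random_grouping(items, max_group_size, seed=0):
    # Partition `items` into consecutive groups of random size, each
    # between 1 and max_group_size (sketch of DST's random_grouping()).
    rng = random.Random(seed)
    groups, i = [], 0
    while i < len(items):
        size = rng.randint(1, max_group_size)
        groups.append(items[i:i + size])
        i += size
    return groups

qa_pairs = list(range(10))  # stand-ins for (image, question, answer) records
groups = random_grouping(qa_pairs, max_group_size=8)
```

Every QA pair lands in exactly one group, and no group exceeds the `max_num_image_per_sample` bound, so each group becomes one multi-image training sample.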
Label Masking Strategy
Labels are constructed so that the model only learns to predict answer tokens:
Tokens: [system] [### Image 1:] [<image>] [### Question:] [Q tokens] [### Answer:] [A tokens] [<eos>]
Labels: [-100] [-100] [-100] [-100] [-100] [-100] [A tokens] [<eos>]
The constant DEFAULT_LABEL_PADDING_NUM = -100 matches PyTorch's CrossEntropyLoss ignore index, ensuring that instruction and image tokens do not contribute to the training loss.
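The masking rule can be shown with toy token lists. This is a sketch, not the repository's code; `build_labels` and the span representation are invented for illustration, while `IGNORE_INDEX` mirrors `DEFAULT_LABEL_PADDING_NUM`:

```python
IGNORE_INDEX = -100  # matches DEFAULT_LABEL_PADDING_NUM / CrossEntropyLoss ignore_index

def build_labels(input_ids, answer_spans):
    # Mask every position with -100, then re-expose the answer (and EOS)
    # spans so only those tokens contribute to the loss. answer_spans is
    # a list of (start, end) index pairs into input_ids (end exclusive).
    labels = [IGNORE_INDEX] * len(input_ids)
    for start, end in answer_spans:
        labels[start:end] = input_ids[start:end]
    return labels

# Toy example: 10 tokens; the answer plus EOS occupies positions 7..9.
ids = list(range(100, 110))
labels = build_labels(ids, [(7, 10)])
```

Because `CrossEntropyLoss(ignore_index=-100)` skips masked positions, the system prompt, image markers, and question tokens receive no gradient signal.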
Dataset Sampling and Concatenation
The builder supports flexible dataset composition:
# Use all samples from llava, 512 samples from coco_caption
--dataset_names llava coco_caption
--dataset_samples all 512
# Concatenate 3 QA pairs per data point for llava
--dataset_concatenate_samples 3
When `dataset_samples` is not `"all"`, a random subset is selected using `np.random.choice` without replacement. Multiple datasets are combined using PyTorch's `ConcatDataset`.
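The composition behavior can be sketched with the standard library. Note the hedges: the real builder draws subsets with `np.random.choice` and combines datasets with `torch.utils.data.ConcatDataset`; here `random.sample` and list concatenation stand in for both, and `select_samples` is an invented helper name:

```python
import random

def select_samples(dataset, how_many, seed=1234):
    # "all" keeps the dataset intact; otherwise draw a random subset
    # without replacement (the real code uses np.random.choice;
    # random.sample is the stdlib analogue).
    if how_many == "all":
        return list(dataset)
    rng = random.Random(seed)
    return rng.sample(list(dataset), int(how_many))

llava = [f"llava_{i}" for i in range(1000)]
coco = [f"coco_{i}" for i in range(600)]

# Mirrors: --dataset_names llava coco_caption --dataset_samples all 512
combined = select_samples(llava, "all") + select_samples(coco, "512")
```

Sampling without replacement guarantees no annotation appears twice within one dataset's contribution to the combined pool.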
Key Considerations
- Data path structure -- Each dataset class expects a specific directory structure under `data_path` with images and annotation JSON files.
- Image count limits -- The system supports up to 8 images per sample (dictated by the `image_mapping_dict` in DST). The `max_num_image_per_sample` argument controls the actual limit used during training.
- Tokenization truncation -- Individual QA pairs are truncated to 512 tokens during tokenization. The overall sequence length is controlled by `max_seq_len` (default 4096) during data collation.
- Data collation -- The `DataCollatorPadToMaxLen` pads all samples in a batch to the maximum sequence length, handling the interleaved image and text tokens.
- Debug support -- When `data_debug_path` is provided, the first 10 training samples (images and text) are saved for manual inspection.
- Reproducibility -- The dataset is shuffled with a seeded RNG (`np.random.RandomState(seed=args.seed)`) and split into train/eval sets using a configurable ratio.
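The seeded shuffle and split can be sketched as below. This substitutes stdlib `random` for the `np.random.RandomState` permutation the repository uses, and `eval_ratio` is an illustrative parameter name for the configurable ratio:

```python
import random

def split_train_eval(dataset, seed, eval_ratio=0.1):
    # Shuffle indices with a seeded RNG so every run (and rank) sees the
    # same order, then carve off a configurable evaluation slice.
    indices = list(range(len(dataset)))
    random.Random(seed).shuffle(indices)
    n_eval = int(len(dataset) * eval_ratio)
    eval_idx, train_idx = indices[:n_eval], indices[n_eval:]
    return [dataset[i] for i in train_idx], [dataset[i] for i in eval_idx]

data = list(range(100))
train, evals = split_train_eval(data, seed=1234)
```

Seeding the RNG rather than calling the global shuffle makes the split a pure function of `(dataset, seed, eval_ratio)`, which is what makes the train/eval partition reproducible across runs.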
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Build_Dataset -- The concrete dataset builder implementation
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Model_Composition -- The model that consumes the prepared data
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Distributed_Training -- The training loop that uses the prepared datasets