Principle: LLaVA Multimodal Data Preparation (Haotian Liu)
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Technique for constructing multimodal training datasets that pair images with conversational text for vision-language model training. This principle covers lazy data loading, image preprocessing, conversation tokenization with special image tokens, and label masking for autoregressive loss computation.
Description
Multimodal data preparation for LLaVA involves four core operations applied to each training sample:
- JSON conversation loading -- Training data is stored as a JSON list where each entry contains a "conversations" field (a list of user/assistant turns) and optionally an "image" field (a relative path to the image file).
- On-demand image loading and preprocessing -- Images are loaded from disk with PIL and preprocessed via CLIP's image processor (CLIPImageProcessor), which resizes images to 336x336 pixels and normalizes pixel values. Two aspect-ratio strategies are supported:
  - square -- Direct resize to a square (default for pretraining)
  - pad -- Pad to a square using the CLIP mean pixel value, then resize (used in finetuning to preserve aspect-ratio information)
- Tokenization with image token injection -- Conversation turns are tokenized with the LLM's tokenizer. The special token <image> (defined as DEFAULT_IMAGE_TOKEN) is placed at the start of the first user message. During tokenization this token is converted to IMAGE_TOKEN_INDEX = -200, which is later replaced with actual visual embeddings during the forward pass.
- Label masking -- Labels are constructed as a copy of input_ids, with all user-turn tokens masked to IGNORE_INDEX = -100. This ensures the autoregressive cross-entropy loss is computed only on assistant response tokens, teaching the model to generate appropriate responses rather than to memorize user prompts.
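The pad strategy can be illustrated by its geometry alone. This is a simplified sketch (the real code builds a square PIL image filled with the CLIP mean pixel value and pastes the original onto its center); the function name is illustrative, not LLaVA's:

```python
def pad_to_square_geometry(width, height):
    """Compute the square canvas side length and the paste offset
    used when padding an image to a square before resizing.

    The original image is centered on a square of side max(w, h);
    the surrounding border is filled with the CLIP mean pixel value.
    """
    side = max(width, height)
    offset_x = (side - width) // 2   # horizontal border on the left
    offset_y = (side - height) // 2  # vertical border on top
    return side, (offset_x, offset_y)
```

For a 640x480 landscape image this yields a 640x640 canvas with the image pasted at (0, 80), so the aspect ratio survives the subsequent 336x336 resize as padding rather than distortion.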
The lazy loading strategy tokenizes and processes each sample on access (in __getitem__) rather than preloading the entire dataset into memory. This trades compute for memory efficiency -- critical when training on datasets with hundreds of thousands of high-resolution images.
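A minimal sketch of the lazy pattern (illustrative only; LLaVA's LazySupervisedDataset additionally decodes the image and runs the tokenizer inside __getitem__):

```python
import json

class LazyDatasetSketch:
    """Keep only the lightweight JSON list in memory;
    process each sample on access, not up front."""

    def __init__(self, json_path=None, entries=None):
        # Cheap: parse the JSON index once. No images are touched here.
        if entries is None:
            with open(json_path) as f:
                entries = json.load(f)
        self.entries = entries

    def __len__(self):
        return len(self.entries)

    def __getitem__(self, i):
        entry = self.entries[i]
        # Expensive work (image decode, tokenization) would happen
        # here, per access, rather than in __init__.
        text = "\n".join(turn["value"] for turn in entry["conversations"])
        return {"text": text, "has_image": "image" in entry}
```

Because __getitem__ does the heavy lifting, memory usage stays proportional to the JSON index, not to the decoded dataset.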
Usage
Use this principle when preparing training data for any LLaVA training stage. The same dataset class (LazySupervisedDataset) handles both:
- Stage 1 pretraining -- Image-caption pairs from the 558K CC3M filtered subset. Uses the version=plain conversation format with a simple image-caption structure.
- Stage 2 finetuning -- Multi-turn visual instruction data from the 665K mixed dataset. Uses the version=v1 conversation format with full multi-turn dialogue.
The DataCollatorForSupervisedDataset handles batching by padding sequences to the longest in the batch and stacking images into a batch tensor.
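The padding behaviour can be sketched in plain Python (lists stand in for tensors here; the real collator pads token tensors and stacks image tensors into a batch):

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad each sequence to the longest in the batch and
    build an attention mask (1 = real token, 0 = padding).

    A simplified stand-in for what a supervised-finetuning
    collator does with input_ids (and, with pad_value=-100,
    with labels).
    """
    max_len = max(len(s) for s in sequences)
    padded, mask = [], []
    for s in sequences:
        pad = max_len - len(s)
        padded.append(list(s) + [pad_value] * pad)
        mask.append([1] * len(s) + [0] * pad)
    return padded, mask
```

Padding to the longest sequence in the batch (rather than a fixed maximum) keeps wasted computation low, which is also why length-grouped sampling, discussed below, pays off.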
Theoretical Basis
Each training sample is structured as a conversation paired with an optional image. The data processing pipeline operates as follows:
```
INPUT: JSON entry = {
  "image": "path/to/img.jpg",
  "conversations": [
    {"from": "human", "value": "<image>\nDescribe this image."},
    {"from": "gpt", "value": "The image shows..."}
  ]
}

STEP 1: Image Processing
  image = CLIP_Processor(PIL.Image.open(image_path))
  result: Tensor[3, 336, 336]  (normalized float)

STEP 2: Conversation Tokenization
  text = apply_conversation_template(conversations)
  input_ids = tokenizer(text)
  # <image> token -> IMAGE_TOKEN_INDEX (-200)

STEP 3: Label Masking
  labels = copy(input_ids)
  labels[user_turn_tokens] = IGNORE_INDEX (-100)
  # only assistant tokens contribute to loss

OUTPUT: {
  "input_ids": Tensor[seq_len],
  "labels":    Tensor[seq_len],
  "image":     Tensor[3, 336, 336]
}
```
The label masking strategy is essential for instruction tuning: by setting IGNORE_INDEX = -100 on user tokens, PyTorch's CrossEntropyLoss (which uses ignore_index=-100 by default) automatically excludes these positions from the loss computation. This means the model learns to generate assistant responses conditioned on user inputs and images, without being penalized for user turn tokens it does not need to predict.
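The ignore_index behaviour can be demonstrated without PyTorch. This plain-Python stand-in averages negative log-likelihood only over unmasked positions, matching what CrossEntropyLoss(ignore_index=-100) does with the masked labels:

```python
IGNORE_INDEX = -100

def masked_cross_entropy(log_probs, labels):
    """Mean negative log-likelihood over positions whose label
    is not IGNORE_INDEX.

    log_probs: per-position lists of log-probabilities per class.
    labels:    per-position target class index, or IGNORE_INDEX.
    """
    losses = [-lp[y] for lp, y in zip(log_probs, labels)
              if y != IGNORE_INDEX]
    # Masked positions are excluded from both the sum and the count,
    # so user-turn tokens contribute exactly nothing to the gradient.
    return sum(losses) / len(losses)
```

Note that masked positions shrink the denominator as well as the numerator: the loss is the mean over assistant tokens only, not a mean over the whole sequence with zeros filled in.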
For multimodal samples, the modality_lengths property returns positive lengths for image-containing samples and negative lengths for text-only samples. This sign convention enables the LengthGroupedSampler to separate modalities into distinct batches, reducing padding waste.
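The sign convention can be sketched as follows (a simplified stand-in for what the length-grouped sampler consumes; the function name is illustrative):

```python
def split_by_modality(modality_lengths):
    """Split sample indices by the sign convention: a positive length
    marks a multimodal (image-containing) sample, a negative length a
    text-only sample. Returns (index, true_length) pairs per modality.
    """
    multimodal, text_only = [], []
    for idx, length in enumerate(modality_lengths):
        bucket = multimodal if length > 0 else text_only
        bucket.append((idx, abs(length)))  # recover the true length
    return multimodal, text_only
```

Encoding the modality in the sign lets a single integer per sample carry both pieces of information the sampler needs: which batch pool the sample belongs to, and its length for length-grouped sorting within that pool.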