# Implementation: Haotian Liu LLaVA LazySupervisedDataset Init
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview

A concrete component of the LLaVA training pipeline for lazy-loading multimodal conversation datasets. LazySupervisedDataset is a PyTorch Dataset that loads JSON conversation data and processes each sample on demand, handling image loading, CLIP preprocessing, conversation tokenization with image token injection, and label masking.

## Description

LazySupervisedDataset implements a memory-efficient data loading strategy for multimodal training. On initialization, it loads the JSON metadata into memory (conversation structures and image paths), but defers all heavy processing -- image loading, CLIP preprocessing, tokenization, and label construction -- to the __getitem__ method, which is called per-sample during training.
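The lazy pattern described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the actual LazySupervisedDataset: __init__ reads only the cheap JSON metadata, and all heavy per-sample work is deferred to __getitem__.

```python
import json
import os
import tempfile

class LazyDatasetSketch:
    """Illustrative sketch of the lazy-loading pattern (not the real class)."""

    def __init__(self, data_path):
        # Cheap: only the JSON metadata is read into memory here.
        with open(data_path) as f:
            self.list_data_dict = json.load(f)

    def __len__(self):
        return len(self.list_data_dict)

    def __getitem__(self, i):
        # The real code would load the image, tokenize the conversation,
        # and build masked labels here, per sample, at training time.
        rec = self.list_data_dict[i]
        return {"id": rec["id"], "has_image": "image" in rec}

# Demo with a throwaway metadata file.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump([{"id": "0", "image": "a.jpg", "conversations": []}], f)
ds = LazyDatasetSketch(path)
print(len(ds), ds[0])  # 1 {'id': '0', 'has_image': True}
os.remove(path)
```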
Key behaviors:

- **Image loading** -- Opens images from disk via PIL, converts to RGB, and applies CLIP preprocessing. Supports two aspect ratio modes:
  - `square` -- Direct CLIP preprocessing (resize to 336x336)
  - `pad` -- Pads to square using the CLIP mean pixel color, then preprocesses
- **Conversation preprocessing** -- Applies the appropriate conversation template (plain for pretraining, v1 for finetuning) and injects the `<image>` token at the start of the first user turn.
- **Tokenization** -- Converts the formatted conversation to token IDs using `tokenizer_image_token()`, which handles the special `IMAGE_TOKEN_INDEX = -200` placeholder.
- **Label masking** -- Copies `input_ids` to `labels` and masks user-turn tokens with `IGNORE_INDEX = -100`.
- **Modality-aware length reporting** -- The `modality_lengths` property returns positive lengths for samples with images and negative lengths for text-only samples, enabling modality-grouped batching.
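The signed-length convention can be made concrete with a short sketch. This is an illustrative plain-Python approximation of the `modality_lengths` idea (word counts as a token-length proxy, as in the `lengths` property), not the exact LLaVA implementation:

```python
def modality_lengths(list_data_dict):
    """Signed lengths: positive = sample has an image, negative = text-only."""
    lengths = []
    for sample in list_data_dict:
        # Word-count proxy for sequence length across all turns.
        cur_len = sum(len(conv["value"].split()) for conv in sample["conversations"])
        # Flip the sign for text-only samples so a batch sampler can
        # group same-modality samples together.
        lengths.append(cur_len if "image" in sample else -cur_len)
    return lengths

samples = [
    {"image": "a.jpg", "conversations": [
        {"from": "human", "value": "What is this?"},
        {"from": "gpt", "value": "A cat."}]},
    {"conversations": [
        {"from": "human", "value": "Hello there"},
        {"from": "gpt", "value": "Hi"}]},
]
print(modality_lengths(samples))  # [5, -3]
```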
## Usage

LazySupervisedDataset is instantiated internally by `make_supervised_data_module()` and is not typically imported directly:

```python
from llava.train.train import LazySupervisedDataset
```
## Code Reference

### Source Location

- Repository: https://github.com/haotian-liu/LLaVA
- File: `llava/train/train.py`, lines 658--739
- Related: `make_supervised_data_module()` at lines 776--785
### Signature

```python
class LazySupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, data_path: str,
                 tokenizer: transformers.PreTrainedTokenizer,
                 data_args: DataArguments):
        super(LazySupervisedDataset, self).__init__()
        list_data_dict = json.load(open(data_path, "r"))

        rank0_print("Formatting inputs...Skip in lazy mode")
        self.tokenizer = tokenizer
        self.list_data_dict = list_data_dict
        self.data_args = data_args

    def __len__(self):
        return len(self.list_data_dict)

    @property
    def lengths(self):
        ...  # Returns word-count-based lengths for each sample

    @property
    def modality_lengths(self):
        ...  # Returns signed lengths (positive=image, negative=text-only)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...  # On-demand processing: image load + tokenize + mask labels
```
### Import

```python
from llava.train.train import LazySupervisedDataset
```

Note: This class is internal to the training pipeline. It is instantiated by `make_supervised_data_module()` and passed to `LLaVATrainer`.
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| `data_path` | `str` | Yes | Path to the JSON file containing conversation data. Each entry has `"conversations"` and optionally `"image"`. |
| `tokenizer` | `PreTrainedTokenizer` | Yes | HuggingFace tokenizer for the base LLM (e.g., Vicuna tokenizer). |
| `data_args` | `DataArguments` | Yes | Dataclass containing the image folder, image processor, and preprocessing options (e.g., `image_aspect_ratio`, `is_multimodal`). |
### Outputs

| Name | Type | Description |
|---|---|---|
| `input_ids` | `torch.Tensor [seq_len]` | Tokenized conversation with `IMAGE_TOKEN_INDEX = -200` placeholders for image positions. |
| `labels` | `torch.Tensor [seq_len]` | Copy of `input_ids` with user-turn tokens masked to `IGNORE_INDEX = -100`. |
| `image` | `torch.Tensor [3, 336, 336]` | CLIP-preprocessed image tensor. For text-only samples when `is_multimodal=True`, a zero tensor of the same shape is returned. |
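The `labels` contract can be illustrated with a minimal masking sketch. Token IDs, the helper name `mask_user_turns`, and the turn boundaries here are illustrative; the real code derives user-turn spans from the conversation template rather than taking them as an argument.

```python
IGNORE_INDEX = -100      # positions ignored by the loss
IMAGE_TOKEN_INDEX = -200  # placeholder for image positions

def mask_user_turns(input_ids, user_spans):
    """Copy input_ids, overwriting user-turn positions with IGNORE_INDEX
    so the loss is computed only on assistant tokens."""
    labels = list(input_ids)
    for start, end in user_spans:
        for j in range(start, end):
            labels[j] = IGNORE_INDEX
    return labels

# Toy sequence: image placeholder + 3 user tokens, then 3 assistant tokens.
input_ids = [IMAGE_TOKEN_INDEX, 11, 12, 13, 21, 22, 23]
labels = mask_user_turns(input_ids, user_spans=[(0, 4)])
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```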
## Usage Examples

### Example 1: Internal Usage via make_supervised_data_module()

From `llava/train/train.py` lines 776--785 -- how the dataset is instantiated during training.

```python
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer,
                                data_args) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = LazySupervisedDataset(
        tokenizer=tokenizer,
        data_path=data_args.data_path,
        data_args=data_args
    )
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(
        train_dataset=train_dataset,
        eval_dataset=None,
        data_collator=data_collator
    )
```
### Example 2: Expected JSON Data Format

The JSON file at `data_path` should contain entries in this format:

```json
[
  {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
      {"from": "gpt", "value": "The bus in the image is white and red."}
    ]
  }
]
```
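A quick sanity check for this schema can catch malformed entries before training starts. This validator is a sketch: the field names follow the example above, but the helper `validate_entry` and its rules are illustrative, not part of LLaVA.

```python
import json

def validate_entry(entry):
    """Illustrative schema check for one dataset entry."""
    assert "conversations" in entry and entry["conversations"], "missing conversations"
    for turn in entry["conversations"]:
        assert turn["from"] in {"human", "gpt"}, f"unexpected speaker {turn['from']}"
        assert isinstance(turn["value"], str)
    # "image" is optional: text-only samples simply omit it.
    return True

data = json.loads("""[
  {"id": "000000033471",
   "image": "coco/train2017/000000033471.jpg",
   "conversations": [
     {"from": "human", "value": "<image>\\nWhat are the colors of the bus in the image?"},
     {"from": "gpt", "value": "The bus in the image is white and red."}]}
]""")
print(all(validate_entry(e) for e in data))  # True
```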
### Example 3: Image Aspect Ratio Handling

When `image_aspect_ratio="pad"`, the image is padded to a square before CLIP preprocessing:

```python
# Inside __getitem__ (lines 702-716)
if self.data_args.image_aspect_ratio == 'pad':
    def expand2square(pil_img, background_color):
        width, height = pil_img.size
        if width == height:
            return pil_img
        elif width > height:
            result = Image.new(pil_img.mode, (width, width), background_color)
            result.paste(pil_img, (0, (width - height) // 2))
            return result
        else:
            result = Image.new(pil_img.mode, (height, height), background_color)
            result.paste(pil_img, ((height - width) // 2, 0))
            return result
    image = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
    image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
```
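The padding geometry can be checked without PIL. This pure-Python stand-in (the function `pad_geometry` is a hypothetical helper, not library code) computes the square canvas side and the paste offset for the same three cases as `expand2square`:

```python
def pad_geometry(width, height):
    """Return (canvas_side, paste_x, paste_y) matching expand2square's logic."""
    if width == height:
        return width, 0, 0          # already square, no padding
    if width > height:
        return width, 0, (width - height) // 2   # pad top/bottom
    return height, (height - width) // 2, 0      # pad left/right

print(pad_geometry(640, 480))  # (640, 0, 80)  -- landscape: centered vertically
print(pad_geometry(480, 640))  # (640, 80, 0)  -- portrait: centered horizontally
print(pad_geometry(336, 336))  # (336, 0, 0)   -- square: unchanged
```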