
Implementation:Haotian Liu LLaVA LazySupervisedDataset Init


Overview

Concrete tool, provided by the LLaVA training pipeline, for lazy-loading multimodal conversation datasets. LazySupervisedDataset is a PyTorch Dataset that loads JSON conversation data and processes each sample on demand, handling image loading, CLIP preprocessing, conversation tokenization with image token injection, and label masking.

Description

LazySupervisedDataset implements a memory-efficient data loading strategy for multimodal training. On initialization, it loads the JSON metadata into memory (conversation structures and image paths), but defers all heavy processing -- image loading, CLIP preprocessing, tokenization, and label construction -- to the __getitem__ method, which is called once per sample during training.

Key behaviors:

  • Image loading -- Opens images from disk via PIL, converts to RGB, and applies CLIP preprocessing. Supports two aspect ratio modes:
    • square -- Direct CLIP preprocessing (resize to 336x336)
    • pad -- Pads to square using CLIP mean pixel color, then preprocesses
  • Conversation preprocessing -- Applies the appropriate conversation template (plain for pretraining, v1 for finetuning) and injects the <image> token at the start of the first user turn.
  • Tokenization -- Converts the formatted conversation to token IDs using tokenizer_image_token(), which handles the special IMAGE_TOKEN_INDEX = -200 placeholder (see the sketch after this list).
  • Label masking -- Copies input_ids to labels and masks user turn tokens with IGNORE_INDEX = -100.
  • Modality-aware length reporting -- The modality_lengths property returns positive lengths for samples with images and negative lengths for text-only samples, enabling modality-grouped batching.
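A minimal sketch of the image-token injection step, assuming a split-and-splice behavior for tokenizer_image_token() (a simplified illustration, not the verbatim llava/mm_utils.py implementation):

import torch

IMAGE_TOKEN_INDEX = -200

def tokenizer_image_token_sketch(prompt, tokenizer):
    # Tokenize the text on either side of each <image> placeholder separately.
    chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
    input_ids = list(chunks[0])  # first chunk keeps its BOS token
    for chunk in chunks[1:]:
        input_ids.append(IMAGE_TOKEN_INDEX)  # sentinel later swapped for image features
        # Drop the BOS token the tokenizer prepends to each later chunk.
        if chunk and chunk[0] == tokenizer.bos_token_id:
            chunk = chunk[1:]
        input_ids.extend(chunk)
    return torch.tensor(input_ids, dtype=torch.long)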

Usage

LazySupervisedDataset is instantiated internally by make_supervised_data_module() and is not typically imported directly. When needed, it can be imported as:

from llava.train.train import LazySupervisedDataset

Code Reference

Source Location

llava/train/train.py

Signature

class LazySupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, data_path: str,
                 tokenizer: transformers.PreTrainedTokenizer,
                 data_args: DataArguments):
        super(LazySupervisedDataset, self).__init__()
        list_data_dict = json.load(open(data_path, "r"))
        rank0_print("Formatting inputs...Skip in lazy mode")
        self.tokenizer = tokenizer
        self.list_data_dict = list_data_dict
        self.data_args = data_args

    def __len__(self):
        return len(self.list_data_dict)

    @property
    def lengths(self):
        ...  # Returns word-count-based lengths for each sample

    @property
    def modality_lengths(self):
        ...  # Returns signed lengths (positive=image, negative=text-only)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...  # On-demand processing: image load + tokenize + mask labels
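For reference, a sketch of the two elided length properties, assumed to mirror the upstream logic (the flat 128-token budget for image features is an assumption):

@property
def lengths(self):
    # Word-count proxy for each sample's length; image samples get a flat
    # per-image token budget (128 here -- an assumed constant) added on top.
    length_list = []
    for sample in self.list_data_dict:
        img_tokens = 128 if 'image' in sample else 0
        length_list.append(sum(len(conv['value'].split())
                               for conv in sample['conversations']) + img_tokens)
    return length_list

@property
def modality_lengths(self):
    # Sign encodes modality: positive for image samples, negative for
    # text-only, so a length-grouped sampler can batch by modality.
    length_list = []
    for sample in self.list_data_dict:
        cur_len = sum(len(conv['value'].split())
                      for conv in sample['conversations'])
        length_list.append(cur_len if 'image' in sample else -cur_len)
    return length_list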

Import

from llava.train.train import LazySupervisedDataset

Note: This class is internal to the training pipeline. It is instantiated by make_supervised_data_module() and passed to LLaVATrainer.

I/O Contract

Inputs

Input Contract

  • data_path (str, required) -- Path to the JSON file containing conversation data. Each entry has "conversations" and optionally "image".
  • tokenizer (PreTrainedTokenizer, required) -- HuggingFace tokenizer for the base LLM (e.g., the Vicuna tokenizer).
  • data_args (DataArguments, required) -- Dataclass containing:
    • image_folder (str) -- Base directory for image files
    • image_aspect_ratio (str) -- "square" or "pad"
    • image_processor (CLIPImageProcessor) -- CLIP preprocessing pipeline
    • is_multimodal (bool) -- Whether to process images

Outputs

Output Contract (per __getitem__ call)

  • input_ids (torch.Tensor [seq_len]) -- Tokenized conversation with IMAGE_TOKEN_INDEX = -200 placeholders at image positions.
  • labels (torch.Tensor [seq_len]) -- Copy of input_ids with user-turn tokens masked to IGNORE_INDEX = -100.
  • image (torch.Tensor [3, 336, 336]) -- CLIP-preprocessed image tensor. For text-only samples when is_multimodal=True, a zero tensor of the same shape is returned.
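The -100 mask value matters because PyTorch's cross-entropy loss skips such positions; a standalone illustration:

import torch
import torch.nn as nn

IGNORE_INDEX = -100  # also PyTorch's default ignore_index
loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE_INDEX)

logits = torch.randn(6, 32000)                         # [seq_len, vocab_size]
labels = torch.tensor([-100, -100, 310, 367, 292, 2])  # user-turn tokens masked
loss = loss_fn(logits, labels)  # only the four unmasked positions contribute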

Usage Examples

Example 1: Internal Usage via make_supervised_data_module()

From llava/train/train.py lines 776--785 -- how the dataset is instantiated during training.

def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer,
                                data_args) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = LazySupervisedDataset(
        tokenizer=tokenizer,
        data_path=data_args.data_path,
        data_args=data_args
    )
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(
        train_dataset=train_dataset,
        eval_dataset=None,
        data_collator=data_collator
    )
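In the upstream training script, the returned dict is unpacked straight into the trainer:

trainer = LLaVATrainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)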

Example 2: Expected JSON Data Format

The JSON file at data_path should contain entries in this format:

[
    {
        "id": "000000033471",
        "image": "coco/train2017/000000033471.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
            {"from": "gpt", "value": "The bus in the image is white and red."}
        ]
    }
]
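A quick sanity check of such a file can be written as follows (the file path is hypothetical):

import json

with open("data/llava_instruct.json") as f:  # hypothetical path
    data = json.load(f)

for entry in data:
    assert "conversations" in entry              # required on every sample
    for turn in entry["conversations"]:
        assert turn["from"] in ("human", "gpt")  # alternating user/assistant turns
    # "image" is optional: text-only samples simply omit the key.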

Example 3: Image Aspect Ratio Handling

When image_aspect_ratio="pad", the image is padded to a square before CLIP preprocessing:

# Inside __getitem__ (lines 702-716)
if self.data_args.image_aspect_ratio == 'pad':
    def expand2square(pil_img, background_color):
        width, height = pil_img.size
        if width == height:
            return pil_img
        elif width > height:
            result = Image.new(pil_img.mode, (width, width), background_color)
            result.paste(pil_img, (0, (width - height) // 2))
            return result
        else:
            result = Image.new(pil_img.mode, (height, height), background_color)
            result.paste(pil_img, ((height - width) // 2, 0))
            return result
    image = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
    image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
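Padding with the CLIP mean pixel color, rather than black, means the padded border maps to approximately zero after CLIP's per-channel mean/std normalization, so it contributes little spurious signal to the image features.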
