# Implementation: Haotian Liu LLaVA LazySupervisedDataset Init
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview

A concrete component of the LLaVA training pipeline for lazy-loading multimodal conversation datasets. LazySupervisedDataset is a PyTorch Dataset that loads JSON conversation data and processes each sample on demand, handling image loading, CLIP preprocessing, conversation tokenization with image token injection, and label masking.

## Description

LazySupervisedDataset implements a memory-efficient data loading strategy for multimodal training. On initialization, it loads the JSON metadata into memory (conversation structures and image paths), but defers all heavy processing -- image loading, CLIP preprocessing, tokenization, and label construction -- to the __getitem__ method, which is called per-sample during training.
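The lazy pattern described above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the actual LazySupervisedDataset: __init__ reads only the cheap JSON metadata, and all heavy per-sample work is deferred to __getitem__.

```python
import json
import os
import tempfile

class LazyDatasetSketch:
    """Illustrative sketch of the lazy-loading pattern (not the real class)."""

    def __init__(self, data_path):
        # Cheap: only the JSON metadata is read into memory here.
        with open(data_path) as f:
            self.list_data_dict = json.load(f)

    def __len__(self):
        return len(self.list_data_dict)

    def __getitem__(self, i):
        # The real code would load the image, tokenize the conversation,
        # and build masked labels here, per sample, at training time.
        rec = self.list_data_dict[i]
        return {"id": rec["id"], "has_image": "image" in rec}

# Demo with a throwaway metadata file.
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump([{"id": "0", "image": "a.jpg", "conversations": []}], f)
ds = LazyDatasetSketch(path)
print(len(ds), ds[0])  # 1 {'id': '0', 'has_image': True}
os.remove(path)
```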
Key behaviors:

- **Image loading** -- Opens images from disk via PIL, converts to RGB, and applies CLIP preprocessing. Supports two aspect ratio modes:
  - `square` -- Direct CLIP preprocessing (resize to 336x336)
  - `pad` -- Pads to square using the CLIP mean pixel color, then preprocesses
- **Conversation preprocessing** -- Applies the appropriate conversation template (plain for pretraining, v1 for finetuning) and injects the `<image>` token at the start of the first user turn.
- **Tokenization** -- Converts the formatted conversation to token IDs using `tokenizer_image_token()`, which handles the special `IMAGE_TOKEN_INDEX = -200` placeholder.
- **Label masking** -- Copies `input_ids` to `labels` and masks user-turn tokens with `IGNORE_INDEX = -100`.
- **Modality-aware length reporting** -- The `modality_lengths` property returns positive lengths for samples with images and negative lengths for text-only samples, enabling modality-grouped batching.
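The signed-length convention can be made concrete with a short sketch. This is an illustrative plain-Python approximation of the `modality_lengths` idea (word counts as a token-length proxy, as in the `lengths` property), not the exact LLaVA implementation:

```python
def modality_lengths(list_data_dict):
    """Signed lengths: positive = sample has an image, negative = text-only."""
    lengths = []
    for sample in list_data_dict:
        # Word-count proxy for sequence length across all turns.
        cur_len = sum(len(conv["value"].split()) for conv in sample["conversations"])
        # Flip the sign for text-only samples so a batch sampler can
        # group same-modality samples together.
        lengths.append(cur_len if "image" in sample else -cur_len)
    return lengths

samples = [
    {"image": "a.jpg", "conversations": [
        {"from": "human", "value": "What is this?"},
        {"from": "gpt", "value": "A cat."}]},
    {"conversations": [
        {"from": "human", "value": "Hello there"},
        {"from": "gpt", "value": "Hi"}]},
]
print(modality_lengths(samples))  # [5, -3]
```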
## Usage

LazySupervisedDataset is instantiated internally by `make_supervised_data_module()` and is not typically imported directly:

```python
from llava.train.train import LazySupervisedDataset
```
## Code Reference

### Source Location

- Repository: https://github.com/haotian-liu/LLaVA
- File: `llava/train/train.py`, lines 658--739
- Related: `make_supervised_data_module()` at lines 776--785
### Signature

```python
class LazySupervisedDataset(Dataset):
    """Dataset for supervised fine-tuning."""

    def __init__(self, data_path: str,
                 tokenizer: transformers.PreTrainedTokenizer,
                 data_args: DataArguments):
        super(LazySupervisedDataset, self).__init__()
        list_data_dict = json.load(open(data_path, "r"))

        rank0_print("Formatting inputs...Skip in lazy mode")
        self.tokenizer = tokenizer
        self.list_data_dict = list_data_dict
        self.data_args = data_args

    def __len__(self):
        return len(self.list_data_dict)

    @property
    def lengths(self):
        ...  # Returns word-count-based lengths for each sample

    @property
    def modality_lengths(self):
        ...  # Returns signed lengths (positive=image, negative=text-only)

    def __getitem__(self, i) -> Dict[str, torch.Tensor]:
        ...  # On-demand processing: image load + tokenize + mask labels
```
### Import

```python
from llava.train.train import LazySupervisedDataset
```

Note: This class is internal to the training pipeline. It is instantiated by `make_supervised_data_module()` and passed to `LLaVATrainer`.
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| `data_path` | `str` | Yes | Path to the JSON file containing conversation data. Each entry has `"conversations"` and optionally `"image"`. |
| `tokenizer` | `PreTrainedTokenizer` | Yes | HuggingFace tokenizer for the base LLM (e.g., Vicuna tokenizer). |
| `data_args` | `DataArguments` | Yes | Dataclass containing the image folder, image processor, and preprocessing options (e.g., `image_aspect_ratio`, `is_multimodal`). |
### Outputs

| Name | Type | Description |
|---|---|---|
| `input_ids` | `torch.Tensor [seq_len]` | Tokenized conversation with `IMAGE_TOKEN_INDEX = -200` placeholders for image positions. |
| `labels` | `torch.Tensor [seq_len]` | Copy of `input_ids` with user-turn tokens masked to `IGNORE_INDEX = -100`. |
| `image` | `torch.Tensor [3, 336, 336]` | CLIP-preprocessed image tensor. For text-only samples when `is_multimodal=True`, a zero tensor of the same shape is returned. |
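The `labels` contract can be illustrated with a minimal masking sketch. Token IDs, the helper name `mask_user_turns`, and the turn boundaries here are illustrative; the real code derives user-turn spans from the conversation template rather than taking them as an argument.

```python
IGNORE_INDEX = -100      # positions ignored by the loss
IMAGE_TOKEN_INDEX = -200  # placeholder for image positions

def mask_user_turns(input_ids, user_spans):
    """Copy input_ids, overwriting user-turn positions with IGNORE_INDEX
    so the loss is computed only on assistant tokens."""
    labels = list(input_ids)
    for start, end in user_spans:
        for j in range(start, end):
            labels[j] = IGNORE_INDEX
    return labels

# Toy sequence: image placeholder + 3 user tokens, then 3 assistant tokens.
input_ids = [IMAGE_TOKEN_INDEX, 11, 12, 13, 21, 22, 23]
labels = mask_user_turns(input_ids, user_spans=[(0, 4)])
print(labels)  # [-100, -100, -100, -100, 21, 22, 23]
```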
## Usage Examples

### Example 1: Internal Usage via make_supervised_data_module()

From `llava/train/train.py` lines 776--785 -- how the dataset is instantiated during training.

```python
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer,
                                data_args) -> Dict:
    """Make dataset and collator for supervised fine-tuning."""
    train_dataset = LazySupervisedDataset(
        tokenizer=tokenizer,
        data_path=data_args.data_path,
        data_args=data_args
    )
    data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
    return dict(
        train_dataset=train_dataset,
        eval_dataset=None,
        data_collator=data_collator
    )
```
### Example 2: Expected JSON Data Format

The JSON file at `data_path` should contain entries in this format:

```json
[
  {
    "id": "000000033471",
    "image": "coco/train2017/000000033471.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat are the colors of the bus in the image?"},
      {"from": "gpt", "value": "The bus in the image is white and red."}
    ]
  }
]
```
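A quick sanity check for this schema can catch malformed entries before training starts. This validator is a sketch: the field names follow the example above, but the helper `validate_entry` and its rules are illustrative, not part of LLaVA.

```python
import json

def validate_entry(entry):
    """Illustrative schema check for one dataset entry."""
    assert "conversations" in entry and entry["conversations"], "missing conversations"
    for turn in entry["conversations"]:
        assert turn["from"] in {"human", "gpt"}, f"unexpected speaker {turn['from']}"
        assert isinstance(turn["value"], str)
    # "image" is optional: text-only samples simply omit it.
    return True

data = json.loads("""[
  {"id": "000000033471",
   "image": "coco/train2017/000000033471.jpg",
   "conversations": [
     {"from": "human", "value": "<image>\\nWhat are the colors of the bus in the image?"},
     {"from": "gpt", "value": "The bus in the image is white and red."}]}
]""")
print(all(validate_entry(e) for e in data))  # True
```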
### Example 3: Image Aspect Ratio Handling

When `image_aspect_ratio="pad"`, the image is padded to a square before CLIP preprocessing:

```python
# Inside __getitem__ (lines 702-716)
if self.data_args.image_aspect_ratio == 'pad':
    def expand2square(pil_img, background_color):
        width, height = pil_img.size
        if width == height:
            return pil_img
        elif width > height:
            result = Image.new(pil_img.mode, (width, width), background_color)
            result.paste(pil_img, (0, (width - height) // 2))
            return result
        else:
            result = Image.new(pil_img.mode, (height, height), background_color)
            result.paste(pil_img, ((height - width) // 2, 0))
            return result
    image = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
    image = processor.preprocess(image, return_tensors='pt')['pixel_values'][0]
```
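The padding geometry can be checked without PIL. This pure-Python stand-in (the function `pad_geometry` is a hypothetical helper, not library code) computes the square canvas side and the paste offset for the same three cases as `expand2square`:

```python
def pad_geometry(width, height):
    """Return (canvas_side, paste_x, paste_y) matching expand2square's logic."""
    if width == height:
        return width, 0, 0          # already square, no padding
    if width > height:
        return width, 0, (width - height) // 2   # pad top/bottom
    return height, (height - width) // 2, 0      # pad left/right

print(pad_geometry(640, 480))  # (640, 0, 80)  -- landscape: centered vertically
print(pad_geometry(480, 640))  # (640, 80, 0)  -- portrait: centered horizontally
print(pad_geometry(336, 336))  # (336, 0, 0)   -- square: unchanged
```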