Principle:Haotian liu LLaVA Custom Dataset Formatting

From Leeroopedia

Overview

Data format specification for structuring custom visual question-answering data into LLaVA's expected conversation format.

Description

LLaVA requires training data in a specific JSON conversation format. Each sample is a dict with three required keys:

  • "id" -- A unique string identifier for the sample.
  • "image" -- The relative filename of the image (relative to the --image_folder argument).
  • "conversations" -- A list of turn dicts, each containing "from" and "value" keys.

Human turns use "from": "human", and the first human turn must include the <image> placeholder token in its "value" field. Assistant turns use "from": "gpt". This format is consumed directly by LazySupervisedDataset (defined in llava/train/train.py:L658), which loads the JSON file and lazily processes each sample during training.
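
As an illustration of the schema described above, the following minimal sketch builds one single-turn sample and serializes it. The id, image filename, and question/answer text are made up for the example, and it assumes the data file holds a JSON list of such samples.

```python
import json

# Hypothetical sample: the id, filename, and text are invented for illustration.
sample = {
    "id": "sample-0001",
    "image": "cats/0001.jpg",  # resolved relative to --image_folder
    "conversations": [
        {"from": "human", "value": "<image>\nWhat animal is in the picture?"},
        {"from": "gpt", "value": "A tabby cat sitting on a windowsill."},
    ],
}

# Assumption: the file passed to training is a JSON list of samples.
dataset_json = json.dumps([sample], indent=2)
```

Writing dataset_json to a file yields something that can be passed as the training data file, provided cats/0001.jpg exists under the image directory.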

The conversation list may hold a single human-GPT exchange or several. In multi-turn conversations, the <image> token should appear only in the first human turn. Each additional human-GPT pair contributes another assistant response as a training target.
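
The multi-turn rules above can be checked before training with a small validator. This is a sketch, not part of LLaVA: validate_conversation is a hypothetical helper, and it assumes turns strictly alternate human, gpt, human, gpt, starting with a human turn.

```python
def validate_conversation(conversations):
    """Return a list of problems found in a LLaVA-style conversation list."""
    if not conversations:
        return ["conversation is empty"]
    problems = []
    for i, turn in enumerate(conversations):
        # Assumption: even positions are human turns, odd positions are gpt turns.
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected from={expected!r}, got {turn.get('from')!r}")
        has_image = "<image>" in turn.get("value", "")
        if i == 0 and not has_image:
            problems.append("turn 0: first human turn is missing <image>")
        if i > 0 and has_image:
            problems.append(f"turn {i}: <image> should appear only in the first human turn")
    if len(conversations) % 2 != 0:
        problems.append("conversation does not end with a gpt turn")
    return problems
```

An empty return value means the conversation passed every check; otherwise each string names the offending turn.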

Usage

Use this pattern when preparing custom data for LoRA or full finetuning of LLaVA. All training data must conform to this schema. The JSON file path is passed via --data_path and the image directory via --image_folder in the training command.
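
Because each "image" value is resolved against the image directory, a quick pre-flight check can catch broken paths before a training run starts. The sketch below is illustrative only: find_missing_images is a hypothetical helper, and it assumes the data file is a JSON list of samples.

```python
import json
import os


def find_missing_images(data_path, image_folder):
    """Return (sample id, image value) pairs whose image file cannot be found."""
    with open(data_path, "r", encoding="utf-8") as f:
        samples = json.load(f)  # assumption: a JSON list of sample dicts
    missing = []
    for sample in samples:
        image = sample.get("image")
        # Join the relative filename onto the folder, as --image_folder implies.
        if image is None or not os.path.isfile(os.path.join(image_folder, image)):
            missing.append((sample.get("id"), image))
    return missing
```

Calling it with the same two paths given to --data_path and --image_folder returns an empty list when every referenced image is present.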

Theoretical Basis

The conversation format maps directly to tokenization: human turns become input context (masked from loss computation), and GPT turns become training targets. During tokenization, each <image> placeholder is replaced with the sentinel ID IMAGE_TOKEN_INDEX (-200). That sentinel is later swapped for visual embeddings produced by the CLIP vision tower and passed through the multimodal projector.
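
The placeholder substitution described above can be sketched as follows. This is not LLaVA's code: toy_tokenize is a hypothetical stand-in for a real tokenizer, tokenize_with_image is an illustrative name, and the sketch assumes one sentinel ID is emitted per <image> placeholder.

```python
IMAGE_TOKEN_INDEX = -200  # sentinel value named in the text above


def toy_tokenize(text):
    """Hypothetical tokenizer: one fake positive id (the word length) per word."""
    return [len(word) for word in text.split()]


def tokenize_with_image(prompt, image_token="<image>"):
    """Tokenize the text around each placeholder and splice in the sentinel id."""
    chunks = prompt.split(image_token)
    input_ids = []
    for i, chunk in enumerate(chunks):
        if i > 0:
            # A placeholder sat between the previous chunk and this one.
            input_ids.append(IMAGE_TOKEN_INDEX)
        input_ids.extend(toy_tokenize(chunk))
    return input_ids
```

A negative sentinel cannot collide with real vocabulary IDs, which makes the position easy to locate when the visual embeddings are spliced in later.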

This masking strategy ensures the model only learns to predict assistant responses, not to reproduce user queries, which aligns with the standard instruction-tuning objective for language models.
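
A minimal sketch of this masking follows, under simplifying assumptions: toy_tokenize is a hypothetical stand-in for a real tokenizer, build_labels is an illustrative name, and the conversation template and role prefixes that a real training pipeline adds are omitted. The value -100 is the default index ignored by PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def toy_tokenize(text):
    """Hypothetical tokenizer: one fake id (the word length) per word."""
    return [len(word) for word in text.split()]


def build_labels(conversations):
    """Return (input_ids, labels) where only gpt turns carry real labels."""
    input_ids, labels = [], []
    for turn in conversations:
        ids = toy_tokenize(turn["value"])
        input_ids.extend(ids)
        if turn["from"] == "gpt":
            labels.extend(ids)  # assistant tokens are training targets
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # human tokens are masked
    return input_ids, labels
```

The two lists stay the same length, so every input position has a label, and the loss only counts positions where the label is not IGNORE_INDEX.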

Knowledge Sources

Domains

  • Data_Engineering
  • Fine_Tuning

Metadata

Field         Value
last_updated  2026-02-13 14:00 GMT
source_repo   Haotian_liu_LLaVA
commit        799f5f207c89
type          Principle
