Implementation:Haotian liu LLaVA LLaVA Conversation Format

Overview

Interface specification for formatting custom training data in LLaVA's JSON conversation format.

Type

Pattern Doc

Description

This is a data authoring pattern, not a library API. Users must create a JSON file following the exact schema expected by LazySupervisedDataset. The dataset class loads the JSON file as a list of dicts and processes each sample lazily during training. No separate validation step exists; malformed data will cause runtime errors during tokenization.

Source

docs/Finetune_Custom_Data.md:L1-37

Interface Specification

The training data JSON file must be a list of sample objects. Each object conforms to the following schema:

[
  {
    "id": "unique_id",
    "image": "image_filename.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe this image."},
      {"from": "gpt", "value": "The image shows..."}
    ]
  }
]

Schema Fields

Field	Type	Required	Description
id	string	Yes	Unique identifier for the sample
image	string	Yes	Relative path to image file (relative to --image_folder)
conversations	list[dict]	Yes	List of conversation turn objects
conversations[].from	string	Yes	Speaker role: "human" or "gpt"
conversations[].value	string	Yes	Turn content text

Inputs

Custom images (JPEG, PNG, or other PIL-compatible formats)
Question-answer pairs for each image

Outputs

JSON file consumable by LazySupervisedDataset
Image folder containing all referenced images

Usage Examples

Single-Turn Conversation

A minimal single-turn example with one question and one answer:

[
  {
    "id": "997bb945-628d-4724-b370-b84de974a19f",
    "image": "part-000001/997bb945-628d-4724-b370-b84de974a19f.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
      },
      {
        "from": "gpt",
        "value": "a beautiful painting of chernobyl by nekro, pascal blanche, john harris, greg rutkowski, sin jong hun, moebius, simon stalenhag. in style of cg art. ray tracing. cel shading. hyper detailed. realistic. ue 5. maya. octane render."
      }
    ]
  }
]

Multi-Turn Conversation

A multi-turn example with follow-up questions. Note that the <image> token appears only in the first human turn:

[
  {
    "id": "multi-turn-001",
    "image": "photos/street_scene.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat objects do you see in this image?"
      },
      {
        "from": "gpt",
        "value": "I can see a busy street with several cars, pedestrians walking on the sidewalk, and a traffic light at the intersection."
      },
      {
        "from": "human",
        "value": "What is the weather like in this scene?"
      },
      {
        "from": "gpt",
        "value": "The scene appears to be taken on a sunny day with clear blue skies and no visible clouds."
      }
    ]
  }
]

Key Rules

The <image> token must be present in the first human turn's "value" field.
Conversations must alternate between "human" and "gpt" turns.
The first turn must be from "human".
Image paths are relative to the --image_folder argument passed to the training script.

Metadata

Field	Value
last_updated	2026-02-13 14:00 GMT
source_repo	Haotian_liu_LLaVA
commit	799f5f207c89
type	Implementation (Pattern Doc)

Related Pages

implements Principle:Haotian_liu_LLaVA_Custom_Dataset_Formatting

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment