Implementation:Haotian liu LLaVA LLaVA Conversation Format
Overview
Interface specification for formatting custom training data in LLaVA's JSON conversation format.
Type
Pattern Doc
Description
This is a data authoring pattern, not a library API. Users must create a JSON file following the exact schema expected by LazySupervisedDataset. The dataset class loads the JSON file as a list of dicts and processes each sample lazily during training. No separate validation step exists; malformed data will cause runtime errors during tokenization.
Source
docs/Finetune_Custom_Data.md:L1-37
Interface Specification
The training data JSON file must be a list of sample objects. Each object conforms to the following schema:
[
{
"id": "unique_id",
"image": "image_filename.jpg",
"conversations": [
{"from": "human", "value": "<image>\nDescribe this image."},
{"from": "gpt", "value": "The image shows..."}
]
}
]
Schema Fields
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier for the sample |
| image | string | Yes | Relative path to image file (relative to --image_folder) |
| conversations | list[dict] | Yes | List of conversation turn objects |
| conversations[].from | string | Yes | Speaker role: "human" or "gpt" |
| conversations[].value | string | Yes | Turn content text |
Inputs
- Custom images (JPEG, PNG, or other PIL-compatible formats)
- Question-answer pairs for each image
Outputs
- JSON file consumable by LazySupervisedDataset
- Image folder containing all referenced images
Usage Examples
Single-Turn Conversation
A minimal single-turn example with one question and one answer:
[
{
"id": "997bb945-628d-4724-b370-b84de974a19f",
"image": "part-000001/997bb945-628d-4724-b370-b84de974a19f.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
},
{
"from": "gpt",
"value": "a beautiful painting of chernobyl by nekro, pascal blanche, john harris, greg rutkowski, sin jong hun, moebius, simon stalenhag. in style of cg art. ray tracing. cel shading. hyper detailed. realistic. ue 5. maya. octane render."
}
]
}
]
Multi-Turn Conversation
A multi-turn example with follow-up questions. Note that the <image> token appears only in the first human turn:
[
{
"id": "multi-turn-001",
"image": "photos/street_scene.jpg",
"conversations": [
{
"from": "human",
"value": "<image>\nWhat objects do you see in this image?"
},
{
"from": "gpt",
"value": "I can see a busy street with several cars, pedestrians walking on the sidewalk, and a traffic light at the intersection."
},
{
"from": "human",
"value": "What is the weather like in this scene?"
},
{
"from": "gpt",
"value": "The scene appears to be taken on a sunny day with clear blue skies and no visible clouds."
}
]
}
]
Key Rules
- The
<image>token must be present in the first human turn's "value" field. - Conversations must alternate between "human" and "gpt" turns.
- The first turn must be from "human".
- Image paths are relative to the --image_folder argument passed to the training script.
Metadata
| Field | Value |
|---|---|
| last_updated | 2026-02-13 14:00 GMT |
| source_repo | Haotian_liu_LLaVA |
| commit | 799f5f207c89 |
| type | Implementation (Pattern Doc) |