Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Haotian liu LLaVA LLaVA Conversation Format

From Leeroopedia

Overview

Interface specification for formatting custom training data in LLaVA's JSON conversation format.

Type

Pattern Doc

Description

This is a data authoring pattern, not a library API. Users must create a JSON file following the exact schema expected by LazySupervisedDataset. The dataset class loads the JSON file as a list of dicts and processes each sample lazily during training. No separate validation step exists; malformed data will cause runtime errors during tokenization.

Source

docs/Finetune_Custom_Data.md:L1-37

Interface Specification

The training data JSON file must be a list of sample objects. Each object conforms to the following schema:

[
  {
    "id": "unique_id",
    "image": "image_filename.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe this image."},
      {"from": "gpt", "value": "The image shows..."}
    ]
  }
]

Schema Fields

Field Type Required Description
id string Yes Unique identifier for the sample
image string Yes Relative path to image file (relative to --image_folder)
conversations list[dict] Yes List of conversation turn objects
conversations[].from string Yes Speaker role: "human" or "gpt"
conversations[].value string Yes Turn content text

Inputs

  • Custom images (JPEG, PNG, or other PIL-compatible formats)
  • Question-answer pairs for each image

Outputs

  • JSON file consumable by LazySupervisedDataset
  • Image folder containing all referenced images

Usage Examples

Single-Turn Conversation

A minimal single-turn example with one question and one answer:

[
  {
    "id": "997bb945-628d-4724-b370-b84de974a19f",
    "image": "part-000001/997bb945-628d-4724-b370-b84de974a19f.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWrite a prompt for Stable Diffusion to generate this image."
      },
      {
        "from": "gpt",
        "value": "a beautiful painting of chernobyl by nekro, pascal blanche, john harris, greg rutkowski, sin jong hun, moebius, simon stalenhag. in style of cg art. ray tracing. cel shading. hyper detailed. realistic. ue 5. maya. octane render."
      }
    ]
  }
]

Multi-Turn Conversation

A multi-turn example with follow-up questions. Note that the <image> token appears only in the first human turn:

[
  {
    "id": "multi-turn-001",
    "image": "photos/street_scene.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nWhat objects do you see in this image?"
      },
      {
        "from": "gpt",
        "value": "I can see a busy street with several cars, pedestrians walking on the sidewalk, and a traffic light at the intersection."
      },
      {
        "from": "human",
        "value": "What is the weather like in this scene?"
      },
      {
        "from": "gpt",
        "value": "The scene appears to be taken on a sunny day with clear blue skies and no visible clouds."
      }
    ]
  }
]

Key Rules

  • The <image> token must be present in the first human turn's "value" field.
  • Conversations must alternate between "human" and "gpt" turns.
  • The first turn must be from "human".
  • Image paths are relative to the --image_folder argument passed to the training script.

Metadata

Field Value
last_updated 2026-02-13 14:00 GMT
source_repo Haotian_liu_LLaVA
commit 799f5f207c89
type Implementation (Pattern Doc)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment