Principle:Haotian liu LLaVA Evaluation Data Setup

Overview

Process for downloading and organizing evaluation benchmark datasets into a standardized directory structure required by LLaVA's evaluation pipeline.

Description

LLaVA evaluation requires 12+ benchmark datasets organized in a specific directory layout under ./playground/data/eval/. Each benchmark has its own subdirectory containing question JSONL files, annotation files, and image directories. The supported benchmarks include:

VQAv2 - Visual Question Answering v2 (test2015 images + question JSONL)
GQA - Real-world visual reasoning and compositional question answering (images + evaluation scripts)
TextVQA - Text-based Visual Question Answering (val JSON + train/val images)
POPE - Polling-based Object Probing Evaluation (COCO annotation JSONs)
MME - Multimodal Evaluation benchmark (images + eval_tool)
MMBench - Multimodal Benchmark (TSV annotation + images)
SEED-Bench - SEED Benchmark (images + video frames)
LLaVA-Bench-in-the-Wild - Qualitative evaluation (questions JSONL + context JSONL + images)
ScienceQA - Science Question Answering (images + pid_splits.json + problems.json)
VizWiz - Visual Question Answering for the visually impaired (test.json + test images)
Q-Bench - Benchmark for low-level visual perception and image quality understanding (question JSON + images)
MM-Vet - Evaluation of large multimodal models on integrated capabilities (images + annotations)

Setup begins by downloading the shared eval.zip archive from Google Drive. The archive contains custom annotations, evaluation scripts, and reference prediction files from LLaVA v1.5. Extracting it to ./playground/data/eval/ provides the base directory structure. Each benchmark's image dataset is then downloaded separately and placed in the matching subdirectory.
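
The extraction step can be sketched as follows. This is a minimal sketch, assuming eval.zip has already been downloaded manually to a local path. The helper name `extract_eval_archive` and its default target are illustrative and not part of LLaVA. The internal layout of the archive is not described here, so the function returns the top-level entries for inspection rather than assuming them.

```python
import zipfile
from pathlib import Path


def extract_eval_archive(archive_path, target_dir="./playground/data/eval"):
    """Extract a locally downloaded eval.zip into target_dir (hypothetical helper)."""
    archive = Path(archive_path)
    target = Path(target_dir)
    if not zipfile.is_zipfile(archive):
        raise ValueError(f"{archive} is not a valid zip archive")
    # Create the target directory (and parents) if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
        # Report the top-level entries so the caller can inspect the layout.
        return sorted({name.split("/")[0] for name in zf.namelist()})
```

Calling `extract_eval_archive("eval.zip")` from the repository root would place the archive contents under ./playground/data/eval/. If the returned list shows a single wrapping folder, the contents may need to be moved up one level.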

Usage

This setup is required before running any benchmark evaluation. The datasets need only be downloaded once and can be reused across multiple evaluation runs with different model checkpoints. The standardized directory structure ensures that all evaluation shell scripts (under scripts/v1_5/eval/) can locate their required data files without modification.
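
Because the evaluation scripts expect data at fixed locations, a quick presence check before the first run can catch a misplaced download. The following is a minimal sketch: `find_missing_benchmarks` is a hypothetical helper, and the subdirectory names in the example are assumptions that should be confirmed against the scripts under scripts/v1_5/eval/.

```python
from pathlib import Path


def find_missing_benchmarks(eval_root, benchmark_dirs):
    """Return the names in benchmark_dirs that have no subdirectory under eval_root."""
    root = Path(eval_root)
    return [name for name in benchmark_dirs if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumed subdirectory names, for illustration only.
    expected = ["vqav2", "gqa", "textvqa", "pope"]
    missing = find_missing_benchmarks("./playground/data/eval", expected)
    if missing:
        print("Missing benchmark directories:", ", ".join(missing))
    else:
        print("All listed benchmark directories are present.")
```

The check only confirms that each directory exists. It does not verify that the image files or annotation files inside are complete.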

Theoretical Basis

The standardized directory structure lets the evaluation scripts follow the same conventions across all supported benchmarks. Question files use JSONL format (one JSON object per line) with the following standard fields:

Field	Type	Description
`question_id`	int/str	Unique identifier for the question
`image`	str	Relative path to the image file
`text`	str	The question text (may include answer format instructions)
`category`	str	(Optional) Question category for per-category evaluation

This uniform format allows a single inference engine (model_vqa_loader.py) to process multiple benchmarks with only path changes, while benchmark-specific post-processing handles format conversion and metric computation.
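
To make the question-file format concrete, the sketch below reads a JSONL file and checks that every record carries the three required fields from the table above. It is an illustration only and does not reproduce how model_vqa_loader.py reads its input. The names `load_questions` and `REQUIRED_FIELDS` are hypothetical.

```python
import json

# Required fields from the table above; `category` is optional and not checked.
REQUIRED_FIELDS = ("question_id", "image", "text")


def load_questions(jsonl_path):
    """Read a question JSONL file and check each record for the standard fields."""
    questions = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = [key for key in REQUIRED_FIELDS if key not in record]
            if missing:
                raise ValueError(f"line {line_no}: missing fields {missing}")
            questions.append(record)
    return questions
```

A valid line in such a file could look like `{"question_id": 1, "image": "example.jpg", "text": "What is shown in the image?"}`, where the values are made up for illustration.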

Knowledge Sources

Doc - Evaluation - https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md

Domains

Evaluation
Data_Management

Related Pages

Implementation:Haotian_liu_LLaVA_Evaluation_Data_Download

Metadata

Property	Value
last_updated	2026-02-13 14:00 GMT
page_type	Principle
workflow	Benchmark_Evaluation
