Principle:Haotian liu LLaVA Evaluation Data Setup
Overview
Process for downloading and organizing evaluation benchmark datasets into a standardized directory structure required by LLaVA's evaluation pipeline.
Description
LLaVA evaluation requires 12+ benchmark datasets organized in a specific directory layout under ./playground/data/eval/. Each benchmark has its own subdirectory containing question JSONL files, annotation files, and image directories. The supported benchmarks include:
- VQAv2 - Visual Question Answering v2 (test2015 images + question JSONL)
- GQA - Graph Question Answering (images + evaluation scripts)
- TextVQA - Text-based Visual Question Answering (val JSON + train/val images)
- POPE - Polling-based Object Probing Evaluation (COCO annotation JSONs)
- MME - Multimodal Evaluation benchmark (images + eval_tool)
- MMBench - Multimodal Benchmark (TSV annotation + images)
- SEED-Bench - SEED Benchmark (images + video frames)
- LLaVA-Bench-in-the-Wild - Qualitative evaluation (questions JSONL + context JSONL + images)
- ScienceQA - Science Question Answering (images + pid_splits.json + problems.json)
- VizWiz - Visual Question Answering for the visually impaired (test.json + test images)
- Q-Bench - Quality Benchmark (question JSON + images)
- MM-Vet - Benchmark that vets large multimodal models on integrated capabilities (images + annotations)
The initial setup requires downloading a shared eval.zip from Google Drive, which contains custom annotations, evaluation scripts, and reference prediction files from LLaVA v1.5. This archive is extracted to ./playground/data/eval/ and provides the base directory structure. Individual benchmarks then require downloading their respective image datasets and placing them in the correct subdirectories.
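The extraction step can be sketched in Python as follows. This is a minimal illustration, not part of the LLaVA repository: it assumes eval.zip has already been downloaded to a local path (the Google Drive download is not shown), and the helper names `extract_eval_archive` and `missing_benchmark_dirs` are hypothetical. The list of expected subdirectory names is supplied by the caller, because the exact names depend on the archive contents.

```python
import zipfile
from pathlib import Path


def extract_eval_archive(archive_path, eval_root="./playground/data/eval"):
    """Extract the shared archive into eval_root (hypothetical helper).

    Returns the sorted names of the top-level subdirectories found afterwards.
    """
    eval_root = Path(eval_root)
    eval_root.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(eval_root)
    return sorted(p.name for p in eval_root.iterdir() if p.is_dir())


def missing_benchmark_dirs(eval_root, expected):
    """Return the names in `expected` that have no directory under eval_root."""
    eval_root = Path(eval_root)
    return [name for name in expected if not (eval_root / name).is_dir()]
```

Calling `missing_benchmark_dirs` with the benchmark subdirectories you intend to evaluate gives a quick check, before any evaluation run, of which image datasets still need to be downloaded and placed.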
Usage
This setup is required before running any benchmark evaluation. The datasets need only be downloaded once and can be reused across multiple evaluation runs with different model checkpoints. The standardized directory structure ensures that all evaluation shell scripts (under scripts/v1_5/eval/) can locate their required data files without modification.
Theoretical Basis
The standardized directory structure lets the evaluation scripts follow the same conventions across all supported benchmarks. Question files use JSONL format (one JSON object per line) with the following standard fields:
| Field | Type | Description |
|---|---|---|
| question_id | int/str | Unique identifier for the question |
| image | str | Relative path to the image file |
| text | str | The question text (may include answer format instructions) |
| category | str | (Optional) Question category for per-category evaluation |
This uniform format allows a single inference engine (model_vqa_loader.py) to process multiple benchmarks with only path changes, while benchmark-specific post-processing handles format conversion and metric computation.
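To make the question-file format concrete, here is a minimal sketch of a reader for such a JSONL file. It is not the loader used by model_vqa_loader.py. The function names `load_questions` and `group_by_category` are hypothetical, and treating `question_id`, `image`, and `text` as strictly required (with `category` optional) is an assumption drawn from the field table above.

```python
import json

# Fields treated as required here; `category` is optional (assumption based on the table).
REQUIRED_FIELDS = ("question_id", "image", "text")


def load_questions(path):
    """Read a question JSONL file and check each record for the standard fields."""
    questions = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = [field for field in REQUIRED_FIELDS if field not in record]
            if missing:
                raise ValueError(f"{path}:{line_number} is missing fields {missing}")
            questions.append(record)
    return questions


def group_by_category(questions):
    """Map each category to its question ids; records without one go under 'uncategorized'."""
    groups = {}
    for question in questions:
        key = question.get("category", "uncategorized")
        groups.setdefault(key, []).append(question["question_id"])
    return groups
```

Grouping by `category` mirrors the per-category evaluation mentioned in the table; benchmarks whose question files omit that field simply end up in a single group.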
Knowledge Sources
- Doc - Evaluation - https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md
Domains
- Evaluation
- Data_Management
Related Pages
Metadata
| Property | Value |
|---|---|
| last_updated | 2026-02-13 14:00 GMT |
| page_type | Principle |
| workflow | Benchmark_Evaluation |