Principle:Haotian liu LLaVA Evaluation Data Setup

From Leeroopedia

Overview

Process for downloading and organizing evaluation benchmark datasets into a standardized directory structure required by LLaVA's evaluation pipeline.

Description

LLaVA evaluation requires 12+ benchmark datasets organized in a specific directory layout under ./playground/data/eval/. Each benchmark has its own subdirectory containing question JSONL files, annotation files, and image directories. The supported benchmarks include:

  • VQAv2 - Visual Question Answering v2 (test2015 images + question JSONL)
  • GQA - Graph Question Answering (images + evaluation scripts)
  • TextVQA - Text-based Visual Question Answering (val JSON + train/val images)
  • POPE - Polling-based Object Probing Evaluation (COCO annotation JSONs)
  • MME - Multimodal Evaluation benchmark (images + eval_tool)
  • MMBench - Multimodal Benchmark (TSV annotation + images)
  • SEED-Bench - SEED Benchmark (images + video frames)
  • LLaVA-Bench-in-the-Wild - Qualitative evaluation (questions JSONL + context JSONL + images)
  • ScienceQA - Science Question Answering (images + pid_splits.json + problems.json)
  • VizWiz - Visual Question Answering for the visually impaired (test.json + test images)
  • Q-Bench - Quality Benchmark (question JSON + images)
  • MM-Vet - Benchmark for integrated multimodal capabilities (images + annotations)

The initial setup requires downloading a shared eval.zip from Google Drive, which contains custom annotations, evaluation scripts, and reference prediction files from LLaVA v1.5. This archive is extracted to ./playground/data/eval/ and provides the base directory structure. Individual benchmarks then require downloading their respective image datasets and placing them in the correct subdirectories.
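The extraction step described above can be sketched as follows. This is a minimal illustration, assuming eval.zip has already been downloaded manually to a local path; the function name extract_eval_archive and its parameters are hypothetical and not part of LLaVA. If the archive turns out to carry its own top-level folder, the target directory may need adjusting.

```python
import zipfile
from pathlib import Path


def extract_eval_archive(archive_path, eval_root="./playground/data/eval"):
    """Extract an already-downloaded eval.zip and list the top-level entries it provides.

    Hypothetical helper: the name and signature are illustrative only.
    """
    archive = Path(archive_path)
    root = Path(eval_root)
    # is_zipfile also returns False for a missing file, so this covers both cases.
    if not zipfile.is_zipfile(archive):
        raise ValueError(f"{archive} is not a readable zip archive")
    root.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(root)
        # The first path component of each member is a top-level entry.
        top_level = sorted(
            {Path(name).parts[0] for name in zf.namelist() if name.strip("/")}
        )
    return top_level
```

The returned list makes it easy to see which benchmark subdirectories the archive created before the per-benchmark image downloads are added.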

Usage

This setup is required before running any benchmark evaluation. The datasets need only be downloaded once and can be reused across multiple evaluation runs with different model checkpoints. The standardized directory structure ensures that all evaluation shell scripts (under scripts/v1_5/eval/) can locate their required data files without modification.
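Because the data is downloaded once and then reused, a quick check before each evaluation run can catch a missing or empty benchmark directory early. The sketch below is an assumption-laden illustration: find_missing_benchmarks is a hypothetical helper, and the subdirectory names passed to it must be taken from the actual extracted layout rather than from this example.

```python
from pathlib import Path


def find_missing_benchmarks(eval_root, benchmark_dirs):
    """Return the names in benchmark_dirs that are absent or empty under eval_root.

    Hypothetical helper: benchmark_dirs is supplied by the caller because the
    real subdirectory names depend on the extracted archive.
    """
    root = Path(eval_root)
    missing = []
    for name in benchmark_dirs:
        path = root / name
        # A directory that exists but holds nothing is treated as not set up.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

A typical call would pass ./playground/data/eval as eval_root together with the list of subdirectories the planned evaluation scripts read from, and stop if the returned list is not empty.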

Theoretical Basis

The standardized directory structure lets the evaluation scripts follow the same conventions across all supported benchmarks. Question files use JSONL format (one JSON object per line) with the following standard fields:

Field        Type     Description
question_id  int/str  Unique identifier for the question
image        str      Relative path to the image file
text         str      The question text (may include answer format instructions)
category     str      (Optional) Question category for per-category evaluation
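The field list above can be turned into a small validator for individual question lines. This is a sketch under assumptions: parse_question_line and the two field dictionaries are hypothetical names, and the check covers only the fields and types listed in the table, not any benchmark-specific extras.

```python
import json

# Required and optional fields, mirroring the table above.
REQUIRED_FIELDS = {"question_id": (int, str), "image": str, "text": str}
OPTIONAL_FIELDS = {"category": str}


def parse_question_line(line):
    """Parse one JSONL line and check it against the listed fields and types."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            raise ValueError(f"missing required field: {name}")
        if not isinstance(record[name], expected):
            raise ValueError(f"field {name!r} has an unexpected type")
    for name, expected in OPTIONAL_FIELDS.items():
        # Optional fields are only checked when present.
        if name in record and not isinstance(record[name], expected):
            raise ValueError(f"field {name!r} has an unexpected type")
    return record
```

For instance, a line such as {"question_id": 7, "image": "images/7.jpg", "text": "What is shown?"} (an invented record, not taken from any benchmark) passes, while a line without an image field raises ValueError.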

This uniform format allows a single inference engine (model_vqa_loader.py) to process multiple benchmarks with only path changes, while benchmark-specific post-processing handles format conversion and metric computation.
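To illustrate why the uniform format means only paths change between benchmarks, here is a minimal, model-free sketch of such a loop. It is not the code of model_vqa_loader.py: the function run_benchmark, its parameter names, the answer_fn callback, and the two-field output record are all assumptions made for this example.

```python
import json
from pathlib import Path


def run_benchmark(question_file, image_folder, answers_file, answer_fn):
    """Read a question JSONL file, answer each entry, and write an answers JSONL file.

    answer_fn(image_path, text) stands in for model inference (hypothetical).
    Returns the number of questions processed.
    """
    image_root = Path(image_folder)
    out_path = Path(answers_file)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with open(question_file, encoding="utf-8") as fin, \
            open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            if not line.strip():
                continue  # tolerate blank lines
            question = json.loads(line)
            # The image field is relative, so it is resolved against image_folder.
            answer = answer_fn(image_root / question["image"], question["text"])
            # Assumed output schema, chosen for illustration only.
            record = {"question_id": question["question_id"], "text": answer}
            fout.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Switching benchmarks in this sketch means passing a different question_file, image_folder, and answers_file; the loop itself is unchanged, and any benchmark-specific conversion would happen on the answers file afterwards.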

Knowledge Sources

Domains

  • Evaluation
  • Data_Management

Metadata

Property      Value
last_updated  2026-02-13 14:00 GMT
page_type     Principle
workflow      Benchmark_Evaluation
