
Workflow:Haotian Liu LLaVA Benchmark Evaluation

From Leeroopedia
Knowledge Sources
Domains LLMs, Evaluation, Multimodal, Benchmarking
Last Updated 2026-02-13 23:00 GMT

Overview

Systematic evaluation of LLaVA models across 12+ multimodal benchmarks using multi-GPU parallel inference, benchmark-specific formatting, and automated metric computation.

Description

This workflow covers the complete evaluation pipeline for assessing LLaVA model quality across a comprehensive suite of vision-language benchmarks. It supports short-answer VQA (VQAv2, TextVQA), hallucination detection (POPE), multi-choice reasoning (MMBench, ScienceQA, SEED-Bench), perception evaluation (MME), open-ended generation assessment (LLaVA-Bench, MM-Vet), and visual quality understanding (Q-Bench).

The evaluation approach uses greedy decoding (temperature=0) for reproducibility, multi-GPU chunk-based parallelism for throughput, and benchmark-specific answer formatting for compatibility with official evaluation servers and metrics.

Usage

Execute this workflow when you have a trained LLaVA model checkpoint and want to measure its performance on standard multimodal benchmarks, compare it against baselines, or prepare submissions for public leaderboards.

Execution Steps

Step 1: Download Evaluation Data

Download the official evaluation data package (eval.zip) containing custom annotations, question files, and reference predictions. Extract it to the playground data directory. For each specific benchmark, download the corresponding image datasets and place them in the expected directory structure.

Key considerations:

  • The eval.zip package provides the standardized question formats and directory structure
  • Each benchmark has its own image source (COCO, custom datasets, etc.)
  • Some benchmarks require additional tools (e.g., GQA evaluation scripts)
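Before launching any evaluation, it can save a failed multi-hour run to verify the extracted layout up front. The sketch below is a minimal, hypothetical check: the root path and file names follow the structure eval.zip typically unpacks into, but they are assumptions — adjust them to your local setup.

```python
import os

# Assumed root for the extracted eval.zip contents (adjust to your setup).
EVAL_ROOT = "./playground/data/eval"

# Hypothetical sample of expected question files; extend per benchmark.
REQUIRED = [
    "textvqa/llava_textvqa_val_v051_ocr.jsonl",
    "pope/llava_pope_test.jsonl",
    "scienceqa/llava_test_CQM-A.json",
]

def check_eval_data(root=EVAL_ROOT, required=REQUIRED):
    """Return the subset of expected files missing under `root`."""
    return [p for p in required if not os.path.exists(os.path.join(root, p))]

missing = check_eval_data()
if missing:
    print("Missing evaluation files:", missing)
else:
    print("Evaluation data layout looks complete.")
```

Run this once after extraction; an empty `missing` list means the question files for these benchmarks are where the scripts expect them.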

Step 2: Run Multi-GPU Inference

Execute the benchmark-specific evaluation script, which splits the question file into chunks across the available GPUs, runs parallel inference using the batch VQA loader, and concatenates the results. Each GPU processes its chunk independently using the same model checkpoint.

What happens:

  • Questions are split into N equal chunks (one per GPU)
  • Each GPU loads the model and processes its chunk via model_vqa_loader.py
  • The DataLoader-based inference pipeline applies conversation templates and image preprocessing
  • Greedy decoding (temperature=0) ensures deterministic outputs
  • Per-chunk answer files are concatenated into a single merged result file
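The chunking scheme above can be sketched as follows. This is a minimal illustration of the split-by-index pattern (each worker receives its chunk index `k` out of `n` total chunks); the function names mirror the helpers commonly used in the LLaVA evaluation scripts, but treat this as a sketch rather than the exact implementation.

```python
import math

def split_list(lst, n):
    """Split lst into n contiguous chunks of (roughly) equal size."""
    chunk_size = math.ceil(len(lst) / n)
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def get_chunk(lst, n, k):
    """Return chunk k of n -- the slice of work assigned to GPU/process k."""
    return split_list(lst, n)[k]

# Example: 10 questions distributed across 4 GPUs.
questions = list(range(10))
chunks = [get_chunk(questions, 4, k) for k in range(4)]
# chunks -> [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

Because every chunk is a contiguous, non-overlapping slice, concatenating the per-chunk answer files in chunk order reconstructs results for the full question set.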

Step 3: Format Results for Submission

Convert the raw model outputs into the format required by each benchmark's evaluation server or metric script. Different benchmarks have different submission requirements: VQAv2 needs a specific JSON format, MMBench requires TSV with option letters, and SEED-Bench has its own submission structure.

Key considerations:

  • VQAv2: Use convert_vqav2_for_submission.py to generate the upload format
  • GQA: Use convert_gqa_for_eval.py to reformat predictions
  • MMBench: Use convert_mmbench_for_submission.py for leaderboard submission
  • Some benchmarks (TextVQA, POPE, ScienceQA) compute metrics locally
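As a concrete example of such a conversion, the sketch below reshapes merged JSONL answers into the VQAv2 upload format (a single JSON list of `{"question_id": int, "answer": str}` records). The input field names (`question_id`, `text`) follow the conventions of the loader's output files, but this is an assumed, simplified stand-in for convert_vqav2_for_submission.py, not the script itself.

```python
import json

def convert_for_vqav2_submission(answers_jsonl, out_json):
    """Reshape merged JSONL answers into the VQAv2 submission format."""
    results = []
    with open(answers_jsonl) as f:
        for line in f:
            rec = json.loads(line)
            results.append({
                "question_id": int(rec["question_id"]),
                "answer": rec["text"].strip(),
            })
    # The evaluation server expects one JSON array, not JSONL.
    with open(out_json, "w") as f:
        json.dump(results, f)
    return results
```

The key transformations are typical of these converters: coerce IDs to the expected type, strip whitespace from the decoded answer, and emit a single JSON array rather than line-delimited records.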

Step 4: Compute Metrics

Run the benchmark-specific evaluation scripts to compute accuracy, F1 scores, or other metrics. Some benchmarks evaluate locally (TextVQA, POPE, ScienceQA, GQA), while others require submission to external evaluation servers (VQAv2, MMBench, VizWiz).

What happens:

  • eval_textvqa.py: Computes TextVQA accuracy using the M4C answer normalizer
  • eval_pope.py: Computes POPE metrics (accuracy, precision, recall, F1, yes-ratio)
  • eval_science_qa.py: Computes ScienceQA accuracy with per-category breakdown
  • GPT-4 evaluation scripts score open-ended benchmarks (LLaVA-Bench, MM-Vet)
  • External servers return scores for VQAv2, MMBench, and VizWiz
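The POPE metrics listed above reduce to standard binary-classification arithmetic over yes/no answers. The following is a minimal sketch (not the eval_pope.py script itself), assuming the model's free-form answer has already been reduced to a leading "yes"/"no" string upstream:

```python
def pope_metrics(predictions, labels):
    """Compute POPE-style metrics from yes/no prediction and label strings."""
    to_bin = lambda s: 1 if s.strip().lower().startswith("yes") else 0
    preds = [to_bin(p) for p in predictions]
    gts = [to_bin(g) for g in labels]
    tp = sum(p == 1 and g == 1 for p, g in zip(preds, gts))
    fp = sum(p == 1 and g == 0 for p, g in zip(preds, gts))
    fn = sum(p == 0 and g == 1 for p, g in zip(preds, gts))
    tn = sum(p == 0 and g == 0 for p, g in zip(preds, gts))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": sum(preds) / len(preds),  # hallucination bias indicator
    }
```

The yes-ratio is reported alongside F1 because a model that answers "yes" indiscriminately can score well on recall while hallucinating objects; a yes-ratio far above the ground-truth rate flags that failure mode.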

Step 5: Aggregate and Report Results

Collect metrics from all benchmarks into a unified results summary. For GPT-4-judged benchmarks, use the score summarization tool to aggregate per-category scores. Compare results against published baselines and reference predictions included in the evaluation data package.

Key considerations:

  • summarize_gpt_review.py aggregates GPT-4 evaluation scores by category
  • Reference predictions from LLaVA v1.5 are included for comparison
  • Results can be organized in a standardized table format across all benchmarks
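The GPT-4-judged aggregation step can be sketched as below. This is an assumed, simplified model of what a summarizer like summarize_gpt_review.py does: each review carries a category and a pair of judge scores (reference answer first, evaluated model second), and the per-category summary reports the model's average as a percentage of the reference's average. Field names here are illustrative.

```python
from collections import defaultdict

def summarize_reviews(reviews):
    """Aggregate paired judge scores by category.

    Each review: {"category": str, "tuple": [ref_score, model_score]}.
    Returns {category: {"relative": %, "ref_avg": ..., "model_avg": ...}}.
    """
    by_cat = defaultdict(list)
    for r in reviews:
        by_cat[r["category"]].append(r["tuple"])
    summary = {}
    for cat, pairs in by_cat.items():
        ref = sum(p[0] for p in pairs) / len(pairs)
        mod = sum(p[1] for p in pairs) / len(pairs)
        summary[cat] = {
            "relative": round(100 * mod / ref, 1),
            "ref_avg": round(ref, 2),
            "model_avg": round(mod, 2),
        }
    return summary
```

Reporting a relative score normalizes away the judge's scale, which makes per-category numbers comparable across benchmarks and against the bundled LLaVA v1.5 reference predictions.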

Execution Diagram

GitHub URL

Workflow Repository