Principle:Haotian liu LLaVA Result Format Conversion

Overview

This principle covers the process of converting model output files into the benchmark-specific submission formats required by evaluation servers and local scoring tools.

Description

Different benchmarks require different submission formats, and LLaVA provides converter scripts that transform raw answer JSONL files into these required formats. The conversion pipeline bridges the gap between LLaVA's uniform answer JSONL output and the diverse format requirements of external evaluation platforms.

The primary converters include:

VQAv2 converter (convert_vqav2_for_submission.py) - Reads the merged answer JSONL, applies VQA answer normalization via EvalAIAnswerProcessor, and produces a JSON array formatted for EvalAI submission. The processor normalizes answers by lowercasing, stripping punctuation, converting number words to digits, removing articles, normalizing contractions (e.g., "dont" to "don't"), and trimming whitespace.

MMBench converter (convert_mmbench_for_submission.py) - Joins model answer predictions with the original annotation TSV file and produces an Excel spreadsheet (.xlsx) for submission to the MMBench evaluation server at OpenCompass.

VizWiz converter (convert_vizwiz_for_submission.py) - Formats answers for submission to the VizWiz challenge on EvalAI.

GQA converter (convert_gqa_for_eval.py) - Prepares answers for local GQA evaluation.

SEED converter (convert_seed_for_submission.py) - Formats predictions for the SEED-Bench leaderboard.

MM-Vet converter (convert_mmvet_for_eval.py) - Formats answers for the MM-Vet evaluation notebook.
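
As a rough illustration of the VQAv2-style conversion described above, the sketch below turns answer JSONL lines into a list of `{"question_id", "answer"}` entries. It is not the code of convert_vqav2_for_submission.py: the function name `jsonl_to_submission` is hypothetical, the input field names `question_id` and `text` are assumptions about the answer JSONL layout, and filling unanswered questions with an empty string is an assumed policy.

```python
import json


def jsonl_to_submission(jsonl_lines, all_question_ids, normalize=lambda s: s):
    """Build a list of {"question_id", "answer"} dicts from answer JSONL lines.

    jsonl_lines: iterable of strings, one JSON object per line (assumed fields:
                 "question_id" and "text").
    all_question_ids: every question ID the submission is expected to cover.
    normalize: callable applied to each raw answer (identity by default).
    """
    answers = {}
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines left over from merging
        record = json.loads(line)
        answers[record["question_id"]] = normalize(record["text"])

    # One entry per expected question; unanswered ones get an empty string
    # (an assumption made for this sketch).
    return [
        {"question_id": qid, "answer": answers.get(qid, "")}
        for qid in all_question_ids
    ]
```

The returned list can be written out with `json.dump` to obtain a single JSON array. Passing the normalizer in as a parameter keeps the format conversion separate from the answer-cleaning rules.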

Usage

Use these converters after running batch VQA inference to prepare results for:

Online submission - EvalAI (VQAv2, VizWiz), OpenCompass (MMBench), SEED-Bench leaderboard
Local evaluation - GQA eval scripts, MM-Vet Jupyter notebooks

The converters are typically invoked automatically at the end of evaluation shell scripts (e.g., vqav2.sh calls convert_vqav2_for_submission.py after merging chunk files).
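
The merge step that precedes conversion is done in the shell scripts themselves. Purely for illustration, an equivalent in Python might look like the sketch below, where `merge_chunks` is a hypothetical helper and the chunk file layout (one JSON object per line) is an assumption.

```python
def merge_chunks(chunk_paths, merged_path):
    """Concatenate per-chunk JSONL files into one file, skipping blank lines.

    Returns the number of answer lines written.
    """
    count = 0
    with open(merged_path, "w", encoding="utf-8") as out:
        for path in chunk_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if line.strip():
                        # Re-add the newline so a chunk that lacks a trailing
                        # newline does not fuse with the next chunk's first line.
                        out.write(line.rstrip("\n") + "\n")
                        count += 1
    return count
```

The merged file is then the single input handed to the benchmark-specific converter.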

Theoretical Basis

VQA Answer Normalization

The VQA answer normalization follows the standard VQA evaluation protocol implemented in the EvalAIAnswerProcessor class (from m4c_evaluator.py). The normalization pipeline applies these transformations in order:

Word tokenization - Lowercase, remove commas and question marks, separate possessives
Whitespace normalization - Replace newlines and tabs with spaces
Punctuation processing - Remove or replace punctuation characters (; / [ ] " { } ( ) = + \ _ - > < @ ` , ? !)
Digit/article processing - Convert number words to digits (e.g., "three" to "3"), remove articles ("a", "an", "the"), and normalize contractions (e.g., "dont" to "don't")
Final cleanup - Strip leading/trailing whitespace

This normalization ensures fair comparison across models by removing superficial formatting differences in answers.
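
The five steps can be sketched in a few lines of Python. This is a simplified stand-in, not the EvalAIAnswerProcessor implementation: the function name `normalize_answer` is hypothetical, the lookup tables are tiny illustrative subsets, and every punctuation character is simply replaced with a space, which is an assumption that glosses over any context-dependent handling in the real class.

```python
# Small illustrative tables; the real processor is assumed to use larger ones.
NUMBER_WORDS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                "five": "5", "six": "6", "seven": "7", "eight": "8",
                "nine": "9", "ten": "10"}
ARTICLES = {"a", "an", "the"}
CONTRACTIONS = {"dont": "don't", "cant": "can't", "isnt": "isn't"}
PUNCTUATION = ';/[]"{}()=+\\_-><@`,?!'


def normalize_answer(answer):
    # 1. Word tokenization: lowercase, drop commas and question marks,
    #    separate possessives.
    text = answer.lower().replace(",", "").replace("?", "").replace("'s", " 's")
    # 2. Whitespace normalization: newlines and tabs become spaces.
    text = text.replace("\n", " ").replace("\t", " ").strip()
    # 3. Punctuation processing (simplified: each character becomes a space).
    for ch in PUNCTUATION:
        text = text.replace(ch, " ")
    # 4. Digit/article processing: map number words, drop articles,
    #    restore apostrophes in known contractions.
    words = []
    for word in text.split():
        word = NUMBER_WORDS.get(word, word)
        if word in ARTICLES:
            continue
        words.append(CONTRACTIONS.get(word, word))
    # 5. Final cleanup: rejoin with single spaces and strip the ends.
    return " ".join(words).strip()
```

With this sketch, answers such as "Three dogs?" and "three dogs" map to the same string, which is the kind of superficial difference the normalization is meant to remove.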

Format Requirements

Benchmark	Submission Target	Required Format
VQAv2	EvalAI server	JSON array of `{"question_id", "answer"}`
MMBench	OpenCompass	Excel spreadsheet with prediction column
VizWiz	EvalAI server	JSON with normalized answers
GQA	Local eval.py	Reformatted answer file
SEED-Bench	Leaderboard	JSONL with predictions
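
To make the MMBench row more concrete, the sketch below joins predictions onto annotation rows, which is the core of that conversion. It is only an illustration under stated assumptions: `join_predictions` is a hypothetical helper, the TSV key column `index`, the answer fields `question_id` and `text`, and the output column name `prediction` are all assumed, and writing the joined rows to an .xlsx file (which needs a spreadsheet library) is left out.

```python
import csv
import io
import json


def join_predictions(tsv_text, jsonl_lines, key="index"):
    """Attach a "prediction" column to each annotation row.

    tsv_text: contents of the annotation TSV, with a header row.
    jsonl_lines: answer JSONL lines (assumed fields: "question_id", "text").
    key: name of the TSV column that matches "question_id" (assumed "index").
    """
    predictions = {}
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            # TSV cells are strings, so compare IDs as strings.
            predictions[str(record["question_id"])] = record["text"]

    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        row = dict(row)
        # Rows without a model answer get an empty prediction in this sketch.
        row["prediction"] = predictions.get(row[key], "")
        rows.append(row)
    return rows
```

Keeping every annotation column and appending the prediction preserves the original row order and content, so the evaluation side can match each prediction to its question.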

Knowledge Sources

Repo - LLaVA - https://github.com/haotian-liu/LLaVA

Domains

Evaluation
Data_Processing

Related Pages

Implementation:Haotian_liu_LLaVA_Convert_Results_For_Submission

Metadata

Property	Value
last_updated	2026-02-13 14:00 GMT
page_type	Principle
workflow	Benchmark_Evaluation
