Principle: OpenCompass VLMEvalKit Results Summarization
| Field | Value |
|---|---|
| Source | Repo |
| Domain | Vision, Evaluation, Data_Processing |
Overview
An aggregation pattern that collects evaluation scores across multiple model-benchmark pairs into a unified comparison table.
Description
After evaluating multiple model × dataset combinations, VLMEvalKit provides utilities to aggregate the results into summary tables. The `get_score()` function reads per-benchmark result files (`_acc.csv`, `_score.csv`, `_score.json`) and extracts the relevant metric for each benchmark. The `gen_table()` function iterates over all model × dataset pairs, collects the scores, and produces a formatted DataFrame for cross-model comparison. This enables systematic benchmarking of VLMs across dozens of benchmarks.
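The aggregation pattern can be sketched as follows. This is a minimal, hypothetical illustration, not VLMEvalKit's actual implementation: the `Overall` column name and the in-memory `results` dict are assumptions for the example; the real toolkit reads result files from disk.

```python
# Hypothetical sketch of the score-aggregation pattern (not the real VLMEvalKit code).
import io

import pandas as pd


def get_score(result_csv: str) -> float:
    """Extract the primary metric from a per-benchmark *_acc.csv.

    Assumes the file has an 'Overall' column holding the headline number.
    """
    df = pd.read_csv(io.StringIO(result_csv))
    return float(df["Overall"].iloc[0])


def gen_table(results: dict) -> pd.DataFrame:
    """Build a models x benchmarks score matrix.

    `results` maps (model, dataset) -> raw CSV text of that result file.
    Missing pairs stay NaN, so partially evaluated runs still tabulate.
    """
    models = sorted({m for m, _ in results})
    datasets = sorted({d for _, d in results})
    table = pd.DataFrame(index=models, columns=datasets, dtype=float)
    for (model, dataset), csv_text in results.items():
        table.loc[model, dataset] = get_score(csv_text)
    return table


# Toy result files for two models on two benchmarks (illustrative values only).
results = {
    ("model-a", "MMBench"): "Overall\n64.3\n",
    ("model-a", "MME"): "Overall\n1510.7\n",
    ("model-b", "MMBench"): "Overall\n60.6\n",
    ("model-b", "MME"): "Overall\n1487.5\n",
}
print(gen_table(results))
```

The key design point is that rows and columns are derived from the keys actually present, so the same code handles any subset of models and benchmarks.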
Usage
Use after completing all evaluations. Run `python scripts/summarize.py --model model1 model2 --data dataset1 dataset2`, or call `get_score()` / `gen_table()` programmatically.
Theoretical Basis
Result aggregation: collect per-benchmark metrics into a cross-model comparison matrix. Each benchmark reports its own metric type (accuracy, score, F1, etc.), and the summarizer handles these format differences transparently. The output matrix has:
- Rows: Models under evaluation
- Columns: Benchmark datasets
- Cells: The primary metric for each model-benchmark pair
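Handling per-benchmark format differences typically comes down to dispatching on the result file's suffix. The sketch below is an assumed illustration of that idea; the column position and the `"score"` JSON key are placeholders, not VLMEvalKit's actual schema.

```python
# Hypothetical dispatch on result-file suffix (illustrative, not the real API).
import io
import json

import pandas as pd


def extract_metric(filename: str, content: str) -> float:
    """Pick the primary metric based on the result file's format."""
    if filename.endswith(("_acc.csv", "_score.csv")):
        df = pd.read_csv(io.StringIO(content))
        # Assumption: the last column holds the headline number.
        return float(df.iloc[0, -1])
    if filename.endswith("_score.json"):
        # Assumption: the JSON carries the metric under a "score" key.
        return float(json.loads(content)["score"])
    raise ValueError(f"unrecognized result file: {filename}")
```

Centralizing the dispatch in one function keeps the table-building loop agnostic to which metric each benchmark uses.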