Principle: OpenCompass VLMEvalKit Results Summarization

From Leeroopedia
Field Value
Source Repo
Domain Vision, Evaluation, Data_Processing

Overview

An aggregation pattern that collects evaluation scores across multiple model-benchmark pairs into a unified comparison table.

Description

After evaluating multiple model × dataset combinations, VLMEvalKit provides utilities to aggregate the results into summary tables. The get_score() function reads per-benchmark result files (_acc.csv, _score.csv, _score.json) and extracts the relevant metric for each benchmark. The gen_table() function iterates over all model × dataset pairs, collects their scores, and produces a formatted DataFrame for cross-model comparison. This enables systematic benchmarking of VLMs across dozens of benchmarks.
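The per-file dispatch can be pictured as a small helper that branches on the result-file suffix. This is a simplified sketch, not VLMEvalKit's actual get_score(): the real function handles more layouts and knows each benchmark's specific metric name, and the "Overall" key/column used below is an assumption.

```python
import json

def read_score(result_file):
    """Extract the primary metric from a per-benchmark result file.

    Sketch only: assumes the metric lives under an "Overall" key
    (JSON) or column (CSV); the real get_score() is benchmark-aware.
    """
    if result_file.endswith("_score.json"):
        with open(result_file) as f:
            return float(json.load(f)["Overall"])
    if result_file.endswith(("_acc.csv", "_score.csv")):
        with open(result_file) as f:
            header = f.readline().strip().split(",")
            values = f.readline().strip().split(",")
        # Pair header names with the first data row, then pick the metric.
        return float(dict(zip(header, values))["Overall"])
    raise ValueError(f"unrecognized result file: {result_file}")
```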

Usage

Use after completing all evaluations. Run python scripts/summarize.py --model model1 model2 --data dataset1 dataset2, or call get_score() / gen_table() programmatically.
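The programmatic path amounts to a double loop over models and datasets. The collect_scores helper below and its get_score argument are illustrative stand-ins, not the toolkit's real signatures (the actual get_score() reads result files from the evaluation work directory):

```python
def collect_scores(models, datasets, get_score):
    """Collect one primary metric per (model, dataset) pair.

    `get_score` is a stand-in callable; in VLMEvalKit the score comes
    from per-benchmark result files on disk (assumption about layout).
    """
    results = {}
    for model in models:
        for dataset in datasets:
            results[(model, dataset)] = get_score(model, dataset)
    return results
```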

Theoretical Basis

Result aggregation: collect per-benchmark metrics into a cross-model comparison matrix. Each benchmark has its own metric type (accuracy, score, F1, etc.), and the summarizer handles the format differences transparently. The output matrix has:

  • Rows: Models under evaluation
  • Columns: Benchmark datasets
  • Cells: The primary metric for each model-benchmark pair
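The matrix above can be sketched with pandas. This is a minimal stand-in for gen_table(): it takes an in-memory mapping of (model, benchmark) to score, whereas the real function gathers scores from result files; missing pairs simply show up as NaN.

```python
import pandas as pd

def gen_table(scores):
    """Build a model x benchmark comparison matrix.

    `scores` maps (model, benchmark) -> primary metric. Rows are
    models, columns are benchmarks, and pairs with no score are NaN.
    """
    models = sorted({m for m, _ in scores})
    benchmarks = sorted({b for _, b in scores})
    data = {b: [scores.get((m, b)) for m in models] for b in benchmarks}
    return pd.DataFrame(data, index=models)
```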
