Implementation:EvolvingLMMs Lab Lmms eval Capability Utils
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Benchmark Evaluation, LLM Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
GPT-based evaluation utilities for the Capability benchmark that assesses multimodal models on video/image perception tasks.
Description
This module provides the evaluation infrastructure for the Capability benchmark, an assessment framework for video and image understanding. It includes document-processing functions that convert dataset instances into model inputs, result-processing functions that format model outputs, and a GPT-based evaluation system that scores model responses by precision, recall, and F1 across 13 perception tasks (including object category, object number, object color, spatial relation, scene, camera angle, camera movement, OCR, style, character identification, events, and actions). The evaluator issues concurrent API calls to GPT models (OpenAI or Azure) to judge whether model-generated captions accurately describe the visual content, using the ground-truth annotations as the reference.
Usage
Use this module when evaluating multimodal models on the Capability benchmark. The main workflow is: (1) prepare inputs with capability_doc_to_visual and capability_doc_to_text, (2) collect model predictions, (3) format each prediction with capability_process_results, and (4) compute the final metrics with the aggregate functions. The Evaluator class handles the GPT-based scoring, with automatic retries, resumption from saved results, and format validation.
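The concurrent scoring and retry behavior mentioned above can be pictured with the following sketch. It is not the module's implementation: `FlakyJudge`, `score_with_retry`, and `evaluate_concurrently` are hypothetical names, the judge is a local stand-in that never calls an API, and the use of a thread pool is an assumption. Only the parameter names `num_process` and `max_retry_times` are taken from the `Evaluator` signature.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class FlakyJudge:
    """Hypothetical stand-in for a GPT call: fails the first `failures` attempts per sample."""

    def __init__(self, failures=2):
        self.failures = failures
        self.attempts = {}
        self.lock = threading.Lock()

    def __call__(self, sample):
        # Count attempts per sample under a lock so worker threads do not race.
        with self.lock:
            n = self.attempts.get(sample["file_id"], 0) + 1
            self.attempts[sample["file_id"]] = n
        if n <= self.failures:
            raise RuntimeError("simulated transient API error")
        # Toy scoring rule (assumption): exact match is correct, anything else incorrect.
        return 1 if sample["caption"] == sample["annotation"] else -1


def score_with_retry(judge, sample, max_retry_times=10):
    """Retry the judge until it succeeds; return None once the retry budget is spent."""
    for _ in range(max_retry_times):
        try:
            return judge(sample)
        except RuntimeError:
            continue
    return None


def evaluate_concurrently(judge, samples, num_process=4, max_retry_times=10):
    """Score all samples in parallel and map each file_id to its score."""
    with ThreadPoolExecutor(max_workers=max(1, num_process)) as pool:
        scores = pool.map(lambda s: score_with_retry(judge, s, max_retry_times), samples)
        return {s["file_id"]: score for s, score in zip(samples, scores)}


samples = [
    {"file_id": "a", "caption": "x", "annotation": "x"},
    {"file_id": "b", "caption": "x", "annotation": "y"},
]
scores = evaluate_concurrently(FlakyJudge(failures=2), samples, num_process=2)
print(scores)
```

In this sketch a sample whose retries are exhausted is mapped to `None` rather than raising, so one failing sample does not abort the whole run. How the real evaluator treats such samples is not shown here.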
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/tasks/capability/utils.py
- Lines: 1-626
Signature
# Document conversion functions
def capability_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
    """Extract visual input (image or video path) from document."""

def capability_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Extract prompt text based on data type (image/video)."""

# Result processing functions
def capability_process_results(doc, results):
    """Process single prediction result."""

def capability_aggregate_inference_result(results, args):
    """Save inference results to JSONL file."""

def capability_aggregate_results(results, args):
    """Main aggregation using GPT-based evaluation."""

def capability_aggregate_precision(results, args):
    """Aggregate and return precision metric."""

def capability_aggregate_recall(results, args):
    """Aggregate and return recall metric."""

def capability_aggregate_f1score(results, args):
    """Aggregate and return F1 score metric."""

# Evaluator class
class Evaluator:
    def __init__(self, task, results, save_path, eval_model, headers,
                 num_process=0, max_allow_missing=5, max_retry_times=10,
                 auto_resume=True, strict_match=True):
        """Initialize evaluator with task-specific settings."""

    def evaluate_scores(self):
        """Evaluate all samples using GPT API with retry logic."""

    def calculate_metric(self, score_dict):
        """Calculate precision, recall, hit_rate, and F1 from scores."""

    def call_gpt(self, system_prompt, user_prompt):
        """Call GPT API with prompts."""
Import
from lmms_eval.tasks.capability.utils import (
    capability_doc_to_visual,
    capability_doc_to_text,
    capability_process_results,
    capability_aggregate_precision,
    capability_aggregate_recall,
    capability_aggregate_f1score,
    Evaluator,
)
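The `save_path` and `auto_resume` parameters of `Evaluator` suggest that scores are persisted and reloaded between runs. The sketch below illustrates one way such a resume step could work, assuming one JSON record per line with `file_id` and `score` fields. The helper names and the record layout are hypothetical and are not taken from the module.

```python
import json
import os
import tempfile


def load_saved_scores(save_path):
    """Read previously saved scores from a JSONL file (assumed record layout)."""
    saved = {}
    if os.path.exists(save_path):
        with open(save_path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:
                    record = json.loads(line)
                    saved[record["file_id"]] = record["score"]
    return saved


def evaluate_with_resume(samples, save_path, judge, auto_resume=True):
    """Score only the samples that are not already present in save_path."""
    scores = load_saved_scores(save_path) if auto_resume else {}
    mode = "a" if auto_resume else "w"  # append when resuming, overwrite otherwise
    with open(save_path, mode, encoding="utf-8") as f:
        for sample in samples:
            if sample["file_id"] in scores:
                continue  # already scored in an earlier run
            score = judge(sample)
            scores[sample["file_id"]] = score
            f.write(json.dumps({"file_id": sample["file_id"], "score": score}) + "\n")
    return scores


calls = []


def counting_judge(sample):
    """Hypothetical judge that records which samples it was asked to score."""
    calls.append(sample["file_id"])
    return 1


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "action.jsonl")
    all_samples = [{"file_id": "video_001"}, {"file_id": "video_002"}]
    first = evaluate_with_resume(all_samples[:1], path, counting_judge)
    second = evaluate_with_resume(all_samples, path, counting_judge)

print(calls)
```

The second call scores only `video_002`, because `video_001` is read back from the file written by the first call.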
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document with file_id, file_path, data_type (image/video), annotation, task |
| results | list | Yes | List containing single prediction string from model |
| args | Namespace | Yes | Arguments carrying the model name and the logging settings used to generate output files |
| lmms_eval_specific_kwargs | dict | No | Contains image_prompt/video_prompt templates |
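As an illustration of how `lmms_eval_specific_kwargs` might drive prompt selection, the sketch below picks `image_prompt` or `video_prompt` according to the document's `data_type`. The function name `sketch_doc_to_text`, the empty-string fallback, and the error on unknown types are assumptions made for this example. The module's actual logic may differ.

```python
def sketch_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Hypothetical prompt selection keyed on doc["data_type"]."""
    kwargs = lmms_eval_specific_kwargs or {}
    data_type = doc["data_type"]
    if data_type not in ("image", "video"):
        raise ValueError(f"unsupported data_type: {data_type!r}")
    # Looks up "image_prompt" or "video_prompt"; falls back to an empty prompt (assumption).
    return kwargs.get(f"{data_type}_prompt", "")


doc = {"file_id": "video_001", "data_type": "video"}
prompt = sketch_doc_to_text(doc, {"video_prompt": "Describe this video:"})
print(prompt)
```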
Outputs
| Name | Type | Description |
|---|---|---|
| doc_to_visual return | list | List with single PIL Image or video file path |
| doc_to_text return | str | Formatted prompt text for the task |
| process_results return | dict | Dict with 4 metric keys mapping to response dict |
| aggregate functions return | float | Precision, recall, or F1 score (0-100 scale) |
| Evaluator.evaluate_scores return | dict | Dict mapping file_id to score (1, 0, -1, or list) |
| Evaluator.calculate_metric return | dict | Dict with precision, recall, hit_rate, f1_score |
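The shape of the `process_results` return value can be sketched as follows. The four keys are the ones listed in this page's usage examples. The fields of the response dict (`file_id`, `caption`, `annotation`, `task`) are an assumption based on the records passed to `Evaluator`, and `sketch_process_results` is a hypothetical name.

```python
METRIC_KEYS = (
    "capability_inference_result",
    "capability_precision",
    "capability_recall",
    "capability_f1_score",
)


def sketch_process_results(doc, results):
    """Hypothetical: wrap one prediction in a response dict shared by all metric keys."""
    response = {
        "file_id": doc["file_id"],
        "caption": results[0].strip(),   # results holds a single prediction string
        "annotation": doc["annotation"],
        "task": doc["task"],
    }
    return {key: response for key in METRIC_KEYS}


doc = {
    "file_id": "video_001",
    "annotation": "A person walking in the park",
    "task": "action",
}
processed = sketch_process_results(doc, ["A person is walking through a park"])
print(sorted(processed))
```

Mapping every metric key to the same response lets each aggregate function receive the full set of predictions and run (or reuse) the GPT-based evaluation on its own.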
Usage Examples
# Example 1: Document processing
doc = {
    "file_id": "video_001",
    "file_path": "hf://dataset/videos/sample.mp4",
    "data_type": "video",
    "annotation": "A person walking in the park",
    "task": "action"
}
# Get visual input
visual = capability_doc_to_visual(doc) # Returns [file_path]
# Get text prompt
text = capability_doc_to_text(doc, {"video_prompt": "Describe this video:"})
# Returns: "Describe this video:"
# Example 2: Process model predictions
results = ["A person is walking through a park"]
processed = capability_process_results(doc, results)
# Returns dict with keys: capability_inference_result, capability_precision,
# capability_recall, capability_f1_score
# Example 3: Using Evaluator for GPT-based scoring
from lmms_eval.tasks.capability.utils import Evaluator
results_list = [
    {
        "file_id": "video_001",
        "caption": "A person walking in the park",
        "annotation": "A person walks in the park",
        "task": "action"
    }
]
evaluator = Evaluator(
    task="action",
    results=results_list,
    save_path="./eval_results/action.jsonl",
    eval_model="gpt-4",
    headers={"Authorization": "Bearer YOUR_KEY"},
    num_process=8,     # Use 8 parallel workers
    auto_resume=True   # Resume from saved results
)
# Evaluate with GPT
score_dict = evaluator.evaluate_scores()
# Returns: {"video_001": 1} # 1=correct, 0=no answer, -1=incorrect
# Calculate final metrics
metrics = evaluator.calculate_metric(score_dict)
# Returns: {"precision": 100.0, "recall": 100.0, "hit_rate": 100.0, "f1_score": 100.0}
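To make the relationship between the per-sample scores and the four reported metrics concrete, here is a sketch under stated assumptions: a score of 1 is correct, -1 is incorrect, 0 means the caption did not address the item, and list-valued scores are flattened. Precision is taken over attempted items and recall over all items. These definitions and the name `sketch_calculate_metric` are assumptions for illustration. `Evaluator.calculate_metric` may compute the values differently.

```python
def sketch_calculate_metric(score_dict):
    """Hypothetical metric computation on a 0-100 scale from 1 / 0 / -1 scores."""
    flat = []
    for score in score_dict.values():
        # A sample may carry one score or a list of scores.
        flat.extend(score if isinstance(score, list) else [score])

    total = len(flat)
    correct = sum(1 for s in flat if s == 1)
    hit = sum(1 for s in flat if s != 0)  # items the caption attempted at all

    precision = 100.0 * correct / hit if hit else 0.0
    recall = 100.0 * correct / total if total else 0.0
    hit_rate = 100.0 * hit / total if total else 0.0
    denom = precision + recall
    f1_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "hit_rate": hit_rate,
        "f1_score": f1_score,
    }


example = sketch_calculate_metric({"a": 1, "b": 0, "c": -1, "d": [1, 1]})
print(example)
```

With five flattened scores, of which three are correct and four are attempted, the sketch gives a precision of 75.0, a recall of 60.0, and a hit rate of 80.0.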