Implementation:EvolvingLMMs Lab Lmms eval Capability Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Computer Vision, Benchmark Evaluation, LLM Evaluation
Last Updated	2026-02-14 00:00 GMT

Overview

GPT-based evaluation utilities for the Capability benchmark, which assesses multimodal models on video and image perception tasks.

Description

This module provides the evaluation infrastructure for the Capability benchmark, an assessment framework for video and image understanding. It includes document processing functions that convert dataset instances into model inputs, result processing functions that format model outputs, and a GPT-based evaluation system that scores model responses on precision, recall, and F1 score across 13 perception tasks (including object category, object number, object color, spatial relation, scene, camera angle/movement, OCR, style, character identification, events, and actions). The evaluator issues concurrent API calls to GPT models (OpenAI or Azure) to judge whether model-generated captions accurately describe the visual content, using the ground truth annotations as reference.
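The concurrent scoring described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: `score_caption` and `score_concurrently` are hypothetical names, the judge is a local stand-in rather than a real GPT call, and the use of a thread pool is an assumption (the source only states that the API calls are concurrent).

```python
from concurrent.futures import ThreadPoolExecutor


def score_caption(item):
    """Hypothetical stand-in for one GPT judgment; returns (file_id, score)."""
    # A real judge would send the caption and annotation to a GPT endpoint.
    # Here a case-insensitive exact match counts as correct (1), else incorrect (-1).
    matched = item["caption"].strip().lower() == item["annotation"].strip().lower()
    return item["file_id"], 1 if matched else -1


def score_concurrently(items, num_workers=4):
    """Score all items in parallel and collect the results keyed by file_id."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return dict(pool.map(score_caption, items))


items = [
    {"file_id": "a", "caption": "A dog runs", "annotation": "a dog runs"},
    {"file_id": "b", "caption": "A cat sleeps", "annotation": "A dog runs"},
]
scores = score_concurrently(items)
```

Threads suit this kind of workload because each call spends most of its time waiting on the network rather than computing.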

Usage

Use this module when evaluating multimodal models on the Capability benchmark. The main workflow is: (1) prepare inputs with capability_doc_to_visual and capability_doc_to_text, (2) collect model predictions, (3) format each prediction with capability_process_results, and (4) compute the final metrics with the aggregate functions. The Evaluator class handles the GPT-based scoring, with automatic retries, resumption from previously saved results, and format validation.
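The retry and resume behavior mentioned above might look something like the sketch below. It is an illustration under stated assumptions rather than the Evaluator's real implementation: the helper names, the JSONL record layout (`file_id` and `score` fields), and the injected `judge` callable are all hypothetical.

```python
import json
import os
import tempfile


def load_saved_scores(save_path):
    """Read previously saved scores from a JSONL file, if it exists."""
    saved = {}
    if os.path.exists(save_path):
        with open(save_path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    saved[record["file_id"]] = record["score"]
    return saved


def evaluate_with_resume(items, save_path, judge, max_retry_times=3):
    """Score items not yet present in save_path, retrying failed judge calls."""
    scores = load_saved_scores(save_path)
    with open(save_path, "a", encoding="utf-8") as f:
        for item in items:
            if item["file_id"] in scores:
                continue  # already scored in an earlier run
            for _ in range(max_retry_times):
                try:
                    score = judge(item)
                except Exception:
                    continue  # treat as a transient failure and try again
                scores[item["file_id"]] = score
                record = {"file_id": item["file_id"], "score": score}
                f.write(json.dumps(record) + "\n")
                break
    return scores


# Demo with a judge that fails once and then succeeds.
calls = {"n": 0}


def flaky_judge(item):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("simulated API failure")
    return 1


path = os.path.join(tempfile.mkdtemp(), "scores.jsonl")
first = evaluate_with_resume([{"file_id": "video_001"}], path, flaky_judge)
second = evaluate_with_resume([{"file_id": "video_001"}], path, flaky_judge)
```

The second call reads the saved record and does not invoke the judge again, which is the point of resuming: an interrupted run does not pay for the same API calls twice.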

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/capability/utils.py
Lines: 1-626

Signature

# Document conversion functions
def capability_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
    """Extract visual input (image or video path) from document."""

def capability_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Extract prompt text based on data type (image/video)."""

# Result processing functions
def capability_process_results(doc, results):
    """Process single prediction result."""

def capability_aggregate_inference_result(results, args):
    """Save inference results to JSONL file."""

def capability_aggregate_results(results, args):
    """Main aggregation using GPT-based evaluation."""

def capability_aggregate_precision(results, args):
    """Aggregate and return precision metric."""

def capability_aggregate_recall(results, args):
    """Aggregate and return recall metric."""

def capability_aggregate_f1score(results, args):
    """Aggregate and return F1 score metric."""

# Evaluator class
class Evaluator:
    def __init__(self, task, results, save_path, eval_model, headers,
                 num_process=0, max_allow_missing=5, max_retry_times=10,
                 auto_resume=True, strict_match=True):
        """Initialize evaluator with task-specific settings."""

    def evaluate_scores(self):
        """Evaluate all samples using GPT API with retry logic."""

    def calculate_metric(self, score_dict):
        """Calculate precision, recall, hit_rate, and F1 from scores."""

    def call_gpt(self, system_prompt, user_prompt):
        """Call GPT API with prompts."""

Import

from lmms_eval.tasks.capability.utils import (
    capability_doc_to_visual,
    capability_doc_to_text,
    capability_process_results,
    capability_aggregate_precision,
    capability_aggregate_recall,
    capability_aggregate_f1score,
    Evaluator
)

I/O Contract

Inputs

Name	Type	Required	Description
doc	dict	Yes	Document with file_id, file_path, data_type (image/video), annotation, task
results	list	Yes	List containing single prediction string from model
args	Namespace	Yes	Arguments with model name, log settings for file generation
lmms_eval_specific_kwargs	dict	No	Contains image_prompt/video_prompt templates
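
As a rough illustration of how the optional lmms_eval_specific_kwargs input could drive prompt selection, the sketch below picks a template by data_type. The helper `select_prompt` is hypothetical, and the empty-string fallback is an assumption; the module's actual defaults are not shown in this page.

```python
def select_prompt(doc, lmms_eval_specific_kwargs=None):
    """Return the prompt template matching doc['data_type'] (hypothetical helper)."""
    kwargs = lmms_eval_specific_kwargs or {}
    key = "video_prompt" if doc["data_type"] == "video" else "image_prompt"
    # Assumed fallback: an empty prompt when no template is supplied.
    return kwargs.get(key, "")


templates = {
    "image_prompt": "Describe this image:",
    "video_prompt": "Describe this video:",
}
video_text = select_prompt({"data_type": "video"}, templates)
image_text = select_prompt({"data_type": "image"}, templates)
```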

Outputs

Name	Type	Description
doc_to_visual return	list	List with single PIL Image or video file path
doc_to_text return	str	Formatted prompt text for the task
process_results return	dict	Dict with four keys (capability_inference_result, capability_precision, capability_recall, capability_f1_score), each mapping to the response dict
aggregate functions return	float	Precision, recall, or F1 score (0-100 scale)
Evaluator.evaluate_scores return	dict	Dict mapping file_id to score (1, 0, -1, or list)
Evaluator.calculate_metric return	dict	Dict with precision, recall, hit_rate, f1_score
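
To make the relationship between the score values and the reported metrics concrete, here is a minimal sketch. The formulas for precision, recall, and hit_rate are assumptions chosen for illustration (only the F1 definition as the harmonic mean of precision and recall is standard), `summarize_scores` is a hypothetical name, and the sketch handles scalar scores only, not the list-valued scores that evaluate_scores may return.

```python
def summarize_scores(score_dict):
    """Compute 0-100 metrics from scalar scores (1=correct, 0=no answer, -1=incorrect)."""
    values = list(score_dict.values())
    correct = sum(1 for s in values if s == 1)
    incorrect = sum(1 for s in values if s == -1)
    total = len(values)
    answered = correct + incorrect

    # Assumed definitions: precision over answered samples, recall over all samples,
    # hit_rate as the share of samples that received any answer.
    precision = 100.0 * correct / answered if answered else 0.0
    recall = 100.0 * correct / total if total else 0.0
    hit_rate = 100.0 * answered / total if total else 0.0
    denom = precision + recall
    f1_score = 2 * precision * recall / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "hit_rate": hit_rate,
        "f1_score": f1_score,
    }


metrics = summarize_scores({"a": 1, "b": -1, "c": 0, "d": 1})
```

Under these assumed definitions, a sample scored 0 lowers recall and hit_rate but leaves precision unchanged, since precision only considers samples where the model gave an answer.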

Usage Examples

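# Example 1: Document processing
from lmms_eval.tasks.capability.utils import (
    capability_doc_to_visual,
    capability_doc_to_text,
    capability_process_results,
)
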
doc = {
    "file_id": "video_001",
    "file_path": "hf://dataset/videos/sample.mp4",
    "data_type": "video",
    "annotation": "A person walking in the park",
    "task": "action"
}

# Get visual input
visual = capability_doc_to_visual(doc)  # Returns [file_path]

# Get text prompt
text = capability_doc_to_text(doc, {"video_prompt": "Describe this video:"})
# Returns: "Describe this video:"

# Example 2: Process model predictions
results = ["A person is walking through a park"]
processed = capability_process_results(doc, results)
# Returns dict with keys: capability_inference_result, capability_precision,
# capability_recall, capability_f1_score

# Example 3: Using Evaluator for GPT-based scoring
from lmms_eval.tasks.capability.utils import Evaluator

results_list = [
    {
        "file_id": "video_001",
        "caption": "A person walking in the park",
        "annotation": "A person walks in the park",
        "task": "action"
    }
]

evaluator = Evaluator(
    task="action",
    results=results_list,
    save_path="./eval_results/action.jsonl",
    eval_model="gpt-4",
    headers={"Authorization": "Bearer YOUR_KEY"},
    num_process=8,  # Use 8 parallel workers
    auto_resume=True  # Resume from saved results
)

# Evaluate with GPT
score_dict = evaluator.evaluate_scores()
# Example return value: {"video_001": 1}, where 1=correct, 0=no answer, -1=incorrect

# Calculate final metrics
metrics = evaluator.calculate_metric(score_dict)
# Example return value: {"precision": 100.0, "recall": 100.0, "hit_rate": 100.0, "f1_score": 100.0}
