Implementation:EvolvingLMMs Lab Lmms eval VSIBench Utils

Source File: `lmms_eval/tasks/vsibench/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VSIBench Utils module provides evaluation functions for VSIBench (Visual Spatial Intelligence Benchmark), which tests a model's ability to reason about spatial relationships, distances, and navigation in video environments. It supports two families of question types, each with its own metric: multiple-choice answer (MCA) questions scored by exact-match accuracy, and numeric answer (NA) questions scored by mean relative accuracy (MRA).

Key Functions

Document Processing

vsibench_doc_to_visual(doc)

Prepares video path for model input

Reads YAML configuration to determine cache directory
Constructs video path from dataset name and scene name
Validates video file exists
Raises FileExistsError if video not found
Returns list containing video file path
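
The path construction and existence check listed above could be sketched as follows. This is a minimal illustration, not the module's code: it assumes the document exposes "dataset" and "scene_name" keys, takes the cache directory as an explicit argument instead of reading it from YAML, and uses the hypothetical helper name build_video_path.

```python
import os


def build_video_path(doc, cache_dir):
    """Return a one-element list with the video path for doc (hypothetical helper).

    Assumes doc has "dataset" and "scene_name" keys, following the
    {cache_dir}/{dataset}/{scene_name}.mp4 layout.
    """
    video_path = os.path.join(cache_dir, doc["dataset"], doc["scene_name"] + ".mp4")
    if not os.path.exists(video_path):
        # This page states that FileExistsError is raised for a missing video;
        # FileNotFoundError would be the more conventional exception.
        raise FileExistsError(f"video path: {video_path} does not exist.")
    return [video_path]
```

Returning a list rather than a bare string matches the "list containing video file path" contract described above.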

vsibench_doc_to_text(doc, lmms_eval_specific_kwargs=None)

Formats question based on question type

Determines question type (MCA or NA)
For NA questions:
- Uses pre_prompt (default: "These are frames of a video.")
- Appends question text
- Uses na_post_prompt (default: "Please answer the question using a single word or phrase.")
For MCA questions:
- Uses pre_prompt
- Appends question text
- Appends the list of answer options (the model is asked to reply with an option's letter)
- Uses mca_post_prompt (default: "Answer with the option's letter from the given choices directly.")
Raises ValueError for unknown question types
Returns formatted prompt string
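
A rough sketch of this prompt assembly is shown below. The default prompt strings and the question-type lists come from this page; the document keys ("question", "question_type", "options"), the newline separators, and the "Options:" header are assumptions made for illustration.

```python
NA_QUESTION_TYPES = [
    "object_abs_distance",
    "object_counting",
    "object_size_estimation",
    "room_size_estimation",
]
MCA_QUESTION_TYPES = [
    "object_rel_direction_easy",
    "object_rel_direction_medium",
    "object_rel_direction_hard",
    "object_rel_distance",
    "route_planning",
    "obj_appearance_order",
]


def vsibench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Build the prompt for one document (sketch; key names are assumed)."""
    kwargs = lmms_eval_specific_kwargs or {}
    pre_prompt = kwargs.get("pre_prompt") or "These are frames of a video."
    question = doc["question"]
    question_type = doc["question_type"]

    if question_type in NA_QUESTION_TYPES:
        post_prompt = kwargs.get("na_post_prompt") or (
            "Please answer the question using a single word or phrase."
        )
        return "\n".join([pre_prompt, question, post_prompt])

    if question_type in MCA_QUESTION_TYPES:
        # Assumed layout: a header line followed by one option per line.
        options = "Options:\n" + "\n".join(doc["options"])
        post_prompt = kwargs.get("mca_post_prompt") or (
            "Answer with the option's letter from the given choices directly."
        )
        return "\n".join([pre_prompt, question, options, post_prompt])

    raise ValueError(f"Unknown question type: {question_type}")
```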

process_docs(dataset: datasets.Dataset) -> datasets.Dataset

Optional dataset shuffling

Checks environment variable LMMS_EVAL_SHUFFLE_DOCS
If set, shuffles dataset with seed 42
Logs shuffling action
Returns processed or original dataset
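
The environment-variable gate can be illustrated with the short sketch below. It assumes only that the dataset object offers a shuffle(seed=...) method, as datasets.Dataset does, and it uses the standard logging module as a stand-in for the loguru logger listed under Dependencies.

```python
import logging
import os

logger = logging.getLogger(__name__)


def process_docs(dataset):
    """Shuffle the dataset with a fixed seed when LMMS_EVAL_SHUFFLE_DOCS is set."""
    if os.getenv("LMMS_EVAL_SHUFFLE_DOCS"):
        logger.info("LMMS_EVAL_SHUFFLE_DOCS detected, dataset will be shuffled.")
        # A fixed seed keeps the shuffled order reproducible across runs.
        return dataset.shuffle(seed=42)
    return dataset
```

In this sketch an empty value for the variable counts as unset, because an empty string is falsy.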

Answer Processing

fuzzy_matching(pred)

Extracts answer from prediction text

Takes first word from prediction
Strips trailing periods
Returns cleaned answer string

to_float(pred)

Converts prediction to float

Attempts float conversion
Returns None on failure
Used for numeric answer questions
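
The two parsing helpers are small enough to sketch directly from the descriptions above. Treat this as an approximation of the behavior rather than the module's exact code; in particular, the set of exceptions caught in to_float is an assumption.

```python
def fuzzy_matching(pred):
    """Keep the first space-separated word and drop trailing periods."""
    return pred.split(" ")[0].rstrip(".").strip()


def to_float(pred):
    """Return pred as a float, or None when it cannot be converted."""
    try:
        return float(pred)
    except (TypeError, ValueError):
        return None
```

For example, fuzzy_matching("B. The sofa") yields "B", and to_float("three") yields None, which the results-processing step later treats as a failed parse.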

Evaluation Metrics

exact_match(pred, target)

Checks exact string match (case-insensitive)

Returns 1.0 if match, 0.0 otherwise
Used for multiple-choice accuracy

abs_dist_norm(pred, target)

Computes normalized absolute distance

Calculates |pred - target| / target
Used as error measure for numeric predictions

mean_relative_accuracy(pred, target, start, end, interval)

Computes mean relative accuracy across confidence thresholds

Generates evenly spaced confidence thresholds from start to end, spaced by interval
Example: start=0.5, end=0.95, interval=0.05 creates [0.5, 0.55, ..., 0.95]
For each confidence level c:
- Checks if abs_dist_norm(pred, target) ≤ 1 - c
- Records binary accuracy at that threshold
Returns mean accuracy across all thresholds
More lenient than exact match: approximately correct numeric answers earn partial credit
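
The metric can be made concrete with a plain-Python sketch. The module itself relies on numpy according to the Dependencies list; this version avoids it so the arithmetic is visible, and the way the threshold count is computed here is an assumption chosen to produce thresholds from start to end inclusive.

```python
def abs_dist_norm(pred, target):
    """Relative error of pred with respect to target."""
    return abs(pred - target) / target


def mean_relative_accuracy(pred, target, start=0.5, end=0.95, interval=0.05):
    """Fraction of confidence thresholds c for which the relative error is <= 1 - c."""
    # Number of evenly spaced thresholds from start to end inclusive.
    num_thresholds = round((end - start) / interval) + 1
    thresholds = [start + i * interval for i in range(num_thresholds)]
    error = abs_dist_norm(pred, target)
    hits = [1.0 if error <= 1 - c else 0.0 for c in thresholds]
    return sum(hits) / len(hits)
```

As a worked case, a prediction of 8.8 against a target of 10 has a relative error of 0.12. That passes the eight thresholds from 0.5 to 0.85 (where 1 - c is at least 0.15) and fails at 0.9 and 0.95, giving a score of 0.8. Predictions whose error lands exactly on a threshold boundary are sensitive to floating-point rounding.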

Results Processing

vsibench_process_results(doc, results)

Processes prediction and computes metrics

Stores prediction in document
Applies fuzzy_matching to extract answer
For MCA question types:
- Evaluates using exact_match against ground truth
- Stores accuracy in document
For NA question types:
- Converts prediction and ground truth to float
- Computes MRA:.5:.95:.05 metric
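- Falls back to the worst-case value (0.0) on TypeError, which arises when to_float returns None for an unparseable prediction and the metric then attempts arithmetic on it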
Raises ValueError for unknown question types
Returns dictionary with "vsibench_score" key containing updated document
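
The dispatch between the two question families, including the TypeError fallback, might look like the sketch below. Several details are assumptions: the prediction is taken as results[0], the reference answer is read from a "ground_truth" key, the question-type lists are abbreviated, the metric tables map directly to callables instead of the string names shown under Metrics Configuration, and the document is copied rather than updated in place.

```python
from functools import partial

MCA_QUESTION_TYPES = ["object_rel_distance", "route_planning"]  # abbreviated
NA_QUESTION_TYPES = ["object_abs_distance", "object_counting"]  # abbreviated
WORST_CASE_FOR_METRICS = {"accuracy": 0.0, "MRA:.5:.95:.05": 0.0}


def fuzzy_matching(pred):
    return pred.split(" ")[0].rstrip(".").strip()


def to_float(pred):
    try:
        return float(pred)
    except (TypeError, ValueError):
        return None


def exact_match(pred, target):
    return 1.0 if pred.lower() == target.lower() else 0.0


def mean_relative_accuracy(pred, target, start, end, interval):
    num_thresholds = round((end - start) / interval) + 1
    thresholds = [start + i * interval for i in range(num_thresholds)]
    error = abs(pred - target) / target  # raises TypeError when pred is None
    return sum(1.0 if error <= 1 - c else 0.0 for c in thresholds) / num_thresholds


METRICS_FOR_MCA = {"accuracy": exact_match}
METRICS_FOR_NA = {
    "MRA:.5:.95:.05": partial(mean_relative_accuracy, start=0.5, end=0.95, interval=0.05),
}


def vsibench_process_results(doc, results):
    doc = dict(doc)  # sketch choice: work on a copy of the document
    doc["prediction"] = results[0]  # assumed: first result is the model output
    answer = fuzzy_matching(doc["prediction"])
    if doc["question_type"] in MCA_QUESTION_TYPES:
        for key, metric in METRICS_FOR_MCA.items():
            doc[key] = metric(answer, doc["ground_truth"])
    elif doc["question_type"] in NA_QUESTION_TYPES:
        for key, metric in METRICS_FOR_NA.items():
            try:
                doc[key] = metric(to_float(answer), to_float(doc["ground_truth"]))
            except TypeError:
                # Unparseable number: score it with the worst-case value.
                doc[key] = WORST_CASE_FOR_METRICS[key]
    else:
        raise ValueError(f"Unknown question type: {doc['question_type']}")
    return {"vsibench_score": doc}
```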

Aggregation

vsibench_aggregate_results(results)

Aggregates results by question type and computes final scores

Converts results to pandas DataFrame
Groups by question_type
For each question type:
- For MCA types: computes mean accuracy
- For NA types: computes mean MRA
Combines three object_rel_direction difficulty levels:
- Averages easy, medium, hard accuracy into single object_rel_direction_accuracy
- Removes individual difficulty scores
Computes overall score as mean of all category scores
Logs evaluation results
Returns dictionary with per-category and overall scores
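
The aggregation steps can be approximated without pandas, as in the sketch below. It uses a plain dictionary grouping in place of a DataFrame groupby, assumes output keys of the form {question_type}_{metric}, and assumes all three object_rel_direction difficulty levels are present in the results.

```python
from collections import defaultdict

MCA_QUESTION_TYPES = [
    "object_rel_direction_easy",
    "object_rel_direction_medium",
    "object_rel_direction_hard",
    "object_rel_distance",
    "route_planning",
    "obj_appearance_order",
]
NA_QUESTION_TYPES = [
    "object_abs_distance",
    "object_counting",
    "object_size_estimation",
    "room_size_estimation",
]


def aggregate_results(results):
    """Average per-document scores by question type, then average the categories."""
    by_type = defaultdict(list)
    for row in results:
        by_type[row["question_type"]].append(row)

    output = {}
    for question_type, rows in by_type.items():
        if question_type in MCA_QUESTION_TYPES:
            metric = "accuracy"
        elif question_type in NA_QUESTION_TYPES:
            metric = "MRA:.5:.95:.05"
        else:
            raise ValueError(f"Unknown question type: {question_type}")
        output[f"{question_type}_{metric}"] = sum(r[metric] for r in rows) / len(rows)

    # Merge the three difficulty levels into one score; pop() removes the
    # individual entries and raises KeyError if a level is missing.
    levels = ["easy", "medium", "hard"]
    output["object_rel_direction_accuracy"] = (
        sum(output.pop(f"object_rel_direction_{level}_accuracy") for level in levels) / 3.0
    )

    # The right-hand side is evaluated before "overall" is added to the dict.
    output["overall"] = sum(output.values()) / len(output)
    return output
```

Because the overall score is a mean of category scores, each category carries equal weight regardless of how many questions it contains.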

Question Types

Multiple-Choice Answer (MCA) Questions

MCA_QUESTION_TYPES = [
    "object_rel_direction_easy",
    "object_rel_direction_medium",
    "object_rel_direction_hard",
    "object_rel_distance",
    "route_planning",
    "obj_appearance_order",
]

Numeric Answer (NA) Questions

NA_QUESTION_TYPES = [
    "object_abs_distance",
    "object_counting",
    "object_size_estimation",
    "room_size_estimation",
]

Metrics Configuration

MCA Metrics

METRICS_FOR_MCA = {
    "accuracy": "exact_match",
}

NA Metrics

METRICS_FOR_NA = {
    "MRA:.5:.95:.05": "partial(mean_relative_accuracy, start=.5, end=.95, interval=.05)",
}

MRA (Mean Relative Accuracy) evaluates predictions across confidence thresholds from 50% to 95% in 5% increments.

Worst-Case Values

WORST_CASE_FOR_METRICS = {
    "accuracy": 0.0,
    "MRA:.5:.95:.05": 0.0,
}

Used when prediction parsing fails.

Design Characteristics

Dual Evaluation Modes: Separate handling for categorical and numeric answers
Lenient Numeric Metric: MRA rewards approximately correct numeric predictions
Spatial Reasoning Focus: Tests understanding of 3D space from video
Difficulty Stratification: Multiple difficulty levels for directional reasoning
Robust Parsing: Fuzzy matching extracts answers from verbose responses
Environment Integration: Reads cache configuration from YAML files
Optional Shuffling: Supports controlled dataset randomization

Dependencies

os - File system operations
functools.partial - Partial function application for metrics
pathlib.Path - Path manipulation
datasets - HuggingFace datasets library
numpy - Array operations for MRA computation
pandas - DataFrame operations for result aggregation
yaml - Configuration file parsing
loguru.logger - Logging

Usage Context

This module supports the VSIBench benchmark, which evaluates vision-language models' spatial intelligence in video environments. It tests abilities like directional reasoning (left/right/forward), distance estimation, counting objects, and route planning. The dual metric system handles both categorical (MCA) and continuous (NA) spatial reasoning tasks, with MRA providing nuanced evaluation of numeric predictions.

Cache Configuration

The base cache directory is resolved as follows:

Environment variable HF_HOME, if set
Otherwise the default: ~/.cache/huggingface/
The task-specific subdirectory read from vsibench.yaml is then appended

Video files organized as: {cache_dir}/{dataset}/{scene_name}.mp4
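
The lookup order above could be sketched as follows. The helper name resolve_cache_dir and its parameters are hypothetical, and the task-specific directory name is passed in directly because the YAML key that holds it is not specified on this page.

```python
import os


def resolve_cache_dir(cache_name, environ=None):
    """Join the Hugging Face base directory with a task-specific name (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    # Fall back to the default location when HF_HOME is not set.
    hf_home = environ.get("HF_HOME", "~/.cache/huggingface/")
    return os.path.join(os.path.expanduser(hf_home), cache_name)
```

Accepting the environment as an optional argument keeps the sketch easy to exercise without modifying the real process environment.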
