Implementation:EvolvingLMMs Lab Lmms eval VSIBench Utils

From Leeroopedia

Source File: `lmms_eval/tasks/vsibench/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VSIBench Utils module provides evaluation functions for VSIBench (Visual Spatial Intelligence Benchmark), which tests a model's ability to reason about spatial relationships, distances, and navigation in video environments. It handles two groups of question types, each with its own metric: multiple-choice (MCA) questions scored by exact-match accuracy, and numeric-answer (NA) questions scored by mean relative accuracy (MRA).

Key Functions

Document Processing

vsibench_doc_to_visual(doc)
Prepares video path for model input
  • Reads YAML configuration to determine cache directory
  • Constructs video path from dataset name and scene name
  • Validates video file exists
  • Raises FileExistsError if video not found
  • Returns list containing video file path
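The steps above can be sketched roughly as follows. This is a minimal illustration rather than the module's code: `resolve_video_path` and its explicit `cache_dir` argument are hypothetical (the module reads the cache directory from YAML), and the `dataset` and `scene_name` keys are assumed from the path layout described on this page.

```python
import os


def resolve_video_path(doc, cache_dir):
    """Build {cache_dir}/{dataset}/{scene_name}.mp4 and check that it exists.

    Hypothetical helper: `cache_dir` is passed in explicitly here instead of
    being read from the task YAML.
    """
    video_path = os.path.join(cache_dir, doc["dataset"], doc["scene_name"] + ".mp4")
    if not os.path.exists(video_path):
        # This page documents FileExistsError for a missing video file.
        raise FileExistsError(f"video path: {video_path} does not exist.")
    # A single-element list, matching the documented return value.
    return [video_path]
```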
vsibench_doc_to_text(doc, lmms_eval_specific_kwargs=None)
Formats question based on question type
  • Determines question type (MCA or NA)
  • For NA questions:
    • Uses pre_prompt (default: "These are frames of a video.")
    • Appends question text
    • Uses na_post_prompt (default: "Please answer the question using a single word or phrase.")
  • For MCA questions:
    • Uses pre_prompt
    • Appends question text
    • Lists the answer options, one per line
    • Uses mca_post_prompt (default: "Answer with the option's letter from the given choices directly.")
  • Raises ValueError for unknown question types
  • Returns formatted prompt string
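A sketch of the prompt assembly described above, under stated assumptions: `build_prompt` is a hypothetical name, the `question`, `question_type`, and `options` keys are assumed, and joining the parts with newlines under an "Options:" header is an illustrative choice. The default prompt strings are the ones listed on this page.

```python
MCA_QUESTION_TYPES = [
    "object_rel_direction_easy",
    "object_rel_direction_medium",
    "object_rel_direction_hard",
    "object_rel_distance",
    "route_planning",
    "obj_appearance_order",
]
NA_QUESTION_TYPES = [
    "object_abs_distance",
    "object_counting",
    "object_size_estimation",
    "room_size_estimation",
]


def build_prompt(doc, kwargs=None):
    """Format a question according to its type (hypothetical helper)."""
    kwargs = kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "These are frames of a video.")
    question = doc["question"]
    qtype = doc["question_type"]

    if qtype in NA_QUESTION_TYPES:
        post_prompt = kwargs.get(
            "na_post_prompt",
            "Please answer the question using a single word or phrase.",
        )
        return "\n".join([pre_prompt, question, post_prompt])

    if qtype in MCA_QUESTION_TYPES:
        # Assumed layout: an "Options:" header followed by one option per line.
        options = "Options:\n" + "\n".join(doc["options"])
        post_prompt = kwargs.get(
            "mca_post_prompt",
            "Answer with the option's letter from the given choices directly.",
        )
        return "\n".join([pre_prompt, question, options, post_prompt])

    raise ValueError(f"Unknown question type: {qtype}")
```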
process_docs(dataset: datasets.Dataset) -> datasets.Dataset
Optional dataset shuffling
  • Checks environment variable LMMS_EVAL_SHUFFLE_DOCS
  • If set, shuffles dataset with seed 42
  • Logs shuffling action
  • Returns processed or original dataset
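The shuffling behavior might look roughly like this. `maybe_shuffle` is a hypothetical name, and treating an empty variable as "not set" is an assumption of this sketch. The only library call relied on is `shuffle(seed=...)`, which `datasets.Dataset` provides; the sketch works with any object exposing a compatible method.

```python
import os


def maybe_shuffle(dataset):
    """Shuffle with a fixed seed when LMMS_EVAL_SHUFFLE_DOCS is set.

    Hypothetical helper: `dataset` is expected to behave like a
    datasets.Dataset, i.e. to offer shuffle(seed=...).
    """
    if os.getenv("LMMS_EVAL_SHUFFLE_DOCS"):
        # A fixed seed keeps the shuffled order reproducible across runs.
        return dataset.shuffle(seed=42)
    return dataset
```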

Answer Processing

fuzzy_matching(pred)
Extracts answer from prediction text
  • Takes first word from prediction
  • Strips trailing periods
  • Returns cleaned answer string
to_float(pred)
Converts prediction to float
  • Attempts float conversion
  • Returns None on failure
  • Used for numeric answer questions
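The two helpers above are small enough to sketch directly. The bodies below are an illustration consistent with the bullet points, not a copy of the module; in particular, splitting on a single space and catching only `TypeError` and `ValueError` are assumptions.

```python
def fuzzy_matching(pred):
    """Keep the first space-separated word and drop trailing periods."""
    return pred.split(" ")[0].rstrip(".").strip()


def to_float(pred):
    """Return pred as a float, or None if it cannot be converted."""
    try:
        return float(pred)
    except (TypeError, ValueError):
        return None
```

Under this sketch, a verbose reply such as "B. The chair" reduces to "B", and "3.5 meters" reduces to "3.5", which `to_float` can then convert.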

Evaluation Metrics

exact_match(pred, target)
Checks exact string match (case-insensitive)
  • Returns 1.0 if match, 0.0 otherwise
  • Used for multiple-choice accuracy
abs_dist_norm(pred, target)
Computes normalized absolute distance
  • Calculates |pred - target| / target
  • Used as error measure for numeric predictions
mean_relative_accuracy(pred, target, start, end, interval)
Computes mean relative accuracy across a range of confidence thresholds
  • Generates evenly spaced confidence thresholds from start to end in steps of interval
  • Example: start=0.5, end=0.95, interval=0.05 gives the thresholds [0.5, 0.55, ..., 0.95]
  • For each confidence threshold c:
    • Checks if abs_dist_norm(pred, target) ≤ 1 - c
    • Records binary accuracy at that threshold
  • Returns mean accuracy across all thresholds
  • More lenient than exact match: a prediction earns partial credit according to how many thresholds it satisfies
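The three metrics can be sketched in plain Python as below. This is an illustration under assumptions, not the module's implementation: the module lists numpy as a dependency for the MRA computation, whereas this sketch uses a list comprehension, and the threshold count is computed with `round` to avoid floating-point truncation. The default arguments mirror the `MRA:.5:.95:.05` configuration.

```python
def exact_match(pred, target):
    """Case-insensitive string equality as 1.0 / 0.0."""
    return 1.0 if pred.lower() == target.lower() else 0.0


def abs_dist_norm(pred, target):
    """Relative error |pred - target| / target (target must be non-zero)."""
    return abs(pred - target) / target


def mean_relative_accuracy(pred, target, start=0.5, end=0.95, interval=0.05):
    """Fraction of confidence thresholds c with relative error <= 1 - c."""
    # Number of evenly spaced thresholds from start to end inclusive.
    num_pts = round((end - start) / interval) + 1
    thresholds = [start + i * interval for i in range(num_pts)]
    err = abs_dist_norm(pred, target)
    hits = [1.0 if err <= 1 - c else 0.0 for c in thresholds]
    return sum(hits) / len(hits)
```

For instance, `mean_relative_accuracy(9.2, 10.0)` scores 0.9 under this sketch, since the relative error of 0.08 clears every threshold except 0.95.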

Results Processing

vsibench_process_results(doc, results)
Processes prediction and computes metrics
  • Stores prediction in document
  • Applies fuzzy_matching to extract answer
  • For MCA question types:
    • Evaluates using exact_match against ground truth
    • Stores accuracy in document
  • For NA question types:
    • Converts prediction and ground truth to float
    • Computes MRA:.5:.95:.05 metric
    • Falls back to the worst-case value (0.0) if a TypeError is raised, for example when the prediction could not be converted to a float
  • Raises ValueError for unknown question types
  • Returns dictionary with "vsibench_score" key containing updated document
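Putting the pieces together, the per-document flow might look like the sketch below. The name `process_results_sketch` and the `prediction` and `ground_truth` keys are assumptions for illustration, and the sketch copies the document instead of updating it in place. The metric keys and the 0.0 fallback follow the configuration listed on this page.

```python
MCA_QUESTION_TYPES = [
    "object_rel_direction_easy", "object_rel_direction_medium",
    "object_rel_direction_hard", "object_rel_distance",
    "route_planning", "obj_appearance_order",
]
NA_QUESTION_TYPES = [
    "object_abs_distance", "object_counting",
    "object_size_estimation", "room_size_estimation",
]
MRA_KEY = "MRA:.5:.95:.05"
THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]  # 0.5, 0.55, ..., 0.95


def mra(pred, target):
    """Share of thresholds c for which the relative error is <= 1 - c."""
    err = abs(pred - target) / target
    return sum(err <= 1 - c for c in THRESHOLDS) / len(THRESHOLDS)


def process_results_sketch(doc, results):
    doc = dict(doc)  # sketch choice: work on a copy of the record
    doc["prediction"] = results[0]
    # First word of the reply, without trailing periods.
    answer = results[0].split(" ")[0].rstrip(".").strip()
    qtype = doc["question_type"]

    if qtype in MCA_QUESTION_TYPES:
        match = answer.lower() == doc["ground_truth"].lower()
        doc["accuracy"] = 1.0 if match else 0.0
    elif qtype in NA_QUESTION_TYPES:
        try:
            doc[MRA_KEY] = mra(float(answer), float(doc["ground_truth"]))
        except (TypeError, ValueError):
            doc[MRA_KEY] = 0.0  # worst-case value when parsing fails
    else:
        raise ValueError(f"Unknown question type: {qtype}")

    return {"vsibench_score": doc}
```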

Aggregation

vsibench_aggregate_results(results)
Aggregates results by question type and computes final scores
  • Converts results to pandas DataFrame
  • Groups by question_type
  • For each question type:
    • For MCA types: computes mean accuracy
    • For NA types: computes mean MRA
  • Combines three object_rel_direction difficulty levels:
    • Averages easy, medium, hard accuracy into single object_rel_direction_accuracy
    • Removes individual difficulty scores
  • Computes overall score as mean of all category scores
  • Logs evaluation results
  • Returns dictionary with per-category and overall scores
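The aggregation steps can be illustrated without pandas. The sketch below swaps the DataFrame group-by for a plain dictionary of lists, so it shows the logic rather than the module's code. `aggregate_sketch`, the `"overall"` key, and the `{question_type}_{metric}` naming for NA categories are assumptions; `object_rel_direction_accuracy` is the combined key named above.

```python
from collections import defaultdict

MCA_QUESTION_TYPES = [
    "object_rel_direction_easy", "object_rel_direction_medium",
    "object_rel_direction_hard", "object_rel_distance",
    "route_planning", "obj_appearance_order",
]
MRA_KEY = "MRA:.5:.95:.05"


def aggregate_sketch(results):
    """Average scores per question type, then overall.

    Assumes `results` is a non-empty list of per-document records.
    """
    scores_by_type = defaultdict(list)
    for doc in results:
        qtype = doc["question_type"]
        metric = "accuracy" if qtype in MCA_QUESTION_TYPES else MRA_KEY
        scores_by_type[f"{qtype}_{metric}"].append(doc[metric])

    output = {key: sum(vals) / len(vals) for key, vals in scores_by_type.items()}

    # Merge the object_rel_direction difficulty levels into a single score
    # and drop the individual entries.
    levels = [
        f"object_rel_direction_{lvl}_accuracy" for lvl in ("easy", "medium", "hard")
    ]
    present = [output.pop(key) for key in levels if key in output]
    if present:
        output["object_rel_direction_accuracy"] = sum(present) / len(present)

    # Overall score: unweighted mean of the category scores.
    output["overall"] = sum(output.values()) / len(output)
    return output
```

Because the overall score is an unweighted mean over categories, a category with few questions counts as much as one with many.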

Question Types

Multiple-Choice Accuracy (MCA) Questions

MCA_QUESTION_TYPES = [
    "object_rel_direction_easy",
    "object_rel_direction_medium",
    "object_rel_direction_hard",
    "object_rel_distance",
    "route_planning",
    "obj_appearance_order",
]

Numeric Answer (NA) Questions

NA_QUESTION_TYPES = [
    "object_abs_distance",
    "object_counting",
    "object_size_estimation",
    "room_size_estimation",
]

Metrics Configuration

MCA Metrics

METRICS_FOR_MCA = {
    "accuracy": "exact_match",
}

NA Metrics

METRICS_FOR_NA = {
    "MRA:.5:.95:.05": "partial(mean_relative_accuracy, start=.5, end=.95, interval=.05)",
}

MRA (Mean Relative Accuracy) evaluates predictions across confidence thresholds from 50% to 95% in 5% increments.
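In symbols, and following the description on this page, the metric can be written as below. The notation is introduced here for illustration: \hat{y} is the prediction, y the ground truth, and C the set of confidence thresholds.

```latex
\mathrm{MRA}(\hat{y}, y)
  = \frac{1}{|C|} \sum_{c \in C}
    \mathbb{1}\!\left[ \frac{|\hat{y} - y|}{y} \le 1 - c \right],
\qquad
C = \{0.50,\ 0.55,\ \ldots,\ 0.95\}
```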

Worst-Case Values

WORST_CASE_FOR_METRICS = {
    "accuracy": 0.0,
    "MRA:.5:.95:.05": 0.0,
}

Used when prediction parsing fails.

Design Characteristics

  • Dual Evaluation Modes: Separate handling for categorical and numeric answers
  • Lenient Numeric Metric: MRA rewards approximately correct numeric predictions
  • Spatial Reasoning Focus: Tests understanding of 3D space from video
  • Difficulty Stratification: Multiple difficulty levels for directional reasoning
  • Robust Parsing: Fuzzy matching extracts answers from verbose responses
  • Environment Integration: Reads cache configuration from YAML files
  • Optional Shuffling: Supports controlled dataset randomization

Dependencies

  • os - File system operations
  • functools.partial - Partial function application for metrics
  • pathlib.Path - Path manipulation
  • datasets - HuggingFace datasets library
  • numpy - Array operations for MRA computation
  • pandas - DataFrame operations for result aggregation
  • yaml - Configuration file parsing
  • loguru.logger - Logging

Usage Context

This module supports the VSIBench benchmark, which evaluates vision-language models' spatial intelligence in video environments. It tests abilities like directional reasoning (left/right/forward), distance estimation, counting objects, and route planning. The dual metric system handles both categorical (MCA) and continuous (NA) spatial reasoning tasks, with MRA providing nuanced evaluation of numeric predictions.

Cache Configuration

The cache directory is determined from:

  • Environment variable HF_HOME, if set
  • Otherwise the default: ~/.cache/huggingface/
  • Task-specific path from YAML: vsibench.yaml

Video files are organized as: {cache_dir}/{dataset}/{scene_name}.mp4
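A minimal sketch of the lookup order above. `resolve_cache_dir` and its `task_cache_dir` argument are hypothetical: the argument stands in for the task-specific value read from vsibench.yaml, whose key name is not given on this page, and appending it to the base directory is an assumption.

```python
import os


def resolve_cache_dir(task_cache_dir):
    """Join the base Hugging Face cache directory with a task-specific path.

    Hypothetical helper: `task_cache_dir` represents the value that the
    module reads from vsibench.yaml.
    """
    # HF_HOME wins if set; otherwise fall back to the default location.
    base_cache_dir = os.path.expanduser(
        os.getenv("HF_HOME", "~/.cache/huggingface/")
    )
    return os.path.join(base_cache_dir, task_cache_dir)
```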
