Implementation:EvolvingLMMs Lab Lmms eval VSIBench Utils
Source File: `lmms_eval/tasks/vsibench/utils.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The VSIBench Utils module provides evaluation functions for VSIBench (Visual Spatial Intelligence Benchmark), which tests a model's ability to reason about spatial relationships, distances, and navigation in video environments. It supports two families of question types, each with its own metric: multiple-choice (MCA) questions scored by exact-match accuracy, and numeric-answer (NA) questions scored by mean relative accuracy (MRA).
Key Functions
Document Processing
vsibench_doc_to_visual(doc) - Prepares video path for model input
- Reads YAML configuration to determine cache directory
- Constructs video path from dataset name and scene name
- Validates video file exists
- Raises FileExistsError if video not found
- Returns list containing video file path
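The path construction and existence check described above can be sketched as follows. This is a minimal illustration, not the module's code: the helper name `build_video_path`, the `cache_dir` parameter, and the document keys `dataset` and `scene_name` are assumptions chosen to match the bullets.

```python
import os


def build_video_path(doc, cache_dir):
    """Hypothetical helper: join cache dir, dataset name, and scene name."""
    # Assumed layout: {cache_dir}/{dataset}/{scene_name}.mp4
    video_path = os.path.join(cache_dir, doc["dataset"], doc["scene_name"] + ".mp4")
    if not os.path.exists(video_path):
        # The documented behavior raises FileExistsError for a missing file,
        # even though FileNotFoundError would be the more conventional choice.
        raise FileExistsError(f"video path: {video_path} does not exist.")
    # A single-element list, as described above.
    return [video_path]
```

Callers that want to skip missing videos rather than abort would need to catch `FileExistsError` specifically, since it is not the exception most code expects for an absent file.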
vsibench_doc_to_text(doc, lmms_eval_specific_kwargs=None) - Formats question based on question type
- Determines question type (MCA or NA)
- For NA questions:
  - Uses pre_prompt (default: "These are frames of a video.")
  - Appends question text
  - Uses na_post_prompt (default: "Please answer the question using a single word or phrase.")
- For MCA questions:
  - Uses pre_prompt
  - Appends question text
  - Formats options as numbered list
  - Uses mca_post_prompt (default: "Answer with the option's letter from the given choices directly.")
- Raises ValueError for unknown question types
- Returns formatted prompt string
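A rough sketch of the prompt assembly described above. Several details are assumptions made for illustration: the helper name `build_prompt`, the document keys `question`, `question_type`, and `options`, the newline separators, the "Options:" header, and the idea that each option string already carries its own label (for example "A. left"). The type sets are a small subset of the real lists.

```python
MCA_TYPES = {"object_rel_distance", "route_planning"}   # illustrative subset
NA_TYPES = {"object_counting", "room_size_estimation"}  # illustrative subset


def build_prompt(doc, kwargs=None):
    """Hypothetical helper mirroring the NA / MCA prompt layouts."""
    kwargs = kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "These are frames of a video.")
    question = doc["question"]
    qtype = doc["question_type"]

    if qtype in NA_TYPES:
        post_prompt = kwargs.get(
            "na_post_prompt",
            "Please answer the question using a single word or phrase.",
        )
        return "\n".join([pre_prompt, question, post_prompt])

    if qtype in MCA_TYPES:
        post_prompt = kwargs.get(
            "mca_post_prompt",
            "Answer with the option's letter from the given choices directly.",
        )
        # Assumption: options are listed one per line under an "Options:" header.
        options = "Options:\n" + "\n".join(doc["options"])
        return "\n".join([pre_prompt, question, options, post_prompt])

    raise ValueError(f"Unknown question type: {qtype}")
```

Passing a dictionary such as `{"pre_prompt": "Watch the clip."}` overrides only that default and leaves the post-prompts untouched.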
process_docs(dataset: datasets.Dataset) -> datasets.Dataset - Optional dataset shuffling
- Checks environment variable LMMS_EVAL_SHUFFLE_DOCS
- If set, shuffles dataset with seed 42
- Logs shuffling action
- Returns processed or original dataset
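The environment-controlled shuffle can be sketched as below. The helper name `maybe_shuffle` is hypothetical, logging is omitted, and treating an empty string as "not set" is an assumption. The only library call relied on is `Dataset.shuffle(seed=...)`, which returns a new shuffled dataset.

```python
import os


def maybe_shuffle(dataset):
    """Hypothetical helper: shuffle only when LMMS_EVAL_SHUFFLE_DOCS is set."""
    # os.getenv returns None when the variable is absent; an empty string is
    # also treated as "not set" here (an assumption of this sketch).
    if os.getenv("LMMS_EVAL_SHUFFLE_DOCS"):
        # A fixed seed keeps the shuffled order reproducible across runs.
        return dataset.shuffle(seed=42)
    return dataset
```

Because the seed is fixed, two runs with the variable set see the same document order, which keeps partial evaluations (for example with a document limit) comparable.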
Answer Processing
fuzzy_matching(pred) - Extracts answer from prediction text
- Takes first word from prediction
- Strips trailing periods
- Returns cleaned answer string
to_float(pred) - Converts prediction to float
- Attempts float conversion
- Returns None on failure
- Used for numeric answer questions
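A minimal sketch of the two helpers, written to be consistent with the bullets above; the actual implementation may differ in details such as how leading whitespace or empty predictions are handled, which are assumptions here.

```python
def fuzzy_matching(pred):
    """Keep the first whitespace-separated token, minus trailing periods."""
    tokens = pred.split()
    # Assumption: an empty or whitespace-only prediction yields "".
    return tokens[0].rstrip(".") if tokens else ""


def to_float(pred):
    """Return float(pred), or None when the value cannot be parsed."""
    try:
        return float(pred)
    except (TypeError, ValueError):
        return None


print(fuzzy_matching("A. The sofa is closer"))  # prints: A
print(to_float(fuzzy_matching("3.5 meters")))   # prints: 3.5
print(to_float("three"))                        # prints: None
```

Returning `None` instead of raising lets the caller decide how to score an unparseable numeric answer.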
Evaluation Metrics
exact_match(pred, target) - Checks exact string match (case-insensitive)
- Returns 1.0 if match, 0.0 otherwise
- Used for multiple-choice accuracy
abs_dist_norm(pred, target) - Computes normalized absolute distance
- Calculates |pred - target| / target
- Used as error measure for numeric predictions
mean_relative_accuracy(pred, target, start, end, interval) - Computes mean relative accuracy across confidence thresholds
- Generates confidence levels from start to end, spaced by interval
- Example: start=0.5, end=0.95, interval=0.05 creates [0.5, 0.55, ..., 0.95]
- For each confidence level c:
  - Checks if abs_dist_norm(pred, target) ≤ 1 - c
  - Records binary accuracy at that threshold
- Returns mean accuracy across all thresholds
- More lenient than exact match: rewards approximately correct answers
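The three metrics can be sketched in plain Python as follows. This is an illustrative stand-in rather than the module's code: the module lists numpy as a dependency, whereas this version uses only built-ins, and it assumes the confidence levels run from start to end inclusive (ten levels for the default arguments).

```python
def exact_match(pred, target):
    """1.0 when the strings match ignoring case, else 0.0."""
    return 1.0 if pred.lower() == target.lower() else 0.0


def abs_dist_norm(pred, target):
    """Relative error |pred - target| / target."""
    return abs(pred - target) / target


def mean_relative_accuracy(pred, target, start=0.5, end=0.95, interval=0.05):
    """Fraction of confidence levels c for which the error is <= 1 - c."""
    # round() guards against floating-point drift in (end - start) / interval.
    n_levels = round((end - start) / interval) + 1
    levels = [start + i * interval for i in range(n_levels)]
    error = abs_dist_norm(pred, target)
    hits = [1.0 if error <= 1 - c else 0.0 for c in levels]
    return sum(hits) / len(hits)


# Worked example: pred=8.8, target=10 gives a relative error of about 0.12.
# That error passes the thresholds for c = 0.5 ... 0.85 (8 of 10 levels)
# and fails for c = 0.9 and c = 0.95.
print(round(mean_relative_accuracy(8.8, 10.0), 2))  # prints: 0.8
```

A prediction within 5% of the target scores 1.0, while one that is off by more than 50% scores 0.0, so the metric degrades in steps rather than all at once.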
Results Processing
vsibench_process_results(doc, results) - Processes prediction and computes metrics
- Stores prediction in document
- Applies fuzzy_matching to extract answer
- For MCA question types:
  - Evaluates using exact_match against ground truth
  - Stores accuracy in document
- For NA question types:
  - Converts prediction and ground truth to float
  - Computes MRA:.5:.95:.05 metric
  - Uses worst-case value (0.0) on TypeError
- Raises ValueError for unknown question types
- Returns dictionary with "vsibench_score" key containing updated document
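The branching described above can be sketched as below. To keep the example self-contained, the metric functions are passed in as parameters, which is a choice of this sketch and not a claim about the real signature. The helper names, the document keys `prediction` and `ground_truth`, and the reduced type sets are likewise assumptions.

```python
MCA_TYPES = {"route_planning", "object_rel_distance"}   # illustrative subset
NA_TYPES = {"object_counting", "room_size_estimation"}  # illustrative subset
WORST_CASE = {"accuracy": 0.0, "MRA:.5:.95:.05": 0.0}


def first_token(pred):
    tokens = pred.split()
    return tokens[0].rstrip(".") if tokens else ""


def to_float(value):
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def process_results(doc, results, mca_metric, na_metric):
    """Hypothetical sketch: score one document and return it under one key."""
    doc["prediction"] = results[0]
    answer = first_token(doc["prediction"])
    qtype = doc["question_type"]

    if qtype in MCA_TYPES:
        doc["accuracy"] = mca_metric(answer, doc["ground_truth"])
    elif qtype in NA_TYPES:
        key = "MRA:.5:.95:.05"
        try:
            doc[key] = na_metric(to_float(answer), to_float(doc["ground_truth"]))
        except TypeError:
            # to_float returned None, so the metric's arithmetic failed.
            doc[key] = WORST_CASE[key]
    else:
        raise ValueError(f"Unknown question type: {qtype}")

    return {"vsibench_score": doc}
```

The `TypeError` fallback is what turns an unparseable numeric answer such as "many chairs" into a score of 0.0 instead of an evaluation crash.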
Aggregation
vsibench_aggregate_results(results) - Aggregates results by question type and computes final scores
- Converts results to pandas DataFrame
- Groups by question_type
- For each question type:
  - For MCA types: computes mean accuracy
  - For NA types: computes mean MRA
- Combines three object_rel_direction difficulty levels:
  - Averages easy, medium, hard accuracy into single object_rel_direction_accuracy
  - Removes individual difficulty scores
- Computes overall score as mean of all category scores
- Logs evaluation results
- Returns dictionary with per-category and overall scores
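The aggregation steps can be illustrated without pandas. The sketch below groups with a dictionary instead of a DataFrame, omits logging, and uses assumed output key names (`<question_type>_accuracy` and `<question_type>_MRA:.5:.95:.05`); only `object_rel_direction_accuracy` is named in the description above.

```python
from collections import defaultdict
from statistics import mean

MCA_TYPES = {
    "object_rel_direction_easy", "object_rel_direction_medium",
    "object_rel_direction_hard", "object_rel_distance",
    "route_planning", "obj_appearance_order",
}
NA_TYPES = {
    "object_abs_distance", "object_counting",
    "object_size_estimation", "room_size_estimation",
}
MRA_KEY = "MRA:.5:.95:.05"


def aggregate(results):
    """Hypothetical sketch of per-category and overall scoring."""
    by_type = defaultdict(list)
    for doc in results:
        by_type[doc["question_type"]].append(doc)

    output = {}
    for qtype, docs in by_type.items():
        if qtype in MCA_TYPES:
            output[f"{qtype}_accuracy"] = mean([d["accuracy"] for d in docs])
        elif qtype in NA_TYPES:
            output[f"{qtype}_{MRA_KEY}"] = mean([d[MRA_KEY] for d in docs])
        else:
            raise ValueError(f"Unknown question type: {qtype}")

    # Fold the three difficulty levels into one category score.
    level_keys = [f"object_rel_direction_{lvl}_accuracy" for lvl in ("easy", "medium", "hard")]
    level_scores = [output.pop(k) for k in level_keys if k in output]
    if level_scores:
        output["object_rel_direction_accuracy"] = mean(level_scores)

    # Unweighted mean over categories, not over individual questions.
    output["overall"] = mean(list(output.values()))
    return output
```

Because the overall score averages categories rather than questions, a category with few questions counts as much as one with many.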
Question Types
Multiple-Choice Accuracy (MCA) Questions
MCA_QUESTION_TYPES = [
    "object_rel_direction_easy",
    "object_rel_direction_medium",
    "object_rel_direction_hard",
    "object_rel_distance",
    "route_planning",
    "obj_appearance_order",
]
Numeric Answer (NA) Questions
NA_QUESTION_TYPES = [
    "object_abs_distance",
    "object_counting",
    "object_size_estimation",
    "room_size_estimation",
]
Metrics Configuration
MCA Metrics
METRICS_FOR_MCA = {
    "accuracy": "exact_match",
}
NA Metrics
METRICS_FOR_NA = {
    "MRA:.5:.95:.05": "partial(mean_relative_accuracy, start=.5, end=.95, interval=.05)",
}
MRA (Mean Relative Accuracy) evaluates predictions across confidence thresholds from 50% to 95% in 5% increments.
Worst-Case Values
WORST_CASE_FOR_METRICS = {
    "accuracy": 0.0,
    "MRA:.5:.95:.05": 0.0,
}
Used when prediction parsing fails.
Design Characteristics
- Dual Evaluation Modes: Separate handling for categorical and numeric answers
- Lenient Numeric Metric: MRA rewards approximately correct numeric predictions
- Spatial Reasoning Focus: Tests understanding of 3D space from video
- Difficulty Stratification: Multiple difficulty levels for directional reasoning
- Robust Parsing: Fuzzy matching extracts answers from verbose responses
- Environment Integration: Reads cache configuration from YAML files
- Optional Shuffling: Supports controlled dataset randomization
Dependencies
- os - File system operations
- functools.partial - Partial function application for metrics
- pathlib.Path - Path manipulation
- datasets - HuggingFace datasets library
- numpy - Array operations for MRA computation
- pandas - DataFrame operations for result aggregation
- yaml - Configuration file parsing
- loguru.logger - Logging
Usage Context
This module supports the VSIBench benchmark, which evaluates vision-language models' spatial intelligence in video environments. It tests abilities like directional reasoning (left/right/forward), distance estimation, counting objects, and route planning. The dual metric system handles both categorical (MCA) and continuous (NA) spatial reasoning tasks, with MRA providing nuanced evaluation of numeric predictions.
Cache Configuration
Base cache directory determined from:
- Environment variable: HF_HOME
- Default: ~/.cache/huggingface/
- Task-specific path from YAML: vsibench.yaml
Video files organized as: {cache_dir}/{dataset}/{scene_name}.mp4
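A minimal sketch of how the base cache directory might be resolved. The helper name `resolve_cache_dir` and its `task_cache_name` parameter are hypothetical; in the module the task-specific component comes from vsibench.yaml, and reading that file is left out here.

```python
import os


def resolve_cache_dir(task_cache_name):
    """Hypothetical helper: HF_HOME (or its default) plus the task sub-path."""
    # Fall back to the default location when HF_HOME is not set.
    hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")
    # expanduser turns a leading "~" into the user's home directory.
    base_cache_dir = os.path.expanduser(hf_home)
    return os.path.join(base_cache_dir, task_cache_name)
```

Setting `HF_HOME` before launching an evaluation relocates the video cache without any change to the task configuration.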