Implementation:EvolvingLMMs Lab Lmms eval STARE Spatial Reasoning Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Vision, Evaluation, Spatial_Reasoning
Last Updated	2026-02-14 00:00 GMT

Overview

Utility functions for evaluating vision-language models on the STARE (Spatial Thinking And Reasoning Evaluation) benchmark, which tests spatial reasoning through multiple-choice questions answered with either a chain-of-thought or a direct-answer strategy.

Description

This module provides the evaluation infrastructure for the STARE benchmark and supports both LLM-judge-based and rule-based answer extraction. It builds prompts for spatial reasoning questions with optional chain-of-thought instructions, extracts answers from LaTeX \boxed{...} notation, and validates predictions with sympy-based mathematical equivalence checking. The module integrates with the unified LLM judge server framework and accepts several answer formats, including letters, numbers, and LaTeX expressions.
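The boxed-answer extraction described above can be illustrated with a brace-balanced scan. This is a minimal sketch and not the module's actual code: the function name extract_boxed is hypothetical, and the real extract_full_boxed_content may handle edge cases (such as unbalanced braces) differently.

```python
from typing import List


def extract_boxed(text: str) -> List[str]:
    """Return the contents of every \\boxed{...} in text, honoring nested braces.

    Hypothetical helper for illustration only.
    """
    marker = "\\boxed{"
    found: List[str] = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        i = start + len(marker)
        depth = 1  # we are inside the opening brace of \boxed{
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: assumption is to stop and keep what we have
        found.append(text[start + len(marker): i - 1])
        pos = i
    return found


print(extract_boxed(r"So \boxed{\frac{1}{2}} or \boxed{B}"))
```

Counting brace depth, rather than using a simple regular expression, keeps nested LaTeX such as \frac{1}{2} intact inside the extracted answer.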

Usage

Use this module when evaluating multimodal models on spatial reasoning tasks with the STARE benchmark. Set USE_LMMS_JUDGE=true in metadata to enable GPT-4o-based answer validation, or use fast_extract_answer for rule-based extraction. Both CoT (chain-of-thought) and direct-answer strategies are supported.
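The choice between judge-based and rule-based scoring can be pictured as a simple dispatch on the flag. The sketch below is an assumption for illustration: this page does not show how the module reads USE_LMMS_JUDGE, and judge_enabled, score, and the two scorer callables are hypothetical names.

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], bool]


def judge_enabled(metadata: Dict) -> bool:
    """Interpret the USE_LMMS_JUDGE flag; the accepted spellings are an assumption."""
    raw = metadata.get("USE_LMMS_JUDGE", False)
    return str(raw).strip().lower() in {"true", "1", "yes"}


def score(response: str, answer: str, metadata: Dict,
          rule_based: Scorer, judge: Scorer) -> bool:
    """Route one prediction to the judge or to the rule-based scorer."""
    scorer = judge if judge_enabled(metadata) else rule_based
    return scorer(response, answer)


# Stand-in scorers: exact match for the rule-based path, a stub for the judge.
exact_match = lambda response, answer: response.strip() == answer.strip()
stub_judge = lambda response, answer: answer in response

print(score("B", "B", {}, exact_match, stub_judge))
print(score("The answer is B", "B", {"USE_LMMS_JUDGE": "true"}, exact_match, stub_judge))
```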

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/stare/utils.py

Signature

def stare_doc_to_text(doc: Dict) -> str
def stare_doc_to_visual(doc: Dict) -> List[Image.Image]

def stare_process_results(
    doc: Dict,
    results: List[str]
) -> Dict[str, Dict]

def stare_aggregate_results(results: List[Dict]) -> float

def fast_extract_answer(response: str) -> str
def is_equal(md_ans: str, gt_ans: str) -> bool

def build_query(sample: Dict) -> Dict
def extract_full_boxed_content(s: str) -> List[str]

Import

from lmms_eval.tasks.stare.utils import (
    stare_doc_to_text,
    stare_doc_to_visual,
    stare_process_results,
    stare_aggregate_results,
    fast_extract_answer,
    is_equal
)

I/O Contract

stare_process_results Inputs

Parameter	Type	Description
doc	Dict	Dataset sample with keys: question, answer, images, qid, category
results	List[str]	Model prediction strings

stare_process_results Output

Field	Type	Description
stare_score	Dict	Contains id, query, gt_content, pred, category, is_correct, judge_response (if using LLM judge)

stare_aggregate_results Input

Parameter	Type	Description
results	List[Dict]	List of result dictionaries from stare_process_results

stare_aggregate_results Output

Field	Type	Description
accuracy	float	Overall accuracy score (0.0 to 1.0)
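The aggregation contract above (a list of per-sample records in, one accuracy value out, with a per-category breakdown printed along the way) could be sketched as follows. This is not the module's implementation; aggregate_accuracy is a hypothetical name, and the fallback category "unknown" and the 0.0 result for an empty list are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_accuracy(results: List[Dict]) -> float:
    """Print per-category accuracy and return overall accuracy in [0.0, 1.0]."""
    per_category = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for record in results:
        bucket = per_category[record.get("category", "unknown")]
        bucket[0] += int(bool(record["is_correct"]))
        bucket[1] += 1

    for category, (correct, total) in sorted(per_category.items()):
        print(f"{category}: {correct}/{total} = {correct / total:.4f}")

    total = sum(t for _, t in per_category.values())
    correct = sum(c for c, _ in per_category.values())
    return correct / total if total else 0.0  # assumption: empty input scores 0.0


overall = aggregate_accuracy([
    {"category": "spatial_relations", "is_correct": True},
    {"category": "spatial_relations", "is_correct": False},
    {"category": "object_counting", "is_correct": True},
])
print(f"Overall Accuracy: {overall:.4f}")
```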

Usage Examples

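from PIL import Image
from lmms_eval.tasks.stare.utils import build_query  # not included in the Import list above

pil_image = Image.new("RGB", (224, 224))  # placeholder image for illustration

# Build query with CoT strategy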
doc = {
    "question": "What is the spatial relationship between objects?\n<image>",
    "answer": "A",
    "images": [pil_image],
    "qid": "stare_001",
    "category": "spatial_relations"
}

query_dict = build_query(doc)
print(query_dict["query"])
# Output includes CoT instruction: "Please solve the problem step by step."

# Extract answer from LaTeX boxed notation
response = "The answer is \\boxed{B}. Let me explain..."
extracted = fast_extract_answer(response)
print(extracted)  # "B"

# Check mathematical equivalence
is_equal("42", "42.0")  # True
is_equal("2/3", "0.67")  # True (within tolerance)
is_equal("\\frac{1}{2}", "0.5")  # True (LaTeX to sympy conversion)

# Process results with LLM judge
results = ["The correct answer is B"]
output = stare_process_results(doc, results)
print(output["stare_score"]["is_correct"])  # True if judge confirms

# Aggregate results across dataset
all_results = [
    {"category": "spatial_relations", "is_correct": True},
    {"category": "spatial_relations", "is_correct": False},
    {"category": "object_counting", "is_correct": True}
]
accuracy = stare_aggregate_results(all_results)
print(f"Overall Accuracy: {accuracy:.4f}")
# Also prints per-category breakdown
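The equivalence checks in the examples above rely on sympy inside is_equal. To show the idea without that dependency, the sketch below swaps in a standard-library stand-in based on fractions.Fraction. It is not the module's method: roughly_equal and to_number are hypothetical names, only the \frac{a}{b} LaTeX form is handled, and the tolerance of 1e-2 is an assumed value.

```python
import re
from fractions import Fraction
from typing import Optional


def to_number(s: str) -> Optional[float]:
    """Parse an integer, decimal, a/b, or \\frac{a}{b} string into a float, else None."""
    s = s.strip().replace(" ", "")
    m = re.fullmatch(r"\\frac\{(-?\d+)\}\{(\d+)\}", s)
    if m:
        s = f"{m.group(1)}/{m.group(2)}"
    try:
        return float(Fraction(s))
    except (ValueError, ZeroDivisionError):
        return None


def roughly_equal(pred: str, gt: str, tol: float = 1e-2) -> bool:
    """Case-insensitive string match first, then numeric comparison within tol."""
    if pred.strip().lower() == gt.strip().lower():
        return True
    a, b = to_number(pred), to_number(gt)
    if a is None or b is None:
        return False
    return abs(a - b) <= tol  # tol is an assumed value, not taken from the module


print(roughly_equal("42", "42.0"))
print(roughly_equal(r"\frac{1}{2}", "0.5"))
print(roughly_equal("A", "B"))
```

Comparing strings first keeps letter answers such as "A" cheap to check, while the numeric path only runs when both sides parse as numbers.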
