Implementation:EvolvingLMMs Lab Lmms eval STARE Spatial Reasoning Utils

From Leeroopedia
Knowledge Sources
Domains: Vision, Evaluation, Spatial_Reasoning
Last Updated: 2026-02-14 00:00 GMT

Overview

Utility functions for evaluating vision-language models on the STARE (Spatial Thinking And Reasoning Evaluation) benchmark, which tests spatial reasoning through multiple-choice questions answered with either a chain-of-thought or a direct-answer strategy.

Description

This module provides the evaluation infrastructure for the STARE benchmark and supports both LLM-judge-based and rule-based answer extraction. It builds prompts for spatial reasoning questions with optional chain-of-thought instructions, extracts answers from LaTeX \boxed{...} notation, and validates predictions with sympy-based mathematical equivalence checking. The module integrates with the unified LLM judge server framework and accepts several answer formats, including letters, numbers, and LaTeX expressions.
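To make the boxed-answer step concrete, here is a minimal sketch of pulling the contents out of every \boxed{...} span while respecting nested braces. It is an illustration only, not the module's own code: the function name `extract_boxed` is hypothetical, and the real `extract_full_boxed_content` may handle edge cases differently.

```python
from typing import List


def extract_boxed(text: str) -> List[str]:
    """Return the contents of every \\boxed{...} span, honoring nested braces.

    Hypothetical helper for illustration; not part of lmms_eval.
    """
    marker = "\\boxed{"
    found: List[str] = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        i = start + len(marker)
        depth = 1  # we are just inside the opening brace of \boxed{
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced braces: stop rather than guess
        found.append(text[start + len(marker): i - 1])
        pos = i
    return found


print(extract_boxed("The answer is \\boxed{B}. Let me explain..."))
```

Counting brace depth, rather than using a single regular expression, is what lets an answer such as \boxed{\frac{1}{2}} come back intact.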

Usage

Use this module when evaluating multimodal models on spatial reasoning tasks with the STARE benchmark. Set USE_LMMS_JUDGE=true in metadata to enable GPT-4o-based answer validation, or use fast_extract_answer for rule-based extraction. Both CoT (chain-of-thought) and direct answering strategies are supported.
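The choice between the judge path and the rule-based path can be pictured with the sketch below. Everything in it is an assumption made for illustration: the helper names `judge_enabled` and `score_prediction`, the way the flag is read from a plain dictionary, and the `judge` callable are all hypothetical, and the module may read USE_LMMS_JUDGE in a different way.

```python
from typing import Callable, Dict, Optional


def judge_enabled(metadata: Dict[str, object]) -> bool:
    """Treat the flag as on only when its value is the string 'true' (any case)."""
    return str(metadata.get("USE_LMMS_JUDGE", "false")).strip().lower() == "true"


def score_prediction(
    prediction: str,
    ground_truth: str,
    metadata: Dict[str, object],
    judge: Optional[Callable[[str, str], bool]] = None,
) -> bool:
    """Use the judge callable when enabled, otherwise fall back to a string match."""
    if judge_enabled(metadata) and judge is not None:
        return judge(prediction, ground_truth)
    # Rule-based fallback: case-insensitive comparison of the extracted answer
    return prediction.strip().upper() == ground_truth.strip().upper()


# Hypothetical stand-in for an LLM judge: accepts any response mentioning the answer
def fake_judge(pred: str, gt: str) -> bool:
    return gt in pred


print(score_prediction("The correct answer is B", "B", {"USE_LMMS_JUDGE": "true"}, fake_judge))
print(score_prediction("b", "B", {}))
```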

Code Reference

Source Location

Signature

def stare_doc_to_text(doc: Dict) -> str
def stare_doc_to_visual(doc: Dict) -> List[Image.Image]

def stare_process_results(
    doc: Dict,
    results: List[str]
) -> Dict[str, Dict]

def stare_aggregate_results(results: List[Dict]) -> float

def fast_extract_answer(response: str) -> str
def is_equal(md_ans: str, gt_ans: str) -> bool

def build_query(sample: Dict) -> Dict
def extract_full_boxed_content(s: str) -> List[str]

Import

from lmms_eval.tasks.stare.utils import (
    stare_doc_to_text,
    stare_doc_to_visual,
    stare_process_results,
    stare_aggregate_results,
    fast_extract_answer,
    is_equal
)

I/O Contract

stare_process_results Inputs

Parameter | Type | Description
doc | Dict | Dataset sample with keys: question, answer, images, qid, category
results | List[str] | Model prediction strings

stare_process_results Output

Field | Type | Description
stare_score | Dict | Contains id, query, gt_content, pred, category, is_correct, and judge_response (only when the LLM judge is used)

stare_aggregate_results Input

Parameter | Type | Description
results | List[Dict] | List of result dictionaries from stare_process_results

stare_aggregate_results Output

Field | Type | Description
accuracy | float | Overall accuracy score (0.0 to 1.0)
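As a rough picture of the aggregation contract above, the sketch below computes overall accuracy and a per-category breakdown from records carrying `category` and `is_correct`. The function name `aggregate` and its tuple return value are assumptions for illustration; the real `stare_aggregate_results` returns a single float and may compute or report the breakdown differently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate(results: List[Dict]) -> Tuple[float, Dict[str, float]]:
    """Return (overall accuracy, accuracy per category) for a list of result records."""
    if not results:
        return 0.0, {}
    per_category: Dict[str, List[bool]] = defaultdict(list)
    for record in results:
        per_category[record["category"]].append(bool(record["is_correct"]))
    overall = sum(bool(r["is_correct"]) for r in results) / len(results)
    by_category = {cat: sum(flags) / len(flags) for cat, flags in per_category.items()}
    return overall, by_category


sample = [
    {"category": "spatial_relations", "is_correct": True},
    {"category": "spatial_relations", "is_correct": False},
    {"category": "object_counting", "is_correct": True},
]
overall, by_category = aggregate(sample)
print(f"Overall Accuracy: {overall:.4f}")
for cat, acc in by_category.items():
    print(f"  {cat}: {acc:.4f}")
```

Returning 0.0 for an empty list is a design choice of this sketch that avoids a division by zero; whether the module does the same is not stated in the source.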

Usage Examples

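from PIL import Image

from lmms_eval.tasks.stare.utils import build_query

# Placeholder image so the example is self-contained; substitute a real benchmark image
pil_image = Image.new("RGB", (224, 224))

# Build query with CoT strategy
doc = {
    "question": "What is the spatial relationship between objects?\n<image>",
    "answer": "A",
    "images": [pil_image],
    "qid": "stare_001",
    "category": "spatial_relations"
}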

query_dict = build_query(doc)
print(query_dict["query"])
# Output includes CoT instruction: "Please solve the problem step by step."

# Extract answer from LaTeX boxed notation
response = "The answer is \\boxed{B}. Let me explain..."
extracted = fast_extract_answer(response)
print(extracted)  # "B"

# Check mathematical equivalence
is_equal("42", "42.0")  # True
is_equal("2/3", "0.67")  # True (within tolerance)
is_equal("\\frac{1}{2}", "0.5")  # True (LaTeX to sympy conversion)

# Process results with LLM judge
results = ["The correct answer is B"]
output = stare_process_results(doc, results)
print(output["stare_score"]["is_correct"])  # True if judge confirms

# Aggregate results across dataset
all_results = [
    {"category": "spatial_relations", "is_correct": True},
    {"category": "spatial_relations", "is_correct": False},
    {"category": "object_counting", "is_correct": True}
]
accuracy = stare_aggregate_results(all_results)
print(f"Overall Accuracy: {accuracy:.4f}")
# Also prints per-category breakdown
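
The equivalence checks in the examples above can be approximated without sympy, which helps when reasoning about what "equal within tolerance" means. The sketch below is a deliberately simplified stand-in that uses the standard-library `fractions` module instead of sympy. The names `to_number` and `roughly_equal`, the 1e-2 tolerance, and the restriction to integer \frac{a}{b} forms are all assumptions; the real `is_equal` may use different rules and a different tolerance.

```python
import re
from fractions import Fraction
from typing import Optional

# Matches only simple integer fractions such as \frac{1}{2} (assumption of this sketch)
_FRAC = re.compile(r"\\frac\{(-?\d+)\}\{(-?\d+)\}")


def to_number(text: str) -> Optional[Fraction]:
    """Parse '42', '42.0', '2/3' or '\\frac{1}{2}' into an exact Fraction, else None."""
    s = text.strip().strip("$")
    match = _FRAC.fullmatch(s)
    try:
        if match:
            return Fraction(int(match.group(1)), int(match.group(2)))
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def roughly_equal(pred: str, truth: str, tol: float = 1e-2) -> bool:
    """Compare numerically within tol when both sides parse, otherwise as text."""
    a, b = to_number(pred), to_number(truth)
    if a is None or b is None:
        # Non-numeric answers (for example option letters): case-insensitive match
        return pred.strip().lower() == truth.strip().lower()
    return abs(a - b) <= Fraction(str(tol))


print(roughly_equal("42", "42.0"))
print(roughly_equal("2/3", "0.67"))
print(roughly_equal("\\frac{1}{2}", "0.5"))
```

Falling back to a text comparison keeps option letters such as "B" working through the same entry point as numeric answers.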
