Overview
Utility functions for evaluating vision-language models on the STARE (Spatial Thinking And Reasoning Evaluation) benchmark, which tests spatial reasoning through multiple-choice questions answered with either a chain-of-thought or a direct-answer strategy.
Description
This module provides the evaluation infrastructure for the STARE benchmark and supports both LLM-judge-based and rule-based answer extraction. It processes spatial reasoning questions with optional chain-of-thought prompting, extracts answers from LaTeX boxed notation, and validates predictions using sympy-based mathematical equivalence checking. The module integrates with the unified LLM judge server framework and accepts multiple answer formats, including letters, numbers, and LaTeX expressions.
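The boxed-notation extraction described above can be illustrated with a small brace-matching scan. This is a minimal sketch, not the module's implementation: the helper name `extract_boxed` is hypothetical, and the real `fast_extract_answer` and `extract_full_boxed_content` may handle more cases or behave differently on malformed input.

```python
from typing import List


def extract_boxed(text: str) -> List[str]:
    """Return the contents of every \\boxed{...} in text, honoring nested braces.

    Hypothetical helper for illustration only.
    """
    marker = "\\boxed{"
    contents: List[str] = []
    pos = 0
    while True:
        start = text.find(marker, pos)
        if start == -1:
            break
        begin = start + len(marker)
        i = begin
        depth = 1  # we are inside the opening brace of \boxed{
        while i < len(text) and depth > 0:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            # Unbalanced braces: stop rather than guess where the box ends.
            break
        contents.append(text[begin:i - 1])
        pos = i
    return contents


print(extract_boxed("The answer is \\boxed{B}. Let me explain..."))
print(extract_boxed("So x = \\boxed{\\frac{1}{2}} exactly."))
```

Counting brace depth, rather than matching up to the first `}`, is what keeps nested expressions such as `\frac{1}{2}` intact.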
Usage
Use this module when evaluating multimodal models on spatial reasoning tasks with the STARE benchmark. Set USE_LMMS_JUDGE=true in metadata to enable GPT-4o-based answer validation, or use fast_extract_answer for rule-based extraction. Both CoT (chain-of-thought) and direct answering strategies are supported.
Code Reference
Source Location
Signature
def stare_doc_to_text(doc: Dict) -> str
def stare_doc_to_visual(doc: Dict) -> List[Image.Image]
def stare_process_results(
    doc: Dict,
    results: List[str]
) -> Dict[str, Dict]
def stare_aggregate_results(results: List[Dict]) -> float
def fast_extract_answer(response: str) -> str
def is_equal(md_ans: str, gt_ans: str) -> bool
def build_query(sample: Dict) -> Dict
def extract_full_boxed_content(s: str) -> List[str]
Import
from lmms_eval.tasks.stare.utils import (
    stare_doc_to_text,
    stare_doc_to_visual,
    stare_process_results,
    stare_aggregate_results,
    fast_extract_answer,
    is_equal,
    build_query,  # used in the usage examples below
)
I/O Contract
stare_process_results Inputs
| Parameter | Type | Description |
|---|---|---|
| doc | Dict | Dataset sample with keys: question, answer, images, qid, category |
| results | List[str] | Model prediction strings |
stare_process_results Output
| Field | Type | Description |
|---|---|---|
| stare_score | Dict | Contains id, query, gt_content, pred, category, is_correct, judge_response (if using LLM judge) |
stare_aggregate_results Input
| Parameter | Type | Description |
|---|---|---|
| results | List[Dict] | List of result dictionaries from stare_process_results |
stare_aggregate_results Output
| Field | Type | Description |
|---|---|---|
| accuracy | float | Overall accuracy score (0.0 to 1.0) |
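The aggregation contract above (a list of per-sample dictionaries in, one accuracy value out, with a per-category breakdown) can be sketched as follows. The function name `aggregate_accuracy` is hypothetical, and it assumes each result carries only the `category` and `is_correct` keys; the actual `stare_aggregate_results` returns only the overall float and prints the breakdown.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def aggregate_accuracy(results: List[Dict]) -> Tuple[float, Dict[str, float]]:
    """Compute overall and per-category accuracy (illustrative sketch)."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        correct[r["category"]] += int(bool(r["is_correct"]))
    n = sum(totals.values())
    # Guard against an empty result list to avoid division by zero.
    overall = sum(correct.values()) / n if n else 0.0
    per_category = {c: correct[c] / totals[c] for c in totals}
    return overall, per_category


overall, per_category = aggregate_accuracy([
    {"category": "spatial_relations", "is_correct": True},
    {"category": "spatial_relations", "is_correct": False},
    {"category": "object_counting", "is_correct": True},
])
print(f"Overall Accuracy: {overall:.4f}")  # Overall Accuracy: 0.6667
for category, acc in per_category.items():
    print(f"{category}: {acc:.4f}")
```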
Usage Examples
# Build query with CoT strategy
doc = {
    "question": "What is the spatial relationship between objects?\n<image>",
    "answer": "A",
    "images": [pil_image],
    "qid": "stare_001",
    "category": "spatial_relations"
}
query_dict = build_query(doc)
print(query_dict["query"])
# Output includes CoT instruction: "Please solve the problem step by step."
# Extract answer from LaTeX boxed notation
response = "The answer is \\boxed{B}. Let me explain..."
extracted = fast_extract_answer(response)
print(extracted) # "B"
# Check mathematical equivalence
is_equal("42", "42.0") # True
is_equal("2/3", "0.67") # True (within tolerance)
is_equal("\\frac{1}{2}", "0.5") # True (LaTeX to sympy conversion)
# Process results with LLM judge
results = ["The correct answer is B"]
output = stare_process_results(doc, results)
print(output["stare_score"]["is_correct"]) # True if judge confirms
# Aggregate results across dataset
all_results = [
    {"category": "spatial_relations", "is_correct": True},
    {"category": "spatial_relations", "is_correct": False},
    {"category": "object_counting", "is_correct": True}
]
accuracy = stare_aggregate_results(all_results)
print(f"Overall Accuracy: {accuracy:.4f}")
# Also prints per-category breakdown
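The equivalence checks in the examples can be approximated without sympy for simple cases. The sketch below is a simplified stand-in, not the module's `is_equal`: it swaps sympy for the standard-library `fractions.Fraction`, handles only plain numbers and integer `\frac{a}{b}` forms, and uses an assumed tolerance of 0.01 (the module's actual tolerance is not stated here). The names `to_number` and `roughly_equal` are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional

# Matches only simple integer fractions such as \frac{1}{2}.
_FRAC = re.compile(r"^\\frac\{(-?\d+)\}\{(\d+)\}$")


def to_number(s: str) -> Optional[Fraction]:
    """Parse '42', '42.0', '2/3' or '\\frac{1}{2}' into a Fraction, else None."""
    s = s.strip()
    m = _FRAC.match(s)
    if m:
        s = f"{m.group(1)}/{m.group(2)}"
    try:
        return Fraction(s)
    except (ValueError, ZeroDivisionError):
        return None


def roughly_equal(pred: str, gt: str, tol: float = 1e-2) -> bool:
    """Compare numerically within tol (assumed value), else by exact string."""
    a, b = to_number(pred), to_number(gt)
    if a is not None and b is not None:
        return abs(a - b) <= tol
    # Non-numeric answers (e.g. option letters) fall back to string equality.
    return pred.strip() == gt.strip()


print(roughly_equal("42", "42.0"))            # True
print(roughly_equal("2/3", "0.67"))           # True
print(roughly_equal("\\frac{1}{2}", "0.5"))   # True
print(roughly_equal("A", "B"))                # False
```

A symbolic library such as sympy covers far more input forms (radicals, symbolic expressions); this version only shows the parse-then-compare structure.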
Related Pages