Overview
Utility functions for evaluating vision-language models on the Uni-MMMU benchmark, which tests visual puzzle solving across jigsaw completion, maze navigation, sliding puzzles, and geometry problems.
Description
This module provides specialized evaluation functions for four distinct visual puzzle types in Uni-MMMU: (1) Jigsaw puzzles requiring patch selection based on seam continuity and semantics, (2) Maze solving with path finding from start to goal, (3) Sliding puzzles requiring tile movement sequences, and (4) Geometry problems with auxiliary line construction and step-by-step solutions. Each puzzle type has its own prompt template, answer extraction logic, and evaluation metrics: exact match for every puzzle type, plus frame accuracy for the sequential maze and sliding puzzles.
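The two metrics can be illustrated with a minimal sketch. It assumes frame accuracy is a position-by-position comparison divided by the ground-truth length; how the module itself treats length mismatches is not stated here, and `score_moves` is a hypothetical name, not part of the module.

```python
from typing import Dict, List


def score_moves(predicted: List[str], truth: List[str]) -> Dict[str, float]:
    """Hypothetical scorer; not the module's own implementation."""
    # Normalize case and whitespace so "Up " and "up" compare equal (an assumption).
    pred = [m.strip().lower() for m in predicted]
    gold = [m.strip().lower() for m in truth]
    if not gold:
        return {"exact_match": 0.0, "frame_accuracy": 0.0}
    # Count positions where the predicted move equals the ground-truth move.
    hits = sum(1 for p, g in zip(pred, gold) if p == g)
    return {
        "exact_match": 1.0 if pred == gold else 0.0,
        # Assumed denominator: ground-truth length, so missing moves count as wrong.
        "frame_accuracy": hits / len(gold),
    }


print(score_moves(["down", "right", "down", "left"], ["down", "right", "up", "left"]))
# {'exact_match': 0.0, 'frame_accuracy': 0.75}
```

With this denominator, extra predicted moves beyond the ground-truth length do not lower frame accuracy, but they do prevent an exact match.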
Usage
Use this module when evaluating multimodal models on visual puzzle and spatial reasoning tasks. Each puzzle type has dedicated doc_to_text, doc_to_visual, and process_results functions. Jigsaw puzzles expect a JSON object with a choice field (0 or 1), while maze and sliding puzzles expect the move sequence as a JSON array. Geometry problems use natural language extraction with normalization.
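As a rough illustration of the move-sequence format, the sketch below pulls a JSON array out of a response that wraps its answer in `<ANSWER_JSON>` tags. The helper name `extract_moves`, the choice to take the last tagged block, and the lowercasing step are all assumptions made for this example; the module's own extraction logic may differ.

```python
import json
import re
from typing import List

# Matches the content between <ANSWER_JSON> tags, across newlines.
_TAG_RE = re.compile(r"<ANSWER_JSON>(.*?)</ANSWER_JSON>", re.DOTALL)


def extract_moves(response: str) -> List[str]:
    """Hypothetical helper: return the last tagged move list, or [] if none parses."""
    matches = _TAG_RE.findall(response)
    if not matches:
        return []
    try:
        parsed = json.loads(matches[-1].strip())
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Lowercase and trim each move so comparisons are case-insensitive (an assumption).
    return [str(m).strip().lower() for m in parsed]


print(extract_moves('Let me solve... <ANSWER_JSON>["right", "down"]</ANSWER_JSON>'))
# ['right', 'down']
```

Returning an empty list on any failure keeps scoring simple: an unparseable response just scores zero.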
Code Reference
Source Location
Signature
# Jigsaw puzzle functions
def jigsaw_doc_to_visual(doc: Dict) -> List[Image.Image]
def jigsaw_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def jigsaw_process_results(doc: Dict, results: List[str]) -> Dict[str, float]
# Maze solving functions
def maze_doc_to_visual(doc: Dict) -> List[Image.Image]
def maze_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def maze_process_results(doc: Dict, results: List[str]) -> Dict[str, float]
# Sliding puzzle functions
def sliding_doc_to_visual(doc: Dict) -> List[Image.Image]
def sliding_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def sliding_process_results(doc: Dict, results: List[str]) -> Dict[str, float]
# Geometry problem functions
def geometry_doc_to_visual(doc: Dict) -> List[Image.Image]
def geometry_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def geometry_process_results(doc: Dict, results: List[str]) -> Dict[str, float]
# Helper functions
def _find_json_object(text: str) -> Optional[str]
def _parse_json_list(raw: str) -> List[Any]
def _normalize_geometry_answer(text: str) -> str
def _extract_final_answer(text: str) -> str
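The signature of `_find_json_object` suggests it locates a JSON object embedded in free-form model output. One common way to do that is brace matching, sketched below under that assumption. `find_first_json_object` is a hypothetical stand-in written for illustration; it is not the module's implementation.

```python
from typing import Optional


def find_first_json_object(text: str) -> Optional[str]:
    """Hypothetical stand-in: return the first balanced {...} substring, or None."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            # Inside a JSON string, braces are literal; only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return text[start : i + 1]
    # Braces never balanced.
    return None
```

Tracking string state matters because a rationale field may itself contain a `}` character.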
Import
from lmms_eval.tasks.uni_mmmu.utils import (
    jigsaw_doc_to_text,
    jigsaw_process_results,
    maze_doc_to_text,
    maze_process_results,
    sliding_doc_to_text,
    sliding_process_results,
    geometry_doc_to_text,
    geometry_process_results,
)
I/O Contract
Jigsaw Puzzle I/O
| Input | Type | Description |
|---|---|---|
| doc["ref_image"] | Image | 2x2 reference with bottom-right hidden |
| doc["cand0_image"] | Image | Candidate patch 0 |
| doc["cand1_image"] | Image | Candidate patch 1 |
| doc["label"] | int | Ground truth choice (0 or 1) |

| Output | Type | Description |
|---|---|---|
| exact_match | float | 1.0 if predicted choice matches label, else 0.0 |
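The jigsaw contract can be sketched as follows. The example assumes the response wraps a JSON object in `<FINAL_ANSWER_JSON>` tags, as in the usage examples on this page; `score_jigsaw` is a hypothetical name, and treating every malformed response as incorrect is an assumption.

```python
import json
import re
from typing import Dict

_FINAL_RE = re.compile(r"<FINAL_ANSWER_JSON>(.*?)</FINAL_ANSWER_JSON>", re.DOTALL)


def score_jigsaw(response: str, label: int) -> Dict[str, float]:
    """Hypothetical scorer: 1.0 only when a valid choice (0 or 1) equals the label."""
    match = _FINAL_RE.search(response)
    if match is None:
        return {"exact_match": 0.0}
    try:
        payload = json.loads(match.group(1).strip())
    except json.JSONDecodeError:
        return {"exact_match": 0.0}
    choice = payload.get("choice") if isinstance(payload, dict) else None
    # Reject booleans and any value other than 0 or 1.
    if isinstance(choice, bool) or choice not in (0, 1):
        return {"exact_match": 0.0}
    return {"exact_match": 1.0 if choice == label else 0.0}


reply = '<FINAL_ANSWER_JSON>\n{"choice": 1, "rationale": "seam lines up"}\n</FINAL_ANSWER_JSON>'
print(score_jigsaw(reply, label=1))  # {'exact_match': 1.0}
```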
Maze/Sliding Puzzle I/O
| Input | Type | Description |
|---|---|---|
| doc["initial_image"] | Image | Puzzle start state visualization |
| doc["steps"] (maze) | str/List | Ground truth move sequence (JSON array) |
| doc["steps_words"] (sliding) | str/List | Ground truth move words (JSON array) |

| Output | Type | Description |
|---|---|---|
| exact_match | float | 1.0 if full sequence matches, else 0.0 |
| frame_accuracy | float | Proportion of moves correct (0.0 to 1.0) |
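Because the ground truth may arrive either as a JSON-encoded string or as an already-parsed list, a scorer needs to coerce both into one shape before comparing. The sketch below shows one way to do that; `coerce_move_list` is a hypothetical helper, and returning an empty list for unparseable input is an assumption.

```python
import json
from typing import Any, List


def coerce_move_list(raw: Any) -> List[str]:
    """Hypothetical helper: accept a list or a JSON-array string, return a list of strings."""
    if isinstance(raw, str):
        try:
            raw = json.loads(raw)
        except json.JSONDecodeError:
            return []
    if not isinstance(raw, list):
        return []
    return [str(item) for item in raw]


print(coerce_move_list('["right", "down"]'))  # ['right', 'down']
print(coerce_move_list(["up", "left"]))       # ['up', 'left']
```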
Geometry Problem I/O
| Input | Type | Description |
|---|---|---|
| doc["image"] | Image | Geometry diagram |
| doc["question"] / doc["problem"] | str | Problem statement |
| doc["answer"] / doc["solution_en"] | str | Ground truth answer |

| Output | Type | Description |
|---|---|---|
| exact_match | float | 1.0 if normalized answers match, else 0.0 |
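The normalization rules are not spelled out on this page, so the sketch below only illustrates the general idea of comparing answers after stripping phrasing and units. `normalize_answer` and every rule in it are assumptions; it expects an already-extracted answer string, not a full model response.

```python
import re


def normalize_answer(text: str) -> str:
    """Hypothetical normalizer; the module's real rules may differ."""
    s = text.strip().lower()
    # Drop a leading "the answer is" phrase if present.
    s = re.sub(r"^(the\s+)?(final\s+)?answer\s+is\s*:?\s*", "", s)
    # Remove degree units written as a word or a symbol.
    s = re.sub(r"\s*(degrees?|°)", "", s)
    # Strip trailing punctuation and collapse internal whitespace.
    s = s.rstrip(".!")
    return re.sub(r"\s+", " ", s).strip()


print(normalize_answer("The answer is 60°") == normalize_answer("60 degrees"))  # True
```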
Usage Examples
# Jigsaw puzzle evaluation
jigsaw_doc = {
    "ref_image": ref_img,  # PIL images loaded elsewhere
    "cand0_image": cand0_img,
    "cand1_image": cand1_img,
    "label": 1,
}
prompt = jigsaw_doc_to_text(jigsaw_doc)
# Example model response, wrapped in the expected answer tags
model_response = '<FINAL_ANSWER_JSON>\n{"choice": 1, "rationale": "..."}\n</FINAL_ANSWER_JSON>'
result = jigsaw_process_results(jigsaw_doc, [model_response])
print(result["exact_match"])  # 1.0 (correct)
# Maze solving evaluation
maze_doc = {
    "initial_image": maze_img,  # PIL image loaded elsewhere
    "steps": '["right", "down", "right", "up"]',
}
prompt = maze_doc_to_text(maze_doc)
# Example model response: free-form reasoning followed by the tagged answer
model_response = 'Let me solve... <ANSWER_JSON>["right", "down", "right", "up"]</ANSWER_JSON>'
result = maze_process_results(maze_doc, [model_response])
print(result["exact_match"])  # 1.0
print(result["frame_accuracy"])  # 1.0
# Sliding puzzle with partial correctness
sliding_doc = {
    "initial_image": sliding_img,  # PIL image loaded elsewhere
    "steps_words": '["down", "right", "up", "left"]',
}
model_output = '<ANSWER_JSON>["down", "right", "down", "left"]</ANSWER_JSON>'
result = sliding_process_results(sliding_doc, [model_output])
print(result["exact_match"])  # 0.0 (not fully correct)
print(result["frame_accuracy"])  # 0.75 (3 out of 4 moves correct)
# Geometry problem evaluation
geom_doc = {
    "image": geom_diagram,  # PIL image loaded elsewhere
    "question": "Find the angle ABC if angle BAC is 30 degrees",
    "answer": "60 degrees",
}
prompt = geometry_doc_to_text(geom_doc)
# Example model response in natural language
model_response = "Using auxiliary lines... The answer is 60°"
result = geometry_process_results(geom_doc, [model_response])
print(result["exact_match"])  # 1.0 (normalized: "60" == "60")
# The helpers are private, so they must be imported explicitly
from lmms_eval.tasks.uni_mmmu.utils import _normalize_geometry_answer, _parse_json_list

# Helper: Extract JSON from complex response
response = 'Let me think... The answer is <ANSWER_JSON>["up", "down"]</ANSWER_JSON> because...'
moves = _parse_json_list(response.split("<ANSWER_JSON>")[1].split("</ANSWER_JSON>")[0])
print(moves)  # ["up", "down"]
# Helper: Normalize geometry answers
normalized = _normalize_geometry_answer("The answer is 45 degrees")
print(normalized)  # "45"