Implementation:EvolvingLMMs Lab Lmms eval Uni MMMU Visual Puzzle Utils

From Leeroopedia
Revision as of 12:32, 16 February 2026 by Admin (auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_Uni_MMMU_Visual_Puzzle_Utils.md)
Knowledge Sources
Domains: Vision, Evaluation, Puzzles, Geometry
Last Updated: 2026-02-14 00:00 GMT

Overview

Utility functions for evaluating vision-language models on the Uni-MMMU benchmark, which tests visual puzzle solving across jigsaw completion, maze navigation, sliding puzzles, and geometry problems.

Description

This module provides specialized evaluation functions for four distinct visual puzzle types in Uni-MMMU: (1) jigsaw puzzles, which require selecting the patch that best continues the seams and semantics of the image; (2) maze solving, which requires finding a path from start to goal; (3) sliding puzzles, which require a sequence of tile moves; and (4) geometry problems, which involve auxiliary line construction and step-by-step solutions. Each puzzle type has its own prompt template, answer extraction logic, and evaluation metrics: exact match for all four types, plus frame accuracy for the sequential maze and sliding puzzles.

Usage

Use this module when evaluating multimodal models on visual puzzle and spatial reasoning tasks. Each puzzle type has dedicated doc_to_visual, doc_to_text, and process_results functions, prefixed with the puzzle name (jigsaw_, maze_, sliding_, geometry_). Jigsaw puzzles expect a JSON object with a choice field (0 or 1), while maze and sliding puzzles expect the move sequence as a JSON array. Geometry answers are extracted from free-form text and normalized before comparison.
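To make the expected output format concrete, here is a minimal sketch of pulling a move list out of a tagged response. It is not the module's own extraction code: the helper name extract_moves is hypothetical, the <ANSWER_JSON> tag is taken from the usage examples on this page, and the choice to use the last tagged block and lowercase the moves is an assumption.

```python
import json
import re
from typing import List

# Hypothetical helper; the tag name comes from the usage examples on this page.
_ANSWER_RE = re.compile(r"<ANSWER_JSON>(.*?)</ANSWER_JSON>", re.DOTALL)


def extract_moves(response: str) -> List[str]:
    """Return the move list from the last <ANSWER_JSON> block, or [] if none parses."""
    matches = _ANSWER_RE.findall(response)
    if not matches:
        return []
    try:
        parsed = json.loads(matches[-1])
    except json.JSONDecodeError:
        return []
    if not isinstance(parsed, list):
        return []
    # Assumption: moves are compared case-insensitively, so normalize here.
    return [str(move).strip().lower() for move in parsed]


print(extract_moves('Let me solve... <ANSWER_JSON>["right", "Down"]</ANSWER_JSON>'))
# ['right', 'down']
```

Returning an empty list on any parse failure means a malformed response simply scores as wrong instead of raising during evaluation.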

Code Reference

Source Location

Signature

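# Types used in the signatures below
from typing import Any, Dict, List, Optional
from PIL import Image

# Jigsaw puzzle functions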
def jigsaw_doc_to_visual(doc: Dict) -> List[Image.Image]
def jigsaw_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def jigsaw_process_results(doc: Dict, results: List[str]) -> Dict[str, float]

# Maze solving functions
def maze_doc_to_visual(doc: Dict) -> List[Image.Image]
def maze_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def maze_process_results(doc: Dict, results: List[str]) -> Dict[str, float]

# Sliding puzzle functions
def sliding_doc_to_visual(doc: Dict) -> List[Image.Image]
def sliding_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def sliding_process_results(doc: Dict, results: List[str]) -> Dict[str, float]

# Geometry problem functions
def geometry_doc_to_visual(doc: Dict) -> List[Image.Image]
def geometry_doc_to_text(doc: Dict, lmms_eval_specific_kwargs: Optional[Dict] = None) -> str
def geometry_process_results(doc: Dict, results: List[str]) -> Dict[str, float]

# Helper functions
def _find_json_object(text: str) -> Optional[str]
def _parse_json_list(raw: str) -> List[Any]
def _normalize_geometry_answer(text: str) -> str
def _extract_final_answer(text: str) -> str
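The helper names suggest that JSON is located inside free-form model output before it is parsed. One common way to do that is a brace-balanced scan, sketched below. This is an illustration under assumptions, not the actual body of _find_json_object: the function name find_first_json_object and its exact behavior are hypothetical.

```python
import json
from typing import Optional


def find_first_json_object(text: str) -> Optional[str]:
    """Return the first balanced {...} substring that parses as JSON, else None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escaped = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                # Inside a string literal, braces do not count toward nesting.
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    candidate = text[start : i + 1]
                    try:
                        json.loads(candidate)
                        return candidate
                    except json.JSONDecodeError:
                        break  # balanced but invalid; try the next "{"
        start = text.find("{", start + 1)
    return None
```

Tracking string state matters because a rationale field may itself contain a closing brace, which a naive search for the first "}" would cut short.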

Import

from lmms_eval.tasks.uni_mmmu.utils import (
    jigsaw_doc_to_text,
    jigsaw_process_results,
    maze_doc_to_text,
    maze_process_results,
    sliding_doc_to_text,
    sliding_process_results,
    geometry_doc_to_text,
    geometry_process_results
)

I/O Contract

Jigsaw Puzzle I/O

| Input | Type | Description |
|---|---|---|
| doc["ref_image"] | Image | 2x2 reference with bottom-right hidden |
| doc["cand0_image"] | Image | Candidate patch 0 |
| doc["cand1_image"] | Image | Candidate patch 1 |
| doc["label"] | int | Ground truth choice (0 or 1) |

| Output | Type | Description |
|---|---|---|
| exact_match | float | 1.0 if predicted choice matches label, else 0.0 |
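The jigsaw contract above can be sketched as a small scoring routine. This is a hedged illustration only: parse_choice and jigsaw_exact_match are hypothetical names, the <FINAL_ANSWER_JSON> tag is taken from the usage examples on this page, and treating any unparseable or out-of-range choice as incorrect is an assumption.

```python
import json
import re
from typing import Dict, Optional

# Tag name taken from the usage examples; the real module may accept other forms.
_FINAL_RE = re.compile(
    r"<FINAL_ANSWER_JSON>\s*(\{.*?\})\s*</FINAL_ANSWER_JSON>", re.DOTALL
)


def parse_choice(response: str) -> Optional[int]:
    """Return the predicted choice (0 or 1), or None if it cannot be read."""
    match = _FINAL_RE.search(response)
    if match is None:
        return None
    try:
        choice = json.loads(match.group(1)).get("choice")
    except json.JSONDecodeError:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(choice, int) and not isinstance(choice, bool) and choice in (0, 1):
        return choice
    return None


def jigsaw_exact_match(label: int, response: str) -> Dict[str, float]:
    """Score one response: 1.0 when the parsed choice equals the label."""
    pred = parse_choice(response)
    return {"exact_match": 1.0 if pred is not None and pred == int(label) else 0.0}


reply = '<FINAL_ANSWER_JSON>\n{"choice": 1, "rationale": "seam lines up"}\n</FINAL_ANSWER_JSON>'
print(jigsaw_exact_match(1, reply))  # {'exact_match': 1.0}
```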

Maze/Sliding Puzzle I/O

| Input | Type | Description |
|---|---|---|
| doc["initial_image"] | Image | Puzzle start state visualization |
| doc["steps"] (maze) | str or List | Ground truth move sequence (JSON array) |
| doc["steps_words"] (sliding) | str or List | Ground truth move words (JSON array) |

| Output | Type | Description |
|---|---|---|
| exact_match | float | 1.0 if full sequence matches, else 0.0 |
| frame_accuracy | float | Proportion of moves correct (0.0 to 1.0) |
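The two sequence metrics can be sketched as a position-wise comparison. The function name score_sequence is hypothetical, and dividing by the length of the ground-truth sequence is an assumption; it agrees with the 3-out-of-4 example on this page, but the module may handle length mismatches differently.

```python
from typing import Dict, List


def score_sequence(pred: List[str], gold: List[str]) -> Dict[str, float]:
    """Compare moves position by position against the ground truth."""
    if not gold:
        # Assumption: an empty ground truth yields zero scores.
        return {"exact_match": 0.0, "frame_accuracy": 0.0}
    # zip stops at the shorter list, so missing predicted moves count as wrong.
    hits = sum(p == g for p, g in zip(pred, gold))
    return {
        "exact_match": 1.0 if pred == gold else 0.0,
        "frame_accuracy": hits / len(gold),
    }


print(score_sequence(["down", "right", "down", "left"], ["down", "right", "up", "left"]))
# {'exact_match': 0.0, 'frame_accuracy': 0.75}
```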

Geometry Problem I/O

| Input | Type | Description |
|---|---|---|
| doc["image"] | Image | Geometry diagram |
| doc["question"] / doc["problem"] | str | Problem statement |
| doc["answer"] / doc["solution_en"] | str | Ground truth answer |

| Output | Type | Description |
|---|---|---|
| exact_match | float | 1.0 if normalized answers match, else 0.0 |
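To illustrate what "normalized answers match" could mean in practice, the sketch below strips a leading answer phrase, degree markers, and whitespace before comparing. It is an assumed rule set, not the behavior of _normalize_geometry_answer: the function name normalize_answer and every rule in it are hypothetical, and the real helper may normalize more or less.

```python
import re


def normalize_answer(text: str) -> str:
    """Reduce an answer string to a canonical form for exact comparison."""
    s = text.strip().lower()
    # Drop a leading phrase such as "the answer is" or "final answer:".
    s = re.sub(r"^(the\s+)?(final\s+)?answer\b\s*(is|:)?\s*", "", s)
    # Treat "60°", "60 degrees" and "60 degree" as the same value.
    s = s.replace("°", "")
    s = re.sub(r"\bdegrees?\b", "", s)
    # Remove all whitespace, then any trailing sentence period.
    s = re.sub(r"\s+", "", s)
    return s.rstrip(".")


print(normalize_answer("The answer is 45 degrees"))  # 45
print(normalize_answer("60°") == normalize_answer("60 degrees"))  # True
```

A rule set like this only covers simple numeric answers; symbolic expressions or answers with units other than degrees would need additional handling.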

Usage Examples

# Jigsaw puzzle evaluation
jigsaw_doc = {
    "ref_image": ref_img,
    "cand0_image": cand0_img,
    "cand1_image": cand1_img,
    "label": 1
}
prompt = jigsaw_doc_to_text(jigsaw_doc)
# Example response; in practice this comes from the model under evaluation
model_response = "<FINAL_ANSWER_JSON>\n{\"choice\": 1, \"rationale\": \"...\"}\n</FINAL_ANSWER_JSON>"
result = jigsaw_process_results(jigsaw_doc, [model_response])
print(result["exact_match"])  # 1.0 (correct)

# Maze solving evaluation
maze_doc = {
    "initial_image": maze_img,
    "steps": "[\"right\", \"down\", \"right\", \"up\"]"
}
prompt = maze_doc_to_text(maze_doc)
# Example response; in practice this comes from the model under evaluation
model_response = "Let me solve... <ANSWER_JSON>[\"right\", \"down\", \"right\", \"up\"]</ANSWER_JSON>"
result = maze_process_results(maze_doc, [model_response])
print(result["exact_match"])  # 1.0
print(result["frame_accuracy"])  # 1.0

# Sliding puzzle with partial correctness
sliding_doc = {
    "initial_image": sliding_img,
    "steps_words": "[\"down\", \"right\", \"up\", \"left\"]"
}
model_output = "<ANSWER_JSON>[\"down\", \"right\", \"down\", \"left\"]</ANSWER_JSON>"
result = sliding_process_results(sliding_doc, [model_output])
print(result["exact_match"])  # 0.0 (not fully correct)
print(result["frame_accuracy"])  # 0.75 (3 out of 4 moves correct)

# Geometry problem evaluation
geom_doc = {
    "image": geom_diagram,
    "question": "Find the angle ABC if angle BAC is 30 degrees",
    "answer": "60 degrees"
}
prompt = geometry_doc_to_text(geom_doc)
# Example response; in practice this comes from the model under evaluation
model_response = "Using auxiliary lines... The answer is 60°"
result = geometry_process_results(geom_doc, [model_response])
print(result["exact_match"])  # 1.0 (normalized: "60" == "60")

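# Helper: Extract JSON from complex response
# (the private helpers are not in the import list above, so import them explicitly)
from lmms_eval.tasks.uni_mmmu.utils import _parse_json_list, _normalize_geometry_answer
response = "Let me think... The answer is <ANSWER_JSON>[\"up\", \"down\"]</ANSWER_JSON> because..."
moves = _parse_json_list(response.split("<ANSWER_JSON>")[1].split("</ANSWER_JSON>")[0])
print(moves)  # ["up", "down"]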

# Helper: Normalize geometry answers
normalized = _normalize_geometry_answer("The answer is 45 degrees")
print(normalized)  # "45"
