Implementation:EvolvingLMMs Lab Lmms eval MME CoT Utils

Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/mme_cot/utils.py

Purpose

Task-specific utilities for MME-CoT (Chain-of-Thought) evaluation using LLM-as-judge scoring for multimodal reasoning tasks.

Configuration

API_TYPE: from environment (default: "openai")
GPT_MODEL: read from the MODEL_VERSION environment variable (default: "gpt-4o-2024-11-20")
ServerConfig initialized for LLM judge server
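
The settings above could be resolved roughly as follows. This is a minimal sketch, not the module's actual code: the helper name load_judge_settings is hypothetical, and the environment variable for API_TYPE is assumed to share its name, since the page only says "from environment". The ServerConfig construction is left out because its signature is not shown here.

```python
import os


def load_judge_settings(env=None):
    """Resolve judge settings from environment variables, using the documented defaults."""
    env = os.environ if env is None else env
    return {
        # Assumption: the variable is literally named API_TYPE.
        "api_type": env.get("API_TYPE", "openai"),
        # MODEL_VERSION selects the judge model, per the configuration notes.
        "model": env.get("MODEL_VERSION", "gpt-4o-2024-11-20"),
    }
```

Passing a plain dict instead of os.environ makes the lookup easy to exercise without touching the real environment.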

Key Functions

mmecot_doc_to_visual

def mmecot_doc_to_visual(doc)

Processes base64-encoded images:

Decodes each base64 image from doc["image"]
Converts to PIL RGB images
Returns list of visual objects
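
The decoding steps above can be sketched with Pillow. This is an illustration under assumptions: doc["image"] is taken to be a list of base64 strings, and the function name decode_images is hypothetical rather than the name used in utils.py.

```python
import base64
from io import BytesIO

from PIL import Image


def decode_images(doc):
    """Decode every base64 string in doc["image"] into an RGB PIL image."""
    images = []
    for encoded in doc["image"]:
        raw = base64.b64decode(encoded)  # base64 text -> raw image bytes
        # Open from memory and normalize the mode (e.g. RGBA or L) to RGB.
        images.append(Image.open(BytesIO(raw)).convert("RGB"))
    return images
```

Converting to RGB up front means downstream model code does not need to handle alpha channels or grayscale inputs separately.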

mmecot_doc_to_text

def mmecot_doc_to_text(doc, lmms_eval_specific_kwargs=None)

Formats question with prompts:

Applies pre_prompt and post_prompt from kwargs
Extracts multiple-choice options (A-Z) from doc
Formats options with letter labels
Adds postfix based on postfix_type:
- "direct": "Please directly provide the final answer without any other output."
- "cot": "Please generate a step by step answer, include all your intermediate reasoning process, and provide the final answer at the end."
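
A rough sketch of the prompt assembly described above. Several details are assumptions made for illustration: options are taken to live under single-letter keys "A" to "Z" in the document, the "A. text" option layout and newline joins are guesses, and build_prompt is a hypothetical name. Only the two postfix strings come from this page.

```python
import string

POSTFIXES = {
    "direct": "Please directly provide the final answer without any other output.",
    "cot": (
        "Please generate a step by step answer, include all your intermediate "
        "reasoning process, and provide the final answer at the end."
    ),
}


def build_prompt(doc, kwargs=None):
    """Assemble question, lettered options, and the mode-specific postfix."""
    kwargs = kwargs or {}
    # Assumption: an option exists when its letter key is present and not None.
    options = [(letter, doc[letter]) for letter in string.ascii_uppercase
               if doc.get(letter) is not None]
    parts = [kwargs.get("pre_prompt", "") + doc["question"].strip()]
    parts.extend(f"{letter}. {text}" for letter, text in options)
    prompt = "\n".join(parts) + kwargs.get("post_prompt", "")
    postfix = POSTFIXES.get(kwargs.get("postfix_type"))
    if postfix:
        prompt = f"{prompt}\n{postfix}"
    return prompt
```

For example, build_prompt({"question": "What is 1+1?", "A": "1", "B": "2"}, {"postfix_type": "direct"}) yields the question, two lettered option lines, and the direct-answer instruction on the last line.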

mmecot_process_results

def mmecot_process_results(doc, results)

Processes results using LLM judge:

Custom Judge Prompt:

"You are given a question, the solution and the correct answer. Please determine if the solution matches the correct answer. Focus only on the mathematical or semantic correctness of the content. Ignore any differences in formatting, such as LaTeX syntax, symbols, styles, or additional wrappers (e.g., \boxed, $...$, or similar). Compare only the core mathematical or textual meaning of the solution and the correct answer. The process or reasoning leading to the Solution is irrelevant, ONLY the correctness of the result matters. Return only "Yes" if the solution is correct or "No" if it is incorrect. Only return "Yes" or "No" with no additional text or formatting."

Evaluation:

Uses server.evaluate_binary() with output_format="yes/no"
Converts "yes" to 1, otherwise 0
Logs errors on judge failure
Returns:
- "submission": dict with index and parsed predictions
- "llm_as_judge_eval": binary judge result (0 or 1)
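
The scoring flow above can be sketched without the real judge server. Here judge is a hypothetical callable standing in for server.evaluate_binary(), whose exact signature is not shown on this page. The field names doc["answer"], doc["index"], and "prediction" are assumptions, as is the choice to score a failed judge call as 0.

```python
import logging

logger = logging.getLogger(__name__)


def judge_to_score(judge, question, prediction, answer):
    """Return 1 for a "yes" verdict and 0 otherwise; failures are logged and scored 0."""
    try:
        verdict = judge(question=question, prediction=prediction, answer=answer)
    except Exception as exc:  # Assumption: any judge error counts as incorrect.
        logger.error("Judge call failed: %s", exc)
        return 0
    return 1 if str(verdict).strip().lower() == "yes" else 0


def process_result(doc, results, judge):
    """Build the per-document result dict with the two keys listed above."""
    prediction = results[0].strip()
    score = judge_to_score(judge, doc["question"], prediction, doc["answer"])
    return {
        "submission": {"index": doc["index"], "prediction": prediction},
        "llm_as_judge_eval": score,
    }
```

Injecting the judge as a parameter keeps the yes/no mapping testable with a stub, for instance lambda **kw: "Yes".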

mmecot_reasoning_aggregate_results

def mmecot_reasoning_aggregate_results(results, args)

Saves reasoning test results:

Generates submission file: "mmecot_reasoning_test_for_submission.json"
Writes results as JSON
Logs save location

mmecot_direct_aggregate_results

def mmecot_direct_aggregate_results(results, args)

Saves direct test results:

Generates submission file: "mmecot_direct_test_for_submission.json"
Writes results as JSON
Logs save location
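
Both aggregation functions follow the same pattern, which might look like the sketch below. It is written under assumptions: save_submission and its output_dir parameter are hypothetical, and the real functions derive the output location from args rather than from an explicit directory. Only the two file names are taken from this page.

```python
import json
import logging
import os

logger = logging.getLogger(__name__)


def save_submission(results, output_dir, mode):
    """Write results to mmecot_<mode>_test_for_submission.json and return the path."""
    if mode not in ("reasoning", "direct"):
        raise ValueError(f"unknown mode: {mode!r}")
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, f"mmecot_{mode}_test_for_submission.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    logger.info("Results saved to %s", path)  # log the save location
    return path
```

Sharing one helper between the two modes keeps the file naming consistent and leaves the mode string as the only difference.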

Implementation Details

Two evaluation modes: direct answer vs. chain-of-thought reasoning
LLM judge focuses on semantic correctness, ignoring formatting
Base64 image decoding for multi-image inputs
Multiple-choice options extracted dynamically from document
Submission files generated for both reasoning and direct modes
