Implementation:EvolvingLMMs Lab Lmms eval MME CoT Utils

From Leeroopedia

Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/mme_cot/utils.py

Principle: Task_Utility_Functions

Purpose

Task-specific utilities for MME-CoT (Chain-of-Thought) evaluation using LLM-as-judge scoring for multimodal reasoning tasks.

Configuration

  • API_TYPE: read from the environment (default: "openai")
  • GPT_MODEL: read from the MODEL_VERSION environment variable (default: "gpt-4o-2024-11-20")
  • A ServerConfig is initialized for the LLM judge server
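
The configuration lookup above can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper name load_judge_config is hypothetical, and it assumes the API type is read from an environment variable that is also named API_TYPE.

```python
import os


def load_judge_config(env=None):
    """Read judge settings from an environment mapping (hypothetical helper)."""
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    return {
        # Assumption: the environment variable shares the name API_TYPE.
        "api_type": env.get("API_TYPE", "openai"),
        # The page states that GPT_MODEL comes from MODEL_VERSION.
        "model": env.get("MODEL_VERSION", "gpt-4o-2024-11-20"),
    }


if __name__ == "__main__":
    print(load_judge_config())
```

Passing an explicit mapping instead of reading os.environ directly keeps the sketch easy to exercise without touching the real environment.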

Key Functions

mmecot_doc_to_visual

def mmecot_doc_to_visual(doc)

Processes base64-encoded images:

  • Decodes each base64 image from doc["image"]
  • Converts to PIL RGB images
  • Returns list of visual objects
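
A rough sketch of the decoding steps listed above, assuming doc["image"] is a list of base64 strings. The helper names are hypothetical, and the Pillow import is deferred so that the byte-decoding step can be run without Pillow installed.

```python
import base64
from io import BytesIO


def decode_image_bytes(doc):
    """Return the raw bytes of every base64 string in doc["image"]."""
    return [base64.b64decode(encoded) for encoded in doc["image"]]


def doc_to_rgb_images(doc):
    """Convert each decoded image to a PIL image in RGB mode."""
    # Pillow is imported lazily; it is only needed for this second step.
    from PIL import Image

    return [Image.open(BytesIO(raw)).convert("RGB") for raw in decode_image_bytes(doc)]
```

Converting to RGB normalizes palette, grayscale, and RGBA inputs to a single mode.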

mmecot_doc_to_text

def mmecot_doc_to_text(doc, lmms_eval_specific_kwargs=None)

Formats question with prompts:

  • Applies pre_prompt and post_prompt from kwargs
  • Extracts multiple-choice options (A-Z) from doc
  • Formats options with letter labels
  • Adds postfix based on postfix_type:
    • "direct": "Please directly provide the final answer without any other output."
    • "cot": "Please generate a step by step answer, include all your intermediate reasoning process, and provide the final answer at the end."
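
The prompt assembly described above might look roughly like this. The page does not show the document layout, so the sketch assumes the question sits under doc["question"], that each option sits under a single-letter key ("A" to "Z"), and that postfix_type is passed in the same kwargs dict; build_prompt is a hypothetical name.

```python
import string

# Postfix strings quoted from the list above.
POSTFIX = {
    "direct": "Please directly provide the final answer without any other output.",
    "cot": (
        "Please generate a step by step answer, include all your intermediate "
        "reasoning process, and provide the final answer at the end."
    ),
}


def build_prompt(doc, kwargs=None):
    """Join pre_prompt, question, lettered options, postfix, and post_prompt."""
    kwargs = kwargs or {}
    parts = [kwargs.get("pre_prompt", "") + doc["question"]]
    # Assumption: options live under keys "A".."Z"; missing or empty ones are skipped.
    for letter in string.ascii_uppercase:
        if doc.get(letter):
            parts.append(f"{letter}. {doc[letter]}")
    postfix = POSTFIX.get(kwargs.get("postfix_type", ""), "")
    if postfix:
        parts.append(postfix)
    return "\n".join(parts) + kwargs.get("post_prompt", "")
```

The exact separators and label format ("A. ...") are illustrative choices and may differ from the real implementation.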

mmecot_process_results

def mmecot_process_results(doc, results)

Processes results using LLM judge:

Custom Judge Prompt:

"You are given a question, the solution and the correct answer. Please determine if the solution matches the correct answer. Focus only on the mathematical or semantic correctness of the content. Ignore any differences in formatting, such as LaTeX syntax, symbols, styles, or additional wrappers (e.g., \boxed, $...$, or similar). Compare only the core mathematical or textual meaning of the solution and the correct answer. The process or reasoning leading to the Solution is irrelevant, ONLY the correctness of the result matters. Return only "Yes" if the solution is correct or "No" if it is incorrect. Only return "Yes" or "No" with no additional text or formatting."

Evaluation:

  • Uses server.evaluate_binary() with output_format="yes/no"
  • Converts "yes" to 1, otherwise 0
  • Logs errors on judge failure
  • Returns:
    • "submission": dict with index and parsed predictions
    • "llm_as_judge_eval": binary judge result (0 or 1)
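
The verdict-to-score mapping above can be illustrated with a stand-in judge. This sketch does not call server.evaluate_binary(), whose signature is not shown on this page. Instead it takes a hypothetical judge callable that returns the raw verdict text. The doc keys ("index", "question", "answer"), the "prediction" key inside the submission dict, and the choice to score 0 when the judge fails are all assumptions.

```python
import logging

logger = logging.getLogger(__name__)


def score_with_judge(doc, prediction, judge):
    """Map a yes/no judge verdict to 1/0 and package the result."""
    try:
        # `judge` is a hypothetical stand-in for the real judge server call.
        verdict = judge(doc["question"], prediction, doc["answer"])
        score = 1 if str(verdict).strip().lower() == "yes" else 0
    except Exception as exc:
        # Assumption: a failed judge call is logged and scored as 0.
        logger.error("Judge evaluation failed: %s", exc)
        score = 0
    return {
        "submission": {"index": doc["index"], "prediction": prediction},
        "llm_as_judge_eval": score,
    }
```

Anything other than a "yes" verdict, including malformed judge output, counts as incorrect, which matches the "otherwise 0" rule above.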

mmecot_reasoning_aggregate_results

def mmecot_reasoning_aggregate_results(results, args)

Saves reasoning test results:

  • Generates submission file: "mmecot_reasoning_test_for_submission.json"
  • Writes results as JSON
  • Logs save location

mmecot_direct_aggregate_results

def mmecot_direct_aggregate_results(results, args)

Saves direct test results:

  • Generates submission file: "mmecot_direct_test_for_submission.json"
  • Writes results as JSON
  • Logs save location
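
Both aggregate functions follow the same write-and-log pattern, which can be sketched as below. The page does not show how the output path is derived from args, so this illustration takes the directory explicitly; save_submission is a hypothetical helper, and results is assumed to be a JSON-serializable list of per-document "submission" dicts.

```python
import json
import logging
import os

logger = logging.getLogger(__name__)


def save_submission(results, filename, output_dir):
    """Write results as JSON under output_dir and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    logger.info("Results saved to %s", path)
    return path
```

Under these assumptions, the reasoning variant would pass "mmecot_reasoning_test_for_submission.json" as the filename and the direct variant "mmecot_direct_test_for_submission.json".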

Implementation Details

  • Two evaluation modes: direct answer vs. chain-of-thought reasoning
  • LLM judge focuses on semantic correctness, ignoring formatting
  • Base64 image decoding for multi-image inputs
  • Multiple-choice options extracted dynamically from document
  • Submission files generated for both reasoning and direct modes
