Implementation:EvolvingLMMs Lab Lmms eval MME CoT Utils
Location: /tmp/kapso_repo_sslb_59s/lmms_eval/tasks/mme_cot/utils.py
Principle: Task_Utility_Functions
Purpose
Task-specific utilities for MME-CoT (Chain-of-Thought) evaluation using LLM-as-judge scoring for multimodal reasoning tasks.
Configuration
- API_TYPE: read from the environment (default: "openai")
- GPT_MODEL: read from the MODEL_VERSION environment variable (default: "gpt-4o-2024-11-20")
- A ServerConfig is initialized for the LLM judge server
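The environment lookups above can be sketched as follows. This is a minimal illustration, not the module's code: the helper name load_judge_config is hypothetical, and the variable name API_TYPE is an assumption (the description only says the value comes from the environment). The ServerConfig construction is left out because its arguments are not documented here.

```python
import os


def load_judge_config(env=os.environ):
    """Hypothetical helper: read judge settings with the documented defaults."""
    return {
        # Assumption: the API type is stored under an "API_TYPE" variable.
        "api_type": env.get("API_TYPE", "openai"),
        # The judge model name comes from MODEL_VERSION.
        "model": env.get("MODEL_VERSION", "gpt-4o-2024-11-20"),
    }
```

Passing the mapping as a parameter keeps the lookup easy to exercise without touching the real process environment.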
Key Functions
mmecot_doc_to_visual
def mmecot_doc_to_visual(doc)
Decodes the base64-encoded images attached to a document:
- Decodes each base64 string in doc["image"]
- Converts each decoded image to a PIL image in RGB mode
- Returns the images as a list
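The decoding steps above might look roughly like this. It is a sketch that assumes doc["image"] is a list of base64 strings and that Pillow is installed. The function name doc_to_visual_sketch is illustrative and not part of the module.

```python
import base64
from io import BytesIO

from PIL import Image  # provided by the Pillow package


def doc_to_visual_sketch(doc):
    """Decode every base64 string in doc["image"] into an RGB PIL image."""
    images = []
    for encoded in doc["image"]:
        raw_bytes = base64.b64decode(encoded)
        # convert("RGB") drops any alpha channel or palette mode.
        images.append(Image.open(BytesIO(raw_bytes)).convert("RGB"))
    return images
```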
mmecot_doc_to_text
def mmecot_doc_to_text(doc, lmms_eval_specific_kwargs=None)
Builds the text prompt for a question:
- Applies pre_prompt and post_prompt from lmms_eval_specific_kwargs
- Extracts the multiple-choice options (A-Z) present in doc
- Formats each option with its letter label
- Appends a postfix selected by postfix_type:
  - "direct": "Please directly provide the final answer without any other output."
  - "cot": "Please generate a step by step answer, include all your intermediate reasoning process, and provide the final answer at the end."
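One way the prompt assembly described above could be written is shown below. Several details are assumptions made for illustration only: that the question text lives under doc["question"], that options are stored under single-letter keys "A" through "Z", that options are rendered as "A. text", and that the pieces are joined with newlines. The name build_prompt_sketch is hypothetical.

```python
import string

POSTFIXES = {
    "direct": "Please directly provide the final answer without any other output.",
    "cot": (
        "Please generate a step by step answer, include all your intermediate "
        "reasoning process, and provide the final answer at the end."
    ),
}


def build_prompt_sketch(doc, lmms_eval_specific_kwargs=None):
    kwargs = lmms_eval_specific_kwargs or {}
    # Assumption: the question text is stored under "question".
    parts = [doc["question"].strip()]
    # Assumption: options are stored under the keys "A".."Z"; absent keys are skipped.
    for letter in string.ascii_uppercase:
        option = doc.get(letter)
        if option is not None:
            parts.append(f"{letter}. {option}")
    postfix = POSTFIXES.get(kwargs.get("postfix_type"))
    if postfix:
        parts.append(postfix)
    body = "\n".join(parts)
    return f"{kwargs.get('pre_prompt', '')}{body}{kwargs.get('post_prompt', '')}"
```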
mmecot_process_results
def mmecot_process_results(doc, results)
Scores each model response with an LLM judge.
Custom Judge Prompt:
"You are given a question, the solution and the correct answer. Please determine if the solution matches the correct answer. Focus only on the mathematical or semantic correctness of the content. Ignore any differences in formatting, such as LaTeX syntax, symbols, styles, or additional wrappers (e.g., \boxed, $...$, or similar). Compare only the core mathematical or textual meaning of the solution and the correct answer. The process or reasoning leading to the Solution is irrelevant, ONLY the correctness of the result matters. Return only "Yes" if the solution is correct or "No" if it is incorrect. Only return "Yes" or "No" with no additional text or formatting."
Evaluation:
- Calls server.evaluate_binary() with output_format="yes/no"
- Maps a "yes" verdict to 1 and any other verdict to 0
- Logs an error if the judge call fails
- Returns a dict with two keys:
  - "submission": dict with the index and parsed predictions
  - "llm_as_judge_eval": binary judge result (0 or 1)
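The scoring flow above can be illustrated with a stand-in judge. This sketch does not call server.evaluate_binary(); it substitutes a hypothetical callable named judge that returns the verdict string. The doc keys "question", "answer" and "index", the "prediction" key inside the submission, and the fallback score of 0 on failure are all assumptions, since the description does not specify them.

```python
import logging

logger = logging.getLogger(__name__)


def score_with_judge_sketch(doc, prediction, judge):
    """Return the result dict for one document using a stand-in judge callable."""
    try:
        # Hypothetical judge interface: (question, solution, answer) -> "Yes" / "No".
        verdict = judge(doc["question"], prediction, doc["answer"])
        score = 1 if verdict.strip().lower() == "yes" else 0
    except Exception as exc:
        # Assumption: a failed judge call is logged and scored as 0.
        logger.error("Judge evaluation failed: %s", exc)
        score = 0
    return {
        "submission": {"index": doc["index"], "prediction": prediction},
        "llm_as_judge_eval": score,
    }
```

Taking the judge as a parameter makes the yes/no mapping testable without a running judge server.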
mmecot_reasoning_aggregate_results
def mmecot_reasoning_aggregate_results(results, args)
Saves reasoning test results:
- Generates submission file: "mmecot_reasoning_test_for_submission.json"
- Writes results as JSON
- Logs save location
mmecot_direct_aggregate_results
def mmecot_direct_aggregate_results(results, args)
Saves direct test results:
- Generates submission file: "mmecot_direct_test_for_submission.json"
- Writes results as JSON
- Logs save location
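Both aggregate functions follow the same save-and-log pattern and differ only in the file name. A rough sketch of that pattern is given below. The helper save_submission_sketch and its output_dir parameter are hypothetical. The real functions receive args and derive the path from it in a way not described here.

```python
import json
import logging
import os

logger = logging.getLogger(__name__)


def save_submission_sketch(results, output_dir, file_name):
    """Write results as JSON under output_dir and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, file_name)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    logger.info("Results saved to %s", path)
    return path
```

Under these assumptions, the reasoning variant would pass "mmecot_reasoning_test_for_submission.json" and the direct variant "mmecot_direct_test_for_submission.json".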
Implementation Details
- Two evaluation modes: direct answer vs. chain-of-thought reasoning
- LLM judge focuses on semantic correctness, ignoring formatting
- Base64 image decoding for multi-image inputs
- Multiple-choice options extracted dynamically from document
- Submission files generated for both reasoning and direct modes