Implementation:EvolvingLMMs Lab Lmms eval MMUPD Evaluation Engine

From Leeroopedia
Revision as of 12:31, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/EvolvingLMMs_Lab_Lmms_eval_MMUPD_Evaluation_Engine.md)
Knowledge Sources
Domains Vision, Evaluation
Last Updated 2026-02-14 00:00 GMT

Overview

MMUPD (MM-UPD, Multimodal Unsolvable Problem Detection) evaluator that assesses whether a vision-language model can recognize multiple-choice problems that cannot be answered as posed, in three settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD).

Description

This module implements the official MMUPD evaluation framework, which tests whether vision-language models can handle multiple-choice questions that cannot be answered as posed. It uses a two-stage evaluation pipeline: (1) direct answer extraction from the model output, and (2) GPT-based matching when no option can be inferred directly. The evaluator supports three UPD types (aad, iasd, ivqd) and three question types (base, option, inst), and reports per-category and overall accuracy. The GPT model used for the fallback is set by model_version (default: "gpt-3.5-turbo-0613").
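
The two-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the module's own code: the helpers infer_option_directly, extract_option, and text_matcher are hypothetical, and text_matcher is a local stand-in for the GPT-based matcher so that the example runs without an API call.

```python
import re
from typing import Callable, Dict, Optional

Matcher = Callable[[str, Dict[str, str]], Optional[str]]


def infer_option_directly(answer: str, options: Dict[str, str]) -> Optional[str]:
    """Stage 1 (hypothetical): accept an option letter only if exactly one
    option letter appears as a standalone token in the answer."""
    tokens = re.findall(r"[A-Za-z0-9]+", answer)
    hits = [letter for letter in options if letter in tokens]
    return hits[0] if len(hits) == 1 else None


def text_matcher(answer: str, options: Dict[str, str]) -> Optional[str]:
    """Stand-in for the GPT-based matcher: match on the option text instead."""
    lowered = answer.lower()
    hits = [letter for letter, text in options.items() if text.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


def extract_option(answer: str, options: Dict[str, str], fallback: Matcher) -> Optional[str]:
    """Try direct extraction first; call the fallback only when it fails."""
    direct = infer_option_directly(answer, options)
    if direct is not None:
        return direct
    return fallback(answer, options)


options = {"A": "cat", "B": "dog", "C": "bird"}
print(extract_option("A. The cat is sleeping", options, text_matcher))
print(extract_option("It is a dog", options, text_matcher))
```

Keeping the fallback as a callable means the expensive matcher is only invoked for answers the cheap first stage cannot resolve.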

Usage

Use this when evaluating vision-language models on the MMUPD benchmark to measure how well they detect absent answers, incompatible answer sets, and mismatches between the image and the question. The evaluator requires OpenAI API credentials for GPT-based answer matching when direct extraction fails.
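
One way to supply the credentials without hard-coding them is to read them from the environment before constructing the evaluator. The sketch below is an assumption about deployment, not part of the module: the helper load_credentials and the variable names MMUPD_API_KEY and MMUPD_API_URL are hypothetical.

```python
import os
from typing import Mapping, Tuple

# Endpoint taken from the usage example on this page.
DEFAULT_URL = "https://api.openai.com/v1/chat/completions"


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (api_key, api_url); fail early if the key is missing.

    MMUPD_API_KEY and MMUPD_API_URL are hypothetical variable names.
    """
    api_key = env.get("MMUPD_API_KEY", "")
    if not api_key:
        raise RuntimeError("MMUPD_API_KEY is not set; GPT-based matching needs it")
    return api_key, env.get("MMUPD_API_URL", DEFAULT_URL)
```

The returned pair could then be passed as API_KEY and API_URL to the MMUPD_Evaluator constructor.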

Code Reference

Source Location

Signature

class MMUPD_Evaluator:
    def __init__(
        self,
        sys_prompt: str = "There are several options:",
        API_KEY: str = "",
        API_URL: str = "",
        model_version: str = "gpt-3.5-turbo-0613"
    )

    def eval_result(
        self,
        results: pd.DataFrame,
        eval_method: str,
        upd_type: str,
        question_type: str,
        eval_type: str
    ) -> Tuple[float, Dict[str, float], pd.DataFrame]

    def can_infer_option(
        self,
        answer: str,
        option_dict: Dict[str, str],
        question_type: Optional[str] = None,
        valid_option: Optional[List[str]] = None
    ) -> Union[str, bool]

    def extract_answer_from_item(
        self,
        item: pd.Series,
        gt_text: str,
        eval_type: str,
        question_type: str,
        upd_type: str
    ) -> Tuple[str, str, List[str]]

Import

from lmms_eval.tasks.mmupd.mmupd_evals import MMUPD_Evaluator, load, dump

I/O Contract

Constructor Inputs

Parameter | Type | Description
sys_prompt | str | System prompt for option presentation (default: "There are several options:")
API_KEY | str | OpenAI API key for GPT-based answer matching
API_URL | str | API endpoint URL
model_version | str | GPT model version (default: "gpt-3.5-turbo-0613")

eval_result Inputs

Parameter | Type | Description
results | pd.DataFrame | DataFrame containing model predictions and ground truth
eval_method | str | Evaluation method (must be "openai")
upd_type | str | UPD type: "aad", "iasd", or "ivqd"
question_type | str | Question format: "base", "option", or "inst"
eval_type | str | Evaluation mode: "standard", "aad", "iasd", or "ivqd"

eval_result Outputs

Field | Type | Description
overall_hit_rate | float | Overall accuracy across all questions
category_hit_rate | Dict[str, float] | Per-category accuracy scores
data_main | pd.DataFrame | Result DataFrame with per-sample hit indicators
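
To make the relationship between the three outputs concrete, the sketch below derives an overall and a per-category hit rate from per-sample hit indicators. It is an illustration under assumptions: the field names "category" and "hit" and the helper hit_rates are hypothetical, and plain dictionaries stand in for DataFrame rows.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def hit_rates(rows: List[dict]) -> Tuple[float, Dict[str, float]]:
    """Compute (overall, per-category) accuracy from 0/1 hit indicators.

    Each row is assumed to carry a "category" label and a "hit" flag.
    """
    if not rows:
        return 0.0, {}
    per_category = defaultdict(list)
    for row in rows:
        per_category[row["category"]].append(row["hit"])
    overall = sum(row["hit"] for row in rows) / len(rows)
    by_category = {cat: sum(hits) / len(hits) for cat, hits in per_category.items()}
    return overall, by_category


rows = [
    {"category": "counting", "hit": 1},
    {"category": "counting", "hit": 0},
    {"category": "ocr", "hit": 1},
    {"category": "ocr", "hit": 1},
]
overall, by_category = hit_rates(rows)
print(overall, by_category)
```

Note that the overall rate is computed over all samples, so it is not the mean of the per-category rates when categories differ in size.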

Usage Examples

import pandas as pd

from lmms_eval.tasks.mmupd.mmupd_evals import MMUPD_Evaluator

# Initialize evaluator with API credentials
evaluator = MMUPD_Evaluator(
    API_KEY="your-api-key",
    API_URL="https://api.openai.com/v1/chat/completions",
    model_version="gpt-3.5-turbo-0613"
)

# Evaluate AAD (Absent Answer Detection) results
results_df = pd.read_csv("model_predictions.csv")
overall_acc, category_acc, detailed_results = evaluator.eval_result(
    results=results_df,
    eval_method="openai",
    upd_type="aad",
    question_type="inst",
    eval_type="aad"
)

print(f"Overall Accuracy: {overall_acc:.3f}")
for category, acc in category_acc.items():
    print(f"{category}: {acc:.3f}")

# Direct answer inference without API call
answer = "A. The cat is sleeping"
choices = {"A": "cat", "B": "dog", "C": "bird"}
inferred = evaluator.can_infer(answer, choices, question_type="base")
print(f"Inferred option: {inferred}")  # Output: "A"

# Calculate dual accuracy (standard + UPD)
standard_acc, std_category, std_df = evaluator.eval_result(
    results_df, "openai", "aad", "inst", "standard"
)
upd_acc, upd_category, upd_df = evaluator.eval_result(
    results_df, "openai", "aad", "inst", "aad"
)
dual_acc, dual_category, dual_df = evaluator.calculate_dual_acc(std_df, upd_df)
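
This page does not define dual accuracy. In the MM-UPD benchmark it counts a sample as correct only when both its standard version and its UPD version are answered correctly. The sketch below illustrates that rule under assumptions: dual_hits and dual_accuracy are hypothetical helpers keyed by a sample index, and they are not a description of how calculate_dual_acc is implemented.

```python
from typing import Dict


def dual_hits(standard_hits: Dict[int, int], upd_hits: Dict[int, int]) -> Dict[int, int]:
    """Mark a sample as a dual hit only if it is a hit in both runs.

    Only sample indices present in both runs are considered.
    """
    shared = standard_hits.keys() & upd_hits.keys()
    return {
        idx: int(standard_hits[idx] == 1 and upd_hits[idx] == 1)
        for idx in sorted(shared)
    }


def dual_accuracy(standard_hits: Dict[int, int], upd_hits: Dict[int, int]) -> float:
    """Fraction of shared samples that are correct in both runs."""
    hits = dual_hits(standard_hits, upd_hits)
    return sum(hits.values()) / len(hits) if hits else 0.0


standard = {0: 1, 1: 1, 2: 0}
upd = {0: 1, 1: 0, 2: 1}
print(dual_hits(standard, upd), dual_accuracy(standard, upd))
```

Under this rule, dual accuracy can never exceed the lower of the standard and UPD accuracies on the shared samples.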
