Overview
MMUPD_Evaluator is the evaluator for the MM-UPD (Unsolvable Problem Detection) benchmark. It measures whether a vision-language model can recognize unsolvable multiple-choice questions in three settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD).
Description
This module implements the official MM-UPD evaluation procedure, which tests whether vision-language models can handle multiple-choice questions that have no valid answer. Evaluation runs in two stages: (1) the answer is extracted directly from the model output, and (2) if no option can be inferred, a GPT model (default: gpt-3.5-turbo-0613) matches the output to an option. The evaluator supports three UPD types (aad, iasd, ivqd) and three question types (base, option, inst), and reports per-category and overall accuracy.
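The two-stage flow described above can be sketched as follows. This is a simplified illustration, not the module's actual code: `direct_extract`, `extract_with_fallback`, and the `matcher` callback are hypothetical names, and the real extraction rules in `can_infer_option` are likely stricter.

```python
import re
from typing import Callable, Dict, Optional

Options = Dict[str, str]


def direct_extract(answer: str, options: Options) -> Optional[str]:
    """Stage 1 (hypothetical): return a letter only when exactly one option is named."""
    # Treat punctuation as whitespace so "A." or "(B)" become bare tokens.
    tokens = re.sub(r"[^\w\s]", " ", answer).split()
    letters = [key for key in options if key in tokens]
    if len(letters) == 1:
        return letters[0]
    if letters:
        return None  # several letters named: ambiguous, leave it to stage 2
    # No letter found: try matching the option text itself.
    lowered = answer.lower()
    by_text = [key for key, text in options.items() if text.lower() in lowered]
    return by_text[0] if len(by_text) == 1 else None


def extract_with_fallback(
    answer: str,
    options: Options,
    matcher: Callable[[str, Options], str],
) -> str:
    """Stage 2 (hypothetical): call the matcher only when stage 1 gives up."""
    choice = direct_extract(answer, options)
    return choice if choice is not None else matcher(answer, options)
```

Passing the GPT call in as `matcher` keeps the sketch runnable offline; a stub such as `lambda answer, options: "Z"` is enough to exercise the fallback path. Note that a bare "A" used as an English article would be misread as option A here, which is one reason a real implementation needs extra checks.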
Usage
Use this evaluator to score vision-language models on the MM-UPD benchmark, which measures whether a model can detect absent answers (AAD), incompatible answer sets (IASD), and image-question mismatches (IVQD). OpenAI API credentials are required for the GPT-based answer matching that runs when direct extraction fails.
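One way to supply the credentials without hard-coding them is to read them from the environment. The sketch below is an assumption, not part of the module: the helper name `load_credentials` and the variable names `OPENAI_API_KEY` and `OPENAI_API_URL` are illustrative, and only the returned keyword names (`API_KEY`, `API_URL`) come from the constructor signature on this page.

```python
import os
from typing import Dict, Mapping

# Endpoint taken from the usage example on this page.
DEFAULT_API_URL = "https://api.openai.com/v1/chat/completions"


def load_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build constructor keyword arguments from environment variables (names assumed)."""
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        # Fail early rather than at the first GPT fallback call.
        raise KeyError("OPENAI_API_KEY is not set")
    return {
        "API_KEY": api_key,
        "API_URL": env.get("OPENAI_API_URL", DEFAULT_API_URL),
    }
```

The result can then be unpacked into the constructor, for example `MMUPD_Evaluator(**load_credentials())`.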
Code Reference
Source Location
Signature
class MMUPD_Evaluator:
    def __init__(
        self,
        sys_prompt: str = "There are several options:",
        API_KEY: str = "",
        API_URL: str = "",
        model_version: str = "gpt-3.5-turbo-0613"
    )

    def eval_result(
        self,
        results: pd.DataFrame,
        eval_method: str,
        upd_type: str,
        question_type: str,
        eval_type: str
    ) -> Tuple[float, Dict[str, float], pd.DataFrame]

    def can_infer_option(
        self,
        answer: str,
        option_dict: Dict[str, str],
        question_type: Optional[str] = None,
        valid_option: Optional[List[str]] = None
    ) -> Union[str, bool]

    def extract_answer_from_item(
        self,
        item: pd.Series,
        gt_text: str,
        eval_type: str,
        question_type: str,
        upd_type: str
    ) -> Tuple[str, str, List[str]]
Import
from lmms_eval.tasks.mmupd.mmupd_evals import MMUPD_Evaluator, load, dump
I/O Contract
Constructor Inputs
| Parameter | Type | Description |
|---|---|---|
| sys_prompt | str | System prompt for option presentation (default: "There are several options:") |
| API_KEY | str | OpenAI API key for GPT-based answer matching |
| API_URL | str | API endpoint URL |
| model_version | str | GPT model version (default: "gpt-3.5-turbo-0613") |
eval_result Inputs
| Parameter | Type | Description |
|---|---|---|
| results | pd.DataFrame | DataFrame containing model predictions and ground truth |
| eval_method | str | Evaluation method (must be "openai") |
| upd_type | str | UPD type: "aad", "iasd", or "ivqd" |
| question_type | str | Question format: "base", "option", or "inst" |
| eval_type | str | Evaluation mode: "standard", "aad", "iasd", or "ivqd" |
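Because every argument other than `results` is a plain string, a typo would only surface deep inside the evaluation. A small pre-check like the one below can catch it early. This is a sketch: `check_eval_args` is a hypothetical helper, not part of the module, and the allowed values are copied from the table above.

```python
# Allowed values, taken from the eval_result input table.
VALID_UPD_TYPES = {"aad", "iasd", "ivqd"}
VALID_QUESTION_TYPES = {"base", "option", "inst"}
VALID_EVAL_TYPES = {"standard", "aad", "iasd", "ivqd"}


def check_eval_args(
    eval_method: str, upd_type: str, question_type: str, eval_type: str
) -> None:
    """Raise ValueError if any argument falls outside the documented values."""
    if eval_method != "openai":
        raise ValueError(f"eval_method must be 'openai', got {eval_method!r}")
    if upd_type not in VALID_UPD_TYPES:
        raise ValueError(f"unknown upd_type: {upd_type!r}")
    if question_type not in VALID_QUESTION_TYPES:
        raise ValueError(f"unknown question_type: {question_type!r}")
    if eval_type not in VALID_EVAL_TYPES:
        raise ValueError(f"unknown eval_type: {eval_type!r}")
```

Calling `check_eval_args("openai", "aad", "inst", "aad")` returns silently, while a misspelled value such as `"isad"` raises immediately.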
eval_result Outputs
| Field | Type | Description |
|---|---|---|
| overall_hit_rate | float | Overall accuracy across all questions |
| category_hit_rate | Dict[str, float] | Per-category accuracy scores |
| data_main | pd.DataFrame | Result DataFrame with per-sample hit indicators |
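To make the relationship between the three outputs concrete, the sketch below derives an overall and a per-category hit rate from per-sample hit indicators. It is illustrative only: the field names `category` and `hit` and the helper `hit_rates` are assumptions, plain dictionaries stand in for DataFrame rows, and the overall score is assumed to be the mean over all samples rather than the mean of the category scores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple


def hit_rates(rows: Iterable[Mapping[str, object]]) -> Tuple[float, Dict[str, float]]:
    """Return (overall_hit_rate, category_hit_rate) from per-sample records."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for row in rows:
        category = str(row["category"])  # assumed field name
        totals[category] += 1
        hits[category] += int(row["hit"])  # assumed field name; 1 = correct, 0 = wrong
    per_category = {c: hits[c] / totals[c] for c in totals}
    n_samples = sum(totals.values())
    # Mean over samples, so larger categories weigh more.
    overall = sum(hits.values()) / n_samples if n_samples else 0.0
    return overall, per_category
```

With two samples in one category (one hit) and one sample in another (one hit), the per-category rates are 0.5 and 1.0 while the overall rate is 2/3, which shows why the overall figure is not simply the average of the category figures under this assumption.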
Usage Examples
import pandas as pd
from lmms_eval.tasks.mmupd.mmupd_evals import MMUPD_Evaluator

# Initialize evaluator with API credentials
evaluator = MMUPD_Evaluator(
    API_KEY="your-api-key",
    API_URL="https://api.openai.com/v1/chat/completions",
    model_version="gpt-3.5-turbo-0613"
)

# Evaluate AAD (Absent Answer Detection) results
results_df = pd.read_csv("model_predictions.csv")
overall_acc, category_acc, detailed_results = evaluator.eval_result(
    results=results_df,
    eval_method="openai",
    upd_type="aad",
    question_type="inst",
    eval_type="aad"
)
print(f"Overall Accuracy: {overall_acc:.3f}")
for category, acc in category_acc.items():
    print(f"{category}: {acc:.3f}")

# Direct answer inference without API call
answer = "A. The cat is sleeping"
choices = {"A": "cat", "B": "dog", "C": "bird"}
inferred = evaluator.can_infer(answer, choices, question_type="base")
print(f"Inferred option: {inferred}")  # Expected: "A"

# Calculate dual accuracy (standard + UPD)
standard_acc, std_category, std_df = evaluator.eval_result(
    results_df, "openai", "aad", "inst", "standard"
)
upd_acc, upd_category, upd_df = evaluator.eval_result(
    results_df, "openai", "aad", "inst", "aad"
)
dual_acc, dual_category, dual_df = evaluator.calculate_dual_acc(std_df, upd_df)
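The internals of `calculate_dual_acc` are not shown on this page. As a rough illustration of the idea, the sketch below assumes that a sample counts toward dual accuracy only when the model is correct on both the standard question and its UPD counterpart, and that the two result sets are paired by a shared sample index. The helper name `dual_hits` and the dictionary inputs are hypothetical stand-ins for the two DataFrames.

```python
from typing import Dict, Tuple


def dual_hits(
    standard_hits: Dict[int, int], upd_hits: Dict[int, int]
) -> Tuple[float, Dict[int, int]]:
    """Pair results by sample index and score 1 only when both passes are correct."""
    # Only samples present in both result sets can be paired.
    shared = sorted(standard_hits.keys() & upd_hits.keys())
    dual = {
        idx: int(standard_hits[idx] == 1 and upd_hits[idx] == 1)
        for idx in shared
    }
    accuracy = sum(dual.values()) / len(dual) if dual else 0.0
    return accuracy, dual
```

Under this assumption the dual score can never exceed either the standard or the UPD accuracy, since a miss on either pass zeroes the pair.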
Related Pages