Implementation:EvolvingLMMs Lab Lmms eval WildVision Bench Evaluation Utils

From Leeroopedia
Knowledge Sources
Domains: Vision, Evaluation, LLM_Judge
Last Updated: 2026-02-14 00:00 GMT

Overview

Utility functions for evaluating vision-language models on WildVision Bench. GPT-4o acts as the judge and compares each model response against a baseline model's response in a pairwise comparison.

Description

This module implements the WildVision Bench evaluation framework using GPT-4o (via the OpenAI or Azure OpenAI API) as an impartial judge. For each question it runs a pairwise comparison between a baseline model (A, configurable via task metadata) and the evaluated model (B), and asks the judge to pick one of five verdict labels: [[A>>B]] (baseline significantly better), [[A>B]] (baseline slightly better), [[A=B]] (tie), [[B>A]] (evaluated model slightly better), [[B>>A]] (evaluated model significantly better). From these verdicts the module computes several metrics: raw scores (-2 to +2, positive when the evaluated model wins), ELO ratings fitted via logistic regression, win rates, and the percentage of each judgement type.
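The mapping from verdict label to raw score can be sketched as follows. This is a minimal illustration, not the module's actual parser: the names `VERDICTS` and `parse_verdict` are hypothetical, and taking the last label found in the judge's reply is an assumption of this sketch.

```python
import re

# Hypothetical lookup: verdict label -> (raw score, judgement name),
# both expressed from the evaluated model's side (model B).
VERDICTS = {
    "A>>B": (-2, "Worse++"),
    "A>B": (-1, "Worse"),
    "A=B": (0, "Tie"),
    "B>A": (1, "Better"),
    "B>>A": (2, "Better++"),
}

# Matches any of the five labels wrapped in double square brackets.
_LABEL = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")


def parse_verdict(judge_text):
    """Return (score, judgement) for the last verdict label in judge_text.

    Assumption: if the judge emits several labels, the last one is final.
    A reply with no recognizable label is scored 0 and marked "Unclear".
    """
    labels = _LABEL.findall(judge_text)
    if not labels:
        return 0, "Unclear"
    return VERDICTS[labels[-1]]
```

For example, `parse_verdict("My final verdict is: [[B>>A]]")` yields a score of 2 with the judgement "Better++", while a reply without any label falls back to 0 and "Unclear".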

Usage

Use this module when evaluating vision-language models on open-ended visual question answering against a baseline model. Set the API_TYPE environment variable to "openai" or "azure" and provide the corresponding API key and endpoint. Set baseline_model in the task metadata YAML to the name of the baseline model; each document stores the baseline's response under that name. The judge rates responses on helpfulness, relevance, conciseness, and creativity, so models are expected to give comprehensive answers to complex visual queries.

Code Reference

Source Location

Signature

def wild_vision_doc_to_visual(doc: Dict) -> List[Image.Image]
def wild_vision_doc_to_text(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def wild_vision_doc_to_target(doc: Dict) -> str

def wild_vision_process_results(doc: Dict, results: List[str]) -> Dict[str, Dict]

def wild_vision_aggregation_raw_scores(results: List[Dict]) -> float
def wild_vision_aggregation_elo_scores(results: List[Dict]) -> float
def wild_vision_aggregation_win_rates(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_better(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_better_plus(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_worse(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_worse_plus(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_tie(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_unclear(results: List[Dict]) -> float

def get_chat_response(
    base64_image: str,
    prompt: str,
    max_retries: int = 5,
    wait_time: int = 10
) -> Tuple[str, str]

def compute_mle_elo(
    df: pd.DataFrame,
    baseline: str,
    SCALE: int = 400,
    BASE: int = 10,
    INIT_RATING: int = 1000
) -> pd.DataFrame

def predict_win_rate(
    elo_ratings: Dict[str, float],
    SCALE: int = 400,
    BASE: int = 10,
    INIT_RATING: int = 1000
) -> pd.DataFrame

def image_to_base64(pil_image: Image.Image) -> str
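The SCALE, BASE, and INIT_RATING parameters in the signatures above match the standard Elo logistic curve. The sketch below shows that curve and, for a single baseline-versus-model pairing, a closed-form rating gap. It is a simplified stand-in and not the module's logistic-regression fit: the function names are hypothetical, and counting a tie as half a win for each side is an assumption of this sketch.

```python
import math


def expected_win_rate(rating_a, rating_b, scale=400, base=10):
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + base ** ((rating_b - rating_a) / scale))


def two_model_ratings(wins_b, wins_a, ties, scale=400, base=10, init_rating=1000):
    """Closed-form ratings for one pairing, with baseline A fixed at init_rating.

    Assumption: a tie counts as half a win for each side.
    """
    total = wins_a + wins_b + ties
    if total == 0:
        raise ValueError("at least one comparison is required")
    p_b = (wins_b + 0.5 * ties) / total
    if p_b in (0.0, 1.0):
        # The rating gap diverges when one side never scores.
        raise ValueError("rating gap is unbounded when one side never scores")
    gap = scale * math.log(p_b / (1.0 - p_b), base)
    return {"model_a": float(init_rating), "model_b": init_rating + gap}
```

With equal ratings the expected win rate is 0.5, and a 400-point lead with the default scale and base gives 10/11. Feeding the ratings from `two_model_ratings` back into `expected_win_rate` recovers the observed score share, which is a quick way to sanity-check an ELO-based win rate against the raw counts.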

Import

from lmms_eval.tasks.wild_vision_bench.utils import (
    wild_vision_doc_to_visual,
    wild_vision_doc_to_text,
    wild_vision_process_results,
    wild_vision_aggregation_elo_scores,
    wild_vision_aggregation_win_rates
)

I/O Contract

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| API_TYPE | "openai" | API provider: "openai" or "azure" |
| OPENAI_API_URL | "https://api.openai.com/v1/chat/completions" | OpenAI endpoint |
| OPENAI_API_KEY | "YOUR_API_KEY" | OpenAI API key |
| AZURE_ENDPOINT | "https://api.cognitive.microsoft.com/..." | Azure endpoint |
| AZURE_API_KEY | "YOUR_API_KEY" | Azure API key |
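One way these variables could be resolved into a request URL and headers is sketched below. The function `resolve_judge_config` is hypothetical and not part of the module. The header names follow the public OpenAI convention (`Authorization: Bearer ...`) and the Azure OpenAI convention (`api-key`). Because the Azure default endpoint is elided in the table, this sketch requires AZURE_ENDPOINT to be set explicitly.

```python
import os


def resolve_judge_config(env=None):
    """Return {"url": ..., "headers": ...} for the configured judge provider.

    Hypothetical helper: reads the variables listed in the table above from
    `env` (defaults to os.environ).
    """
    env = os.environ if env is None else env
    api_type = env.get("API_TYPE", "openai")

    if api_type == "openai":
        url = env.get("OPENAI_API_URL", "https://api.openai.com/v1/chat/completions")
        key = env.get("OPENAI_API_KEY", "YOUR_API_KEY")
        headers = {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}
    elif api_type == "azure":
        if "AZURE_ENDPOINT" not in env:
            # Assumption of this sketch: no usable default endpoint for Azure.
            raise KeyError("AZURE_ENDPOINT must be set when API_TYPE is 'azure'")
        url = env["AZURE_ENDPOINT"]
        key = env.get("AZURE_API_KEY", "YOUR_API_KEY")
        headers = {"api-key": key, "Content-Type": "application/json"}
    else:
        raise ValueError(f"Unsupported API_TYPE: {api_type!r}")

    return {"url": url, "headers": headers}
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to test without touching the process environment.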

wild_vision_process_results Input

| Field | Type | Description |
| --- | --- | --- |
| doc["instruction"] | str | User prompt/question about the image |
| doc["image"] | Image.Image | PIL image to evaluate |
| doc[BASELINE_MODEL_NAME] | str | Baseline model's response (from metadata config) |
| results | List[str] | Evaluation model's response |

wild_vision_process_results Output

| Metric Key | Fields | Description |
| --- | --- | --- |
| raw_scores | final_score | Numeric score: -2 (worse++), -1 (worse), 0 (tie/unclear), +1 (better), +2 (better++) |
| elo_scores | question, model_a, model_b, winner, gpt_resps, model_resps, judgement | ELO calculation data |
| win_rates | question, model_a, model_b, winner | Win rate calculation data |
| judgements_better | judgement | "Better" if B>A |
| judgements_better_plus | judgement | "Better++" if B>>A |
| judgements_worse | judgement | "Worse" if A>B |
| judgements_worse_plus | judgement | "Worse++" if A>>B |
| judgements_tie | judgement | "Tie" if A=B |
| judgements_unclear | judgement | "Unclear" if parsing failed |

Aggregation Functions

| Function | Return | Description |
| --- | --- | --- |
| wild_vision_aggregation_raw_scores | float | Average raw score (-2 to +2) |
| wild_vision_aggregation_elo_scores | float | ELO-based win rate vs baseline (0-100) |
| wild_vision_aggregation_win_rates | float | Direct win rate percentage (0-100) |
| wild_vision_aggregation_judgements_better | float | Percentage of "Better" judgements |
| wild_vision_aggregation_judgements_better_plus | float | Percentage of "Better++" judgements |
| wild_vision_aggregation_judgements_worse | float | Percentage of "Worse" judgements |
| wild_vision_aggregation_judgements_worse_plus | float | Percentage of "Worse++" judgements |
| wild_vision_aggregation_judgements_tie | float | Percentage of "Tie" judgements |
| wild_vision_aggregation_judgements_unclear | float | Percentage of unclear judgements |
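The two simplest aggregations in the table, the average raw score and the per-label percentage, reduce to a mean and a count. The sketch below illustrates that arithmetic under the field names documented above (`final_score`, `judgement`). The helper names are hypothetical, and the module's own functions may differ in detail.

```python
def mean_raw_score(results):
    """Average of the per-question final_score values (range -2 to +2)."""
    scores = [r["final_score"] for r in results]
    return sum(scores) / len(scores)


def judgement_percentage(results, label):
    """Share (0-100) of results whose judgement equals the given label."""
    hits = sum(1 for r in results if r["judgement"] == label)
    return 100.0 * hits / len(results)
```

For four questions scored 2, -1, 0, and 1, the mean raw score is 0.5. For four judgements of which two are "Better", the "Better" percentage is 50.0.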

Usage Examples

import os
from lmms_eval.tasks.wild_vision_bench.utils import (
    wild_vision_doc_to_visual,
    wild_vision_doc_to_text,
    wild_vision_process_results,
    wild_vision_aggregation_elo_scores,
    wild_vision_aggregation_win_rates
)

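# Configure the judge API. Pick ONE provider; the values below are placeholders.
# If the module reads these variables at import time, set them before the import above.
os.environ["API_TYPE"] = "openai"
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["OPENAI_API_URL"] = "https://api.openai.com/v1/chat/completions"

# Azure alternative (uncomment to use instead of the OpenAI block above):
# os.environ["API_TYPE"] = "azure"
# os.environ["AZURE_API_KEY"] = "your-azure-key"
# os.environ["AZURE_ENDPOINT"] = "https://your-deployment.openai.azure.com/..."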

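# Prepare a document. The baseline response is stored under the baseline model's
# name (here "gpt4o", assumed to match baseline_model in the task metadata).
from PIL import Image

pil_image = Image.new("RGB", (64, 64))  # placeholder; load a real image with Image.open(...)
doc = {
    "instruction": "What is unusual about this image?",
    "image": pil_image,
    "gpt4o": "The image shows a cat sitting at a computer desk, which is unusual because cats typically don't use computers."
}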

# Generate prompt and get visuals
prompt = wild_vision_doc_to_text(doc, {"pre_prompt": "Please answer: ", "post_prompt": ""})
visuals = wild_vision_doc_to_visual(doc)

# Process model response
eval_model_response = [
    "This image is unusual because it depicts a cat positioned at a computer, "
    "appearing to work or browse, which is an anthropomorphic scenario not "
    "typically observed in reality."
]
results = wild_vision_process_results(doc, eval_model_response)

print(f"Raw Score: {results['raw_scores']['final_score']}")  # e.g., 1 (B>A)
print(f"Winner: {results['elo_scores']['winner']}")  # e.g., "model_b"
print(f"Judgement: {results['elo_scores']['judgement']}")  # e.g., "Better"
print(f"GPT Reasoning: {results['elo_scores']['gpt_resps']}")

# Aggregate across dataset
all_elo_data = [
    {"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
    {"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "tie"},
    {"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
]
elo_score = wild_vision_aggregation_elo_scores(all_elo_data)
print(f"ELO-based win rate: {elo_score:.2f}%")  # e.g., 65.23%

all_win_data = [
    {"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
    {"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_a"},
    {"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
]
win_rate = wild_vision_aggregation_win_rates(all_win_data)
print(f"Direct win rate: {win_rate:.2f}%")  # 66.67% (2/3 wins)

# Judgement distribution
judgements = [
    {"judgement": "Better"},
    {"judgement": "Better++"},
    {"judgement": "Worse"},
    {"judgement": "Better"}
]
better_pct = wild_vision_aggregation_judgements_better(judgements)
better_plus_pct = wild_vision_aggregation_judgements_better_plus(judgements)
worse_pct = wild_vision_aggregation_judgements_worse(judgements)
print(f"Better: {better_pct:.1f}%, Better++: {better_plus_pct:.1f}%, Worse: {worse_pct:.1f}%")
# Better: 50.0%, Better++: 25.0%, Worse: 25.0%

# Convert image to base64 for API
from lmms_eval.tasks.wild_vision_bench.utils import image_to_base64
base64_img = image_to_base64(pil_image)
print(f"Base64 length: {len(base64_img)}")
