Overview
Utility functions for evaluating vision-language models on WildVision Bench using GPT-4o as judge to compare model responses against a baseline model through pairwise comparison.
Description
This module implements the WildVision Bench evaluation framework using GPT-4o (or Azure OpenAI) as an impartial judge. It conducts pairwise comparisons between a baseline model (configurable via metadata) and the evaluation model, asking the judge to determine which response is better using a 5-point scale: [[A>>B]] (baseline significantly better), [[A>B]] (baseline slightly better), A=B (tie), [[B>A]] (eval model slightly better), [[B>>A]] (eval model significantly better). The module computes multiple metrics: raw scores (-2 to +2), ELO ratings via logistic regression, win rates, and judgement distribution percentages.
Usage
Use this when evaluating vision-language models on open-ended visual question answering against a baseline model. Configure API_TYPE environment variable ("openai" or "azure") and corresponding API keys. Set baseline_model in task metadata YAML (default: baseline model name). The judge evaluates helpfulness, relevance, conciseness, and creativity, requiring models to provide comprehensive responses to complex visual queries.
Code Reference
Source Location
Signature
def wild_vision_doc_to_visual(doc: Dict) -> List[Image.Image]
def wild_vision_doc_to_text(
doc: Dict,
lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def wild_vision_doc_to_target(doc: Dict) -> str
def wild_vision_process_results(doc: Dict, results: List[str]) -> Dict[str, Dict]
def wild_vision_aggregation_raw_scores(results: List[Dict]) -> float
def wild_vision_aggregation_elo_scores(results: List[Dict]) -> float
def wild_vision_aggregation_win_rates(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_better(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_better_plus(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_worse(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_worse_plus(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_tie(results: List[Dict]) -> float
def wild_vision_aggregation_judgements_unclear(results: List[Dict]) -> float
def get_chat_response(
base64_image: str,
prompt: str,
max_retries: int = 5,
wait_time: int = 10
) -> Tuple[str, str]
def compute_mle_elo(
df: pd.DataFrame,
baseline: str,
SCALE: int = 400,
BASE: int = 10,
INIT_RATING: int = 1000
) -> pd.DataFrame
def predict_win_rate(
elo_ratings: Dict[str, float],
SCALE: int = 400,
BASE: int = 10,
INIT_RATING: int = 1000
) -> pd.DataFrame
def image_to_base64(pil_image: Image.Image) -> str
Import
from lmms_eval.tasks.wild_vision_bench.utils import (
wild_vision_doc_to_visual,
wild_vision_doc_to_text,
wild_vision_process_results,
wild_vision_aggregation_elo_scores,
wild_vision_aggregation_win_rates
)
I/O Contract
Environment Variables
wild_vision_process_results Input
| Field |
Type |
Description
|
| doc["instruction"] |
str |
User prompt/question about the image
|
| doc["image"] |
Image.Image |
PIL image to evaluate
|
| doc[BASELINE_MODEL_NAME] |
str |
Baseline model's response (from metadata config)
|
| results |
List[str] |
Evaluation model's response
|
wild_vision_process_results Output
| Metric Key |
Fields |
Description
|
| raw_scores |
final_score |
Numeric score: -2 (worse++), -1 (worse), 0 (tie/unclear), +1 (better), +2 (better++)
|
| elo_scores |
question, model_a, model_b, winner, gpt_resps, model_resps, judgement |
ELO calculation data
|
| win_rates |
question, model_a, model_b, winner |
Win rate calculation data
|
| judgements_better |
judgement |
"Better" if B>A
|
| judgements_better_plus |
judgement |
"Better++" if B>>A
|
| judgements_worse |
judgement |
"Worse" if A>B
|
| judgements_worse_plus |
judgement |
"Worse++" if A>>B
|
| judgements_tie |
judgement |
"Tie" if A=B
|
| judgements_unclear |
judgement |
"Unclear" if parsing failed
|
Aggregation Functions
| Function |
Return |
Description
|
| wild_vision_aggregation_raw_scores |
float |
Average raw score (-2 to +2)
|
| wild_vision_aggregation_elo_scores |
float |
ELO-based win rate vs baseline (0-100)
|
| wild_vision_aggregation_win_rates |
float |
Direct win rate percentage (0-100)
|
| wild_vision_aggregation_judgements_better |
float |
Percentage of "Better" judgements
|
| wild_vision_aggregation_judgements_better_plus |
float |
Percentage of "Better++" judgements
|
| wild_vision_aggregation_judgements_worse |
float |
Percentage of "Worse" judgements
|
| wild_vision_aggregation_judgements_worse_plus |
float |
Percentage of "Worse++" judgements
|
| wild_vision_aggregation_judgements_tie |
float |
Percentage of "Tie" judgements
|
| wild_vision_aggregation_judgements_unclear |
float |
Percentage of unclear judgements
|
Usage Examples
import os
from lmms_eval.tasks.wild_vision_bench.utils import (
wild_vision_doc_to_visual,
wild_vision_doc_to_text,
wild_vision_process_results,
wild_vision_aggregation_elo_scores,
wild_vision_aggregation_win_rates
)
# Configure API (OpenAI example)
os.environ["API_TYPE"] = "openai"
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["OPENAI_API_URL"] = "https://api.openai.com/v1/chat/completions"
# Or use Azure
os.environ["API_TYPE"] = "azure"
os.environ["AZURE_API_KEY"] = "your-azure-key"
os.environ["AZURE_ENDPOINT"] = "https://your-deployment.openai.azure.com/..."
# Prepare document (baseline_model from metadata)
doc = {
"instruction": "What is unusual about this image?",
"image": pil_image,
"gpt4o": "The image shows a cat sitting at a computer desk, which is unusual because cats typically don't use computers."
}
# Generate prompt and get visuals
prompt = wild_vision_doc_to_text(doc, {"pre_prompt": "Please answer: ", "post_prompt": ""})
visuals = wild_vision_doc_to_visual(doc)
# Process model response
eval_model_response = [
"This image is unusual because it depicts a cat positioned at a computer, "
"appearing to work or browse, which is an anthropomorphic scenario not "
"typically observed in reality."
]
results = wild_vision_process_results(doc, eval_model_response)
print(f"Raw Score: {results['raw_scores']['final_score']}") # e.g., 1 (B>A)
print(f"Winner: {results['elo_scores']['winner']}") # e.g., "model_b"
print(f"Judgement: {results['elo_scores']['judgement']}") # e.g., "Better"
print(f"GPT Reasoning: {results['elo_scores']['gpt_resps']}")
# Aggregate across dataset
all_elo_data = [
{"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
{"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "tie"},
{"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
]
elo_score = wild_vision_aggregation_elo_scores(all_elo_data)
print(f"ELO-based win rate: {elo_score:.2f}%") # e.g., 65.23%
all_win_data = [
{"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
{"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_a"},
{"model_a": "gpt4o", "model_b": "evaluation_model", "winner": "model_b"},
]
win_rate = wild_vision_aggregation_win_rates(all_win_data)
print(f"Direct win rate: {win_rate:.2f}%") # 66.67% (2/3 wins)
# Judgement distribution
judgements = [
{"judgement": "Better"},
{"judgement": "Better++"},
{"judgement": "Worse"},
{"judgement": "Better"}
]
better_pct = wild_vision_aggregation_judgements_better(judgements)
better_plus_pct = wild_vision_aggregation_judgements_better_plus(judgements)
worse_pct = wild_vision_aggregation_judgements_worse(judgements)
print(f"Better: {better_pct:.1f}%, Better++: {better_plus_pct:.1f}%, Worse: {worse_pct:.1f}%")
# Better: 50.0%, Better++: 25.0%, Worse: 25.0%
# Convert image to base64 for API
from lmms_eval.tasks.wild_vision_bench.utils import image_to_base64
base64_img = image_to_base64(pil_image)
print(f"Base64 length: {len(base64_img)}")
Related Pages