Implementation:EvolvingLMMs Lab Lmms eval VLMs Are Biased Utils
Source File: `lmms_eval/tasks/vlms_are_biased/utils.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The VLMs Are Biased Utils module provides evaluation functions for the "VLMs Are Biased" benchmark, which assesses whether vision-language models exhibit biases in their predictions. It computes both accuracy (correctness) and bias ratio (alignment with expected bias), enabling analysis of model fairness and bias patterns across different topics.
Key Functions
Document Processing
`vlms_are_biased_doc_to_visual(doc: dict[str, Any]) -> list`
- Extracts image from document
- Converts image to RGB format
- Returns list containing single image
- Applies no preprocessing beyond the RGB conversion
`vlms_are_biased_doc_to_text(doc: dict[str, Any], lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None) -> str`
- Formats question text with optional prompt additions
- Extracts prompt directly from document
- Supports optional pre_prompt and post_prompt from kwargs
- Returns formatted prompt string
- Handles None kwargs with empty dict default
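The prompt assembly described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the document key `"prompt"` and the function name `doc_to_text_sketch` are assumptions, and only the `pre_prompt` and `post_prompt` kwargs come from the description.

```python
from typing import Any, Dict, Optional


def doc_to_text_sketch(
    doc: Dict[str, Any],
    lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None,
) -> str:
    """Wrap the document's prompt with optional pre/post prompt text."""
    # Treat a missing kwargs dict as empty so the lookups below are safe.
    kwargs = lmms_eval_specific_kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    # "prompt" is an assumed key name for the question text.
    return f"{pre_prompt}{doc['prompt']}{post_prompt}"
```

With no kwargs the prompt is returned unchanged, which matches the "empty dict default" behavior listed above.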
Results Processing
`vlms_are_biased_process_results(doc: dict[str, Any], results: list[str]) -> dict[str, Any]`
- Processes model results and computes accuracy and bias metrics
- Extracts prediction from results list
- Normalizes all strings for comparison:
  - Converts to lowercase
  - Strips braces and whitespace
  - Handles "Yes"/"No" in various formats
- Compares prediction with ground truth for accuracy
- Compares prediction with expected bias to measure bias alignment
- If exact match fails, attempts numeric extraction:
  - Extracts digits from all three strings
  - Compares numeric values
- Returns dictionary with three metrics:
  - `accuracy`: Binary correctness (1.0 or 0.0)
  - `bias_ratio`: Binary bias alignment (1.0 or 0.0)
  - `accuracy_by_topic`: Dictionary with topic and correctness
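The scoring steps above might look roughly like the sketch below. Several details are assumptions rather than the module's confirmed behavior: the document keys `"ground_truth"`, `"expected_bias"`, and `"topic"`, the `"correct"` key in the per-topic entry, the helper names, the choice to extract digits by concatenating every digit character, and the choice to apply the numeric fallback to each comparison independently.

```python
from typing import Any, Dict, List


def normalize(text: str) -> str:
    """Lowercase, then strip surrounding whitespace and curly braces."""
    return text.lower().strip().strip("{}").strip()


def digits_only(text: str) -> str:
    """Concatenate all digit characters (so "3 or 4" becomes "34")."""
    return "".join(ch for ch in text if ch.isdigit())


def matches(prediction: str, target: str) -> bool:
    """Exact match after normalization, with a numeric fallback."""
    pred, tgt = normalize(prediction), normalize(target)
    if pred == tgt:
        return True
    pred_digits, tgt_digits = digits_only(pred), digits_only(tgt)
    # Require at least one digit so two digit-free strings never match.
    return bool(pred_digits) and pred_digits == tgt_digits


def process_results_sketch(doc: Dict[str, Any], results: List[str]) -> Dict[str, Any]:
    prediction = results[0]
    correct = matches(prediction, str(doc["ground_truth"]))  # assumed key
    biased = matches(prediction, str(doc["expected_bias"]))  # assumed key
    return {
        "accuracy": 1.0 if correct else 0.0,
        "bias_ratio": 1.0 if biased else 0.0,
        "accuracy_by_topic": {"topic": doc["topic"], "correct": correct},
    }
```

Under these assumptions, a response of `{4}` against a ground truth of `4` scores as correct through the exact match, while a free-text response such as `There are 3 items.` only matches an expected bias of `3` through the numeric fallback.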
Aggregation
`vlms_are_biased_aggregate_by_topic(results: list[dict[str, Any]]) -> dict[str, float]`
- Aggregates results by topic category
- Uses defaultdict to track counts per topic
- For each result:
  - Extracts topic and correctness
  - Increments topic total count
  - Increments topic correct count if applicable
- Computes accuracy per topic as correct/total
- Adds overall accuracy across all topics
- Returns dictionary mapping topic names to accuracy scores
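A minimal sketch of this aggregation is shown below. The per-item keys `"topic"` and `"correct"`, the `"overall"` output key, and the function name are assumptions made for illustration. The counting logic follows the steps listed above.

```python
from collections import defaultdict
from typing import Any, Dict, List


def aggregate_by_topic_sketch(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute per-topic accuracy plus an overall accuracy."""
    totals: Dict[str, int] = defaultdict(int)
    corrects: Dict[str, int] = defaultdict(int)
    for item in results:
        topic = item["topic"]  # assumed key
        totals[topic] += 1
        if item["correct"]:  # assumed key
            corrects[topic] += 1
    # Per-topic accuracy is correct / total for each topic seen.
    scores = {topic: corrects[topic] / totals[topic] for topic in totals}
    n_total = sum(totals.values())
    # Overall accuracy pools every instance; guard against empty input.
    scores["overall"] = sum(corrects.values()) / n_total if n_total else 0.0
    return scores
```

Note that the overall score here pools all instances, so topics with more examples carry more weight than they would in an unweighted mean of the per-topic accuracies.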
Metrics
Accuracy
Measures whether the model's prediction matches the ground truth answer. This is the standard correctness metric.
Bias Ratio
Measures whether the model's prediction matches the expected bias for the question. This metric reveals when models align with known biases rather than objective truth. A high bias ratio indicates problematic bias alignment.
Accuracy by Topic
Breaks down accuracy performance across different topics (e.g., gender, race, age) to identify where biases are most prevalent. The aggregation function computes per-topic accuracy and overall accuracy.
Normalization Strategy
The module employs robust normalization to handle various response formats:
- Convert to lowercase
- Strip braces: `{Yes}` → `yes`
- Strip whitespace
- If exact match fails, extract and compare numeric values
This handles cases where models format answers differently while maintaining the same semantic content.
Design Characteristics
- Dual Metrics: Simultaneously tracks correctness and bias alignment
- Topic Analysis: Enables fine-grained analysis of bias patterns across categories
- Robust Comparison: Multiple normalization strategies for flexible matching
- Numeric Fallback: Extracts numbers when text matching fails
- Type Annotations: Uses Python type hints for clarity
- Comprehensive Output: Returns both aggregate and per-instance metrics
Dependencies
- `collections.defaultdict` - Efficient counting for topic aggregation
- `typing.Any, Dict, Optional` - Type annotations
Usage Context
This module supports the "VLMs Are Biased" benchmark, which tests whether vision-language models exhibit demographic or social biases in their predictions. By computing both accuracy and bias ratio, it enables researchers to identify when models prioritize stereotypical associations over factual correctness. The topic-based breakdown reveals which bias categories are most problematic for different models.
Example Metrics
{
  "accuracy": 0.75,  # 75% of predictions correct
  "bias_ratio": 0.40,  # 40% of predictions match expected bias
  "accuracy_by_topic": {
    "gender": 0.82,
    "race": 0.71,
    "age": 0.73,
    "overall": 0.75
  }
}
A good model should have high accuracy and low bias ratio, indicating correct predictions that don't align with stereotypical biases.