Implementation:EvolvingLMMs Lab Lmms eval VLMs Are Biased Utils

Source File: `lmms_eval/tasks/vlms_are_biased/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

This module provides the evaluation functions for the "VLMs Are Biased" benchmark, which tests whether vision-language models answer from memorized prior knowledge instead of from what the image actually shows. For each response it computes accuracy (whether the prediction matches the ground truth) and bias ratio (whether the prediction matches the expected biased answer), and it groups accuracy by topic so that bias patterns can be compared across categories.

Key Functions

Document Processing

vlms_are_biased_doc_to_visual(doc: dict[str, Any]) -> list

Extracts image from document

- Converts the image to RGB format
- Returns a list containing the single image
- Performs no preprocessing beyond the RGB conversion
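
The behavior described above can be sketched roughly as follows. This is an illustration, not the module's actual code: the field name `"image"` is an assumption, and the value is assumed to be a PIL image (or any object exposing `.convert()`).

```python
from typing import Any


def doc_to_visual_sketch(doc: dict[str, Any]) -> list:
    # "image" is an assumed field name; the value is expected to behave
    # like a PIL.Image.Image, i.e. to expose .convert(mode).
    image = doc["image"]
    # Convert to three-channel RGB so grayscale, palette, or RGBA inputs
    # all reach the model in the same format.
    return [image.convert("RGB")]
```

Returning a one-element list keeps the interface uniform with tasks that supply several images per document.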

vlms_are_biased_doc_to_text(doc: dict[str, Any], lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None) -> str

Formats question text with optional prompt additions

- Reads the prompt directly from the document
- Supports an optional pre_prompt and post_prompt taken from lmms_eval_specific_kwargs
- Treats a None kwargs argument as an empty dict
- Returns the formatted prompt string
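
A minimal sketch of this formatting step is shown below. The field name `"prompt"` and the choice to concatenate without separators are assumptions made for illustration; the real module may differ.

```python
from typing import Any, Dict, Optional


def doc_to_text_sketch(
    doc: dict[str, Any],
    lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None,
) -> str:
    # Fall back to an empty dict so the .get() calls below are always safe.
    kwargs = lmms_eval_specific_kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    # "prompt" is an assumed field name for the question text.
    return f"{pre_prompt}{doc['prompt']}{post_prompt}"


print(doc_to_text_sketch({"prompt": "How many legs?"}, {"post_prompt": " Answer in braces."}))
```

With no kwargs the prompt is returned unchanged, so the same function works whether or not a task config supplies prompt additions.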

Results Processing

vlms_are_biased_process_results(doc: dict[str, Any], results: list[str]) -> dict[str, Any]

Processes model results and computes accuracy and bias metrics

- Extracts the prediction from the results list
- Normalizes the prediction, ground truth, and expected bias for comparison:
  - Converts to lowercase
  - Strips braces and whitespace
  - Handles "Yes"/"No" in various formats
- Compares the prediction with the ground truth for accuracy
- Compares the prediction with the expected bias to measure bias alignment
- If the exact match fails, attempts numeric extraction:
  - Extracts digits from all three strings
  - Compares the numeric values
- Returns a dictionary with three metrics:
  - accuracy: Binary correctness (1.0 or 0.0)
  - bias_ratio: Binary bias alignment (1.0 or 0.0)
  - accuracy_by_topic: Dictionary with topic and correctness
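
The steps above can be sketched as follows. This is a hedged illustration rather than the module's code: the field names `"ground_truth"`, `"expected_bias"`, and `"topic"`, the inner keys `"topic"` and `"correct"`, and the "first run of digits" rule for the numeric fallback are all assumptions.

```python
import re
from typing import Any, Optional


def normalize(text: Any) -> str:
    """Lowercase and strip whitespace and curly braces, e.g. ' {Yes} ' -> 'yes'."""
    return str(text).strip().lower().strip("{}").strip()


def first_int(text: str) -> Optional[int]:
    """Return the first run of digits as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def matches(pred: str, target: str) -> bool:
    """Exact match on normalized strings, with a numeric fallback."""
    if pred == target:
        return True
    pred_num, target_num = first_int(pred), first_int(target)
    return pred_num is not None and pred_num == target_num


def process_results_sketch(doc: dict[str, Any], results: list[str]) -> dict[str, Any]:
    pred = normalize(results[0])
    # "ground_truth", "expected_bias", and "topic" are assumed field names.
    is_correct = matches(pred, normalize(doc["ground_truth"]))
    is_biased = matches(pred, normalize(doc["expected_bias"]))
    return {
        "accuracy": float(is_correct),
        "bias_ratio": float(is_biased),
        # The inner key names "topic" and "correct" are also assumptions.
        "accuracy_by_topic": {"topic": doc["topic"], "correct": is_correct},
    }


# Hypothetical item: the image shows 5 legs, prior knowledge suggests 4.
doc = {"ground_truth": "{5}", "expected_bias": "{4}", "topic": "Animals"}
print(process_results_sketch(doc, ["There are {4} legs."]))
```

In this hypothetical item the response fails the exact match against both targets, and the numeric fallback then credits it to the expected bias, giving an accuracy of 0.0 and a bias_ratio of 1.0.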

Aggregation

vlms_are_biased_aggregate_by_topic(results: list[dict[str, Any]]) -> dict[str, float]

Aggregates results by topic category

- Uses defaultdict to track counts per topic
- For each result:
  - Extracts the topic and correctness
  - Increments the topic's total count
  - Increments the topic's correct count if the result is correct
- Computes accuracy per topic as correct/total
- Adds the overall accuracy across all topics
- Returns a dictionary mapping topic names to accuracy scores
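
A rough sketch of this aggregation is given below. The per-item keys `"topic"` and `"correct"` and the output key `"overall"` are assumptions chosen to match the description above, not confirmed names from the source file.

```python
from collections import defaultdict
from typing import Any


def aggregate_by_topic_sketch(results: list[dict[str, Any]]) -> dict[str, float]:
    totals: defaultdict[str, int] = defaultdict(int)
    corrects: defaultdict[str, int] = defaultdict(int)
    for item in results:
        topic = item["topic"]  # assumed key
        totals[topic] += 1
        if item["correct"]:  # assumed key
            corrects[topic] += 1

    # Per-topic accuracy is correct / total for that topic.
    scores = {topic: corrects[topic] / totals[topic] for topic in totals}
    # Overall accuracy pools all items, so larger topics weigh more.
    n_items = sum(totals.values())
    scores["overall"] = sum(corrects.values()) / n_items if n_items else 0.0
    return scores


items = [
    {"topic": "Animals", "correct": True},
    {"topic": "Animals", "correct": False},
    {"topic": "Logos", "correct": True},
]
print(aggregate_by_topic_sketch(items))
```

Because the overall score is pooled over items rather than averaged over topics, it need not equal the mean of the per-topic accuracies.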

Metrics

Accuracy

Measures whether the model's prediction matches the ground truth answer. This is the standard correctness metric.

Bias Ratio

Measures whether the model's prediction matches the expected bias for the question. This metric reveals when a model gives the answer its prior knowledge suggests rather than the one the image supports. A high bias ratio means the model's errors are systematically pulled toward the expected biased answer rather than scattered.

Accuracy by Topic

Breaks down accuracy across the benchmark's topics (e.g., animals, logos, flags) to show where bias hurts performance most. The aggregation function computes per-topic accuracy and overall accuracy.

Normalization Strategy

The module normalizes each response before comparison to handle varied response formats:

- Convert to lowercase
- Strip braces: {Yes} → yes
- Strip whitespace
- If the exact match fails, extract and compare numeric values

This handles cases where models format answers differently while maintaining the same semantic content.

Design Characteristics

- Dual Metrics: Simultaneously tracks correctness and bias alignment
- Topic Analysis: Enables fine-grained analysis of bias patterns across categories
- Robust Comparison: Multiple normalization strategies for flexible matching
- Numeric Fallback: Extracts numbers when text matching fails
- Type Annotations: Uses Python type hints for clarity
- Comprehensive Output: Returns both aggregate and per-instance metrics

Dependencies

- collections.defaultdict: Efficient counting for topic aggregation
- typing.Any, typing.Dict, typing.Optional: Type annotations

Usage Context

This module supports the "VLMs Are Biased" benchmark, which tests whether vision-language models fall back on memorized prior knowledge when an image contradicts it. By computing both accuracy and bias ratio, it lets researchers see when a model gives the expected biased answer instead of the correct one. The topic-based breakdown shows which categories are most problematic for a given model.

Example Metrics

{
    "accuracy": 0.75,            # 75% of predictions correct (illustrative values throughout)
    "bias_ratio": 0.20,          # 20% of predictions match the expected bias
    "accuracy_by_topic": {
        "Animals": 0.82,
        "Logos": 0.71,
        "Flags": 0.73,
        "overall": 0.75
    }
}

A good model has high accuracy and a low bias ratio: its predictions are correct and do not default to the expected biased answer.
