Implementation:EvolvingLMMs Lab Lmms eval VLMs Are Biased Utils

From Leeroopedia

Source File: `lmms_eval/tasks/vlms_are_biased/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The VLMs Are Biased Utils module provides evaluation functions for the "VLMs Are Biased" benchmark, which assesses whether vision-language models exhibit biases in their predictions. It computes both accuracy (correctness) and bias ratio (alignment with expected bias), enabling analysis of model fairness and bias patterns across different topics.

Key Functions

Document Processing

vlms_are_biased_doc_to_visual(doc: dict[str, Any]) -> list
Extracts image from document
  • Converts image to RGB format
  • Returns list containing single image
  • Applies no preprocessing beyond the RGB conversion
vlms_are_biased_doc_to_text(doc: dict[str, Any], lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None) -> str
Formats question text with optional prompt additions
  • Extracts prompt directly from document
  • Supports optional pre_prompt and post_prompt from kwargs
  • Returns formatted prompt string
  • Handles None kwargs with empty dict default
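The two document-processing functions above can be sketched as follows. This is a minimal illustration, not the module's actual code: the field names `"image"` and `"prompt"`, the `_sketch` function names, the `FakeImage` stand-in, and the sample prompt text are all assumptions made for the example. The real image object is presumably a PIL image, but anything with a `.convert(mode)` method works here.

```python
from typing import Any, Dict, Optional


class FakeImage:
    """Hypothetical stand-in for a PIL image; only .convert() is needed here."""

    def __init__(self, mode: str = "RGBA") -> None:
        self.mode = mode

    def convert(self, mode: str) -> "FakeImage":
        return FakeImage(mode)


def doc_to_visual_sketch(doc: Dict[str, Any]) -> list:
    # "image" is an assumed field name; the result is a one-element list.
    return [doc["image"].convert("RGB")]


def doc_to_text_sketch(
    doc: Dict[str, Any],
    lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None,
) -> str:
    # A None kwargs argument falls back to an empty dict.
    kwargs = lmms_eval_specific_kwargs or {}
    pre_prompt = kwargs.get("pre_prompt", "")
    post_prompt = kwargs.get("post_prompt", "")
    # "prompt" is an assumed field name for the question text.
    return f"{pre_prompt}{doc['prompt']}{post_prompt}"


# Made-up document used only to exercise the sketch.
doc = {"image": FakeImage(), "prompt": "How many circles are in the image?"}
visuals = doc_to_visual_sketch(doc)
plain = doc_to_text_sketch(doc)
wrapped = doc_to_text_sketch(doc, {"post_prompt": " Answer in curly brackets."})
```

With no kwargs the prompt passes through unchanged; with a `post_prompt` the extra text is appended directly, so any separating space must be part of the kwarg value.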

Results Processing

vlms_are_biased_process_results(doc: dict[str, Any], results: list[str]) -> dict[str, Any]
Processes model results and computes accuracy and bias metrics
  • Extracts prediction from results list
  • Normalizes all strings for comparison:
    • Converts to lowercase
    • Strips braces and whitespace
    • Handles "Yes"/"No" answers regardless of case or surrounding braces (e.g., "{Yes}" and "yes" compare equal)
  • Compares prediction with ground truth for accuracy
  • Compares prediction with expected bias to measure bias alignment
  • If exact match fails, attempts numeric extraction:
    • Extracts digits from the prediction, ground truth, and expected bias strings
    • Compares numeric values
  • Returns dictionary with three metrics:
    • accuracy: Binary correctness (1.0 or 0.0)
    • bias_ratio: Binary bias alignment (1.0 or 0.0)
    • accuracy_by_topic: Dictionary with topic and correctness
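A rough sketch of the per-item flow described above, under several assumptions: the document fields are taken to be named `"ground_truth"`, `"expected_bias"`, and `"topic"`, the per-topic entry is assumed to use the keys `"topic"` and `"correct"`, and the numeric fallback is assumed to compare the first integer found in each string. The actual module may differ on any of these points.

```python
import re
from typing import Any, Dict, List, Optional


def _normalize(text: Any) -> str:
    # Lowercase, then strip surrounding whitespace and braces.
    return str(text).lower().strip().strip("{}").strip()


def _first_number(text: str) -> Optional[int]:
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def _matches(pred: str, target: str) -> bool:
    if pred == target:
        return True
    # Fallback (assumed behavior): compare the first integer in each string.
    pred_num = _first_number(pred)
    return pred_num is not None and pred_num == _first_number(target)


def process_results_sketch(doc: Dict[str, Any], results: List[str]) -> Dict[str, Any]:
    pred = _normalize(results[0])
    correct = _matches(pred, _normalize(doc["ground_truth"]))
    biased = _matches(pred, _normalize(doc["expected_bias"]))
    return {
        "accuracy": float(correct),
        "bias_ratio": float(biased),
        "accuracy_by_topic": {"topic": doc["topic"], "correct": correct},
    }


# Made-up item: the model gives the expected-bias answer, not the truth.
doc = {"ground_truth": "5", "expected_bias": "4", "topic": "topic_a"}
biased_item = process_results_sketch(doc, ["{4}"])
correct_item = process_results_sketch(doc, ["There are 5 in total"])
```

In this sketch the first response scores `accuracy` 0.0 and `bias_ratio` 1.0, while the second is rescued by the numeric fallback and scores 1.0 and 0.0.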

Aggregation

vlms_are_biased_aggregate_by_topic(results: list[dict[str, Any]]) -> dict[str, float]
Aggregates results by topic category
  • Uses defaultdict to track counts per topic
  • For each result:
    • Extracts topic and correctness
    • Increments topic total count
    • Increments topic correct count if applicable
  • Computes accuracy per topic as correct/total
  • Adds overall accuracy across all topics
  • Returns dictionary mapping topic names to accuracy scores
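The aggregation steps above might look like the sketch below. The per-item keys `"topic"` and `"correct"` are assumptions, and the `"overall"` key is borrowed from the example metrics on this page; the function name is hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def aggregate_by_topic_sketch(results: List[Dict[str, Any]]) -> Dict[str, float]:
    totals: Dict[str, int] = defaultdict(int)
    corrects: Dict[str, int] = defaultdict(int)

    for item in results:
        topic = item["topic"]  # assumed key
        totals[topic] += 1
        if item["correct"]:  # assumed key
            corrects[topic] += 1

    # Per-topic accuracy is correct / total for that topic.
    scores = {topic: corrects[topic] / totals[topic] for topic in totals}

    # Overall accuracy pools every item, so larger topics weigh more.
    n_items = sum(totals.values())
    scores["overall"] = sum(corrects.values()) / n_items if n_items else 0.0
    return scores


# Made-up per-item results used only to exercise the sketch.
items = [
    {"topic": "topic_a", "correct": True},
    {"topic": "topic_a", "correct": False},
    {"topic": "topic_b", "correct": True},
]
scores = aggregate_by_topic_sketch(items)
```

Note that the overall score here is a micro-average over items rather than a mean of the per-topic accuracies; the two differ when topics have unequal sizes.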

Metrics

Accuracy

Measures whether the model's prediction matches the ground truth answer. This is the standard correctness metric.

Bias Ratio

Measures whether the model's prediction matches the expected bias for the question. This metric reveals when models align with known biases rather than objective truth. A high bias ratio means the model frequently gives the answer the bias predicts.
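To make the relationship between the two metrics concrete, the sketch below averages made-up per-item flags, assuming the benchmark-level scores are simple means of the binary per-item values (consistent with the percentages in the example metrics on this page, but not confirmed against the task configuration).

```python
from statistics import mean

# Made-up per-item outputs: each item is correct, biased, or neither.
per_item = [
    {"accuracy": 1.0, "bias_ratio": 0.0},  # correct answer
    {"accuracy": 0.0, "bias_ratio": 1.0},  # wrong, matches the expected bias
    {"accuracy": 0.0, "bias_ratio": 0.0},  # wrong, but not the biased answer
    {"accuracy": 1.0, "bias_ratio": 0.0},  # correct answer
]

accuracy = mean(item["accuracy"] for item in per_item)
bias_ratio = mean(item["bias_ratio"] for item in per_item)
other_errors = 1.0 - accuracy - bias_ratio
```

When the ground truth and the expected bias differ for every item, a prediction cannot match both, so the remainder `1 - accuracy - bias_ratio` is the share of errors that are not explained by the expected bias.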

Accuracy by Topic

Breaks down accuracy performance across different topics (e.g., gender, race, age) to identify where biases are most prevalent. The aggregation function computes per-topic accuracy and overall accuracy.

Normalization Strategy

The module employs robust normalization to handle various response formats:

  1. Convert to lowercase
  2. Strip braces: {Yes} → yes
  3. Strip whitespace
  4. If exact match fails, extract and compare numeric values

This handles cases where models format answers differently while maintaining the same semantic content.
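The four steps can be illustrated with a short sketch. The helper names are hypothetical, and the fallback is assumed to compare the first integer in each string. One detail worth noting: if a response has whitespace outside the braces, stripping braces before whitespace leaves the braces in place, so this sketch strips whitespace both before and after.

```python
import re


def normalize(text: str) -> str:
    # Lowercase, trim whitespace, drop surrounding braces, trim again.
    return text.lower().strip().strip("{}").strip()


def answers_match(prediction: str, target: str) -> bool:
    pred, ref = normalize(prediction), normalize(target)
    if pred == ref:
        return True
    # Fallback (assumed behavior): compare the first integer in each string.
    pred_num = re.search(r"\d+", pred)
    ref_num = re.search(r"\d+", ref)
    if pred_num is None or ref_num is None:
        return False
    return int(pred_num.group()) == int(ref_num.group())


# Made-up responses used only to exercise the sketch.
checks = {
    "braces": answers_match(" {Yes} ", "yes"),
    "numeric": answers_match("I count 3 stripes", "3"),
    "mismatch": answers_match("no", "yes"),
}
```

A first-integer fallback is deliberately loose: a response such as "3 or 4" would match a target of "3", which is one reason to prompt models for a single braced answer.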

Design Characteristics

  • Dual Metrics: Simultaneously tracks correctness and bias alignment
  • Topic Analysis: Enables fine-grained analysis of bias patterns across categories
  • Robust Comparison: Multiple normalization strategies for flexible matching
  • Numeric Fallback: Extracts numbers when text matching fails
  • Type Annotations: Uses Python type hints for clarity
  • Comprehensive Output: Returns both aggregate and per-instance metrics

Dependencies

  • collections.defaultdict - Efficient counting for topic aggregation
  • typing.Any, Dict, Optional - Type annotations

Usage Context

This module supports the "VLMs Are Biased" benchmark, which tests whether vision-language models exhibit demographic or social biases in their predictions. By computing both accuracy and bias ratio, it enables researchers to identify when models prioritize stereotypical associations over factual correctness. The topic-based breakdown reveals which bias categories are most problematic for different models.

Example Metrics

{
    "accuracy": 0.75,           # 75% of predictions correct
    "bias_ratio": 0.40,          # 40% of predictions match expected bias
    "accuracy_by_topic": {
        "gender": 0.82,
        "race": 0.71,
        "age": 0.73,
        "overall": 0.75
    }
}

A good model should have high accuracy and low bias ratio, indicating correct predictions that don't align with stereotypical biases.
