Implementation:EvolvingLMMs Lab Lmms eval GroundingMe Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Computer Vision, Object Grounding, Bounding Box Detection
Last Updated	2026-02-14 00:00 GMT

Overview

Bounding box evaluation utilities for the GroundingMe benchmark that assesses visual grounding across multiple subtasks.

Description

This module provides the evaluation infrastructure for GroundingMe, a benchmark that tests a model's ability to ground object descriptions to bounding boxes in images. It covers four main subtasks (Discriminative, Spatial, Limited visibility, and Rejection), which are further broken down into 13 categories (appearance, component, text, and state attributes; spatial relationships; counting; occlusion; small objects). The module handles document-to-input conversion; bounding box parsing from multiple coordinate formats (normalized [0,1], integer [0,999], MIMO resized space, Qwen resized space); IoU computation; accuracy at three IoU thresholds (0.5, 0.75, 0.9); and metric aggregation across all subtasks and categories. Format detection is automatic: each prediction is interpreted under every supported coordinate format, and the interpretation with the highest IoU against the ground truth is kept.
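The best-IoU format selection described above can be illustrated with a minimal sketch. It is not the module's implementation: the names `iou` and `pick_best_interpretation` are hypothetical, only three of the four interpretations are covered (the MIMO and Qwen resized spaces are left out), and dividing [0,999] coordinates by 1000 is an assumption.

```python
from typing import Dict, List, Sequence, Tuple

Box = Sequence[float]  # [x_min, y_min, x_max, y_max]


def iou(a: Box, b: Box) -> float:
    """Plain IoU for two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def pick_best_interpretation(
    pred: Box, gt: Box, width: int, height: int
) -> Tuple[str, List[float], float]:
    """Hypothetical helper: score each coordinate interpretation, keep the best."""
    candidates: Dict[str, List[float]] = {
        # Prediction is already in original pixel coordinates.
        "raw": [float(v) for v in pred],
        # Prediction is normalized to [0, 1].
        "normalized_0_1": [pred[0] * width, pred[1] * height,
                           pred[2] * width, pred[3] * height],
        # Prediction is on an integer [0, 999] grid (divisor 1000 is an assumption).
        "integer_0_999": [pred[0] / 1000 * width, pred[1] / 1000 * height,
                          pred[2] / 1000 * width, pred[3] / 1000 * height],
    }
    # On ties, max() keeps the first candidate in insertion order ("raw").
    name = max(candidates, key=lambda k: iou(candidates[k], gt))
    box = candidates[name]
    return name, box, iou(box, gt)


gt = [100, 150, 300, 400]
name, box, score = pick_best_interpretation([0.1, 0.2, 0.25, 0.5], gt, width=1200, height=800)
print(name, round(score, 3))  # normalized_0_1 0.864
```

Because the selection uses the ground-truth box, it always scores a prediction under its most favorable interpretation; that is a property of the approach described here, not something specific to this sketch.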

Usage

Use this module when evaluating visual grounding models on the GroundingMe benchmark. The workflow has four steps: (1) call groundingme_doc_to_visual and groundingme_doc_to_text to prepare the image and the description prompt, (2) collect model predictions as JSON with a bbox_2d field, (3) call groundingme_process_result to parse the bounding boxes and compute the IoU and accuracy metrics, and (4) call the aggregation functions (groundingme_iou, groundingme_acc05, groundingme_discriminative_macc, etc.) to compute final scores, both overall and per subtask or category.
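Steps (2) and (3) depend on pulling a bbox_2d value out of free-form model output. The sketch below shows one way that parsing could work. `parse_bbox_sketch` is a hypothetical stand-in, not the module's parse_bbox, and its fallback rule (take the first four numbers in the text) is an assumption.

```python
import json
import re
from typing import List

NO_BOX = [0.0, 0.0, 0.0, 0.0]  # sentinel for "no box found"


def parse_bbox_sketch(text: str) -> List[float]:
    """Hypothetical parser: prefer a JSON object with bbox_2d, else raw numbers."""
    # 1. Look for flat JSON objects embedded anywhere in the text.
    for match in re.finditer(r"\{[^{}]*\}", text):
        try:
            obj = json.loads(match.group(0))
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "bbox_2d" in obj:
            box = obj["bbox_2d"]
            if (isinstance(box, list) and len(box) == 4
                    and all(isinstance(v, (int, float)) for v in box)):
                return [float(v) for v in box]
            # bbox_2d present but null or malformed: treat as "no object".
            return list(NO_BOX)
    # 2. Fallback: first four numbers in the text. The key name is removed
    #    first so the "2" in "bbox_2d" is not mistaken for a coordinate.
    cleaned = text.replace("bbox_2d", "")
    numbers = re.findall(r"\d+(?:\.\d+)?", cleaned)
    if len(numbers) >= 4:
        return [float(n) for n in numbers[:4]]
    return list(NO_BOX)


print(parse_bbox_sketch('{"bbox_2d": [120, 160, 310, 390]}'))  # [120.0, 160.0, 310.0, 390.0]
print(parse_bbox_sketch('{"bbox_2d": null}'))                  # [0.0, 0.0, 0.0, 0.0]
```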

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/groundingme/utils.py
Lines: 1-531

Signature

# Document conversion functions
def groundingme_doc_to_visual(doc: Dict[str, Any]) -> List[Any]:
    """Convert document to visual input for model evaluation."""

def groundingme_doc_to_text(doc: Dict[str, Any]) -> str:
    """Convert document to text prompt for model evaluation."""

# Result processing functions
def groundingme_process_result(doc, result):
    """Process prediction result and compute evaluation metrics."""

def parse_bbox(input_str: str) -> List[float]:
    """Extract bounding box from JSON format."""

# Coordinate conversion functions
def smart_resize_mimo(height: int, width: int, factor: int = 28,
                      min_pixels: int = 28*28*8, max_pixels: int = 28*28*4096):
    """Resize image for MIMO models with factor-divisible dimensions."""

def smart_resize_qwen(height: int, width: int, factor: int = 28,
                      min_pixels: int = 4*28*28, max_pixels: int = 16384*28*28):
    """Resize image for Qwen models with factor-divisible dimensions."""

def convert_bbox_from_mimo(bbox: List[float], width: int, height: int) -> List[float]:
    """Convert bbox coordinates from MIMO resized space to original image space."""

def convert_bbox_from_qwen(bbox: List[float], width: int, height: int) -> List[float]:
    """Convert bbox coordinates from Qwen resized space to original image space."""

# Metric computation functions
def compute_iou(box1, box2):
    """Compute Intersection over Union (IoU) of two bounding boxes."""

def compute_accuracy(iou, threshold=0.5):
    """Check if IoU meets the specified threshold."""

def compute_center_accuracy(box1, box2):
    """Check if the center point of box2 is within box1."""

# Aggregation functions
def groundingme_aggregation_result(results, metric):
    """Aggregate evaluation results by specified metric and subtask categories."""

def groundingme_iou(results):
    """Aggregate IoU scores."""

def groundingme_acc05(results):
    """Aggregate accuracy at 0.5 IoU threshold."""

def groundingme_acc075(results):
    """Aggregate accuracy at 0.75 IoU threshold."""

def groundingme_acc09(results):
    """Aggregate accuracy at 0.9 IoU threshold."""

def groundingme_center_acc(results):
    """Aggregate center accuracy."""

def groundingme_macc(results):
    """Aggregate mean accuracy across all thresholds."""

# Subtask-specific aggregations (L1)
def groundingme_discriminative_acc05(results):
    """Aggregate Discriminative subtask accuracy at 0.5."""

def groundingme_spatial_acc05(results):
    """Aggregate Spatial subtask accuracy at 0.5."""

def groundingme_limited_acc05(results):
    """Aggregate Limited subtask accuracy at 0.5."""

def groundingme_rejection_acc(results):
    """Aggregate Rejection subtask accuracy."""

# Category-specific aggregations (L2) - examples
def groundingme_d_appearance_acc05(results):
    """Aggregate Discriminative-Appearance accuracy at 0.5."""

def groundingme_d_component_macc(results):
    """Aggregate Discriminative-Component mean accuracy."""

def groundingme_relationship_acc075(results):
    """Aggregate Spatial-Relationship accuracy at 0.75."""

def groundingme_counting_macc(results):
    """Aggregate Spatial-Counting mean accuracy."""

def groundingme_occlusion_acc09(results):
    """Aggregate Limited-Occlusion accuracy at 0.9."""

def groundingme_small_macc(results):
    """Aggregate Limited-Small mean accuracy."""

def groundingme_r_appearance_acc(results):
    """Aggregate Rejection-Appearance accuracy."""

Import

from lmms_eval.tasks.groundingme.utils import (
    groundingme_doc_to_visual,
    groundingme_doc_to_text,
    groundingme_process_result,
    parse_bbox,
    compute_iou,
    groundingme_iou,
    groundingme_acc05,
    groundingme_discriminative_macc,
    groundingme_spatial_acc075,
    groundingme_d_appearance_acc05
)
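
The three metric helpers named in the signature (compute_iou, compute_accuracy, compute_center_accuracy) follow standard definitions. The sketch below restates those definitions with hypothetical `_sketch` names. It assumes continuous coordinates (no +1 pixel offset) and inclusive box boundaries for the center check; the module's own functions may differ in such details.

```python
from typing import Sequence

Box = Sequence[float]  # [x_min, y_min, x_max, y_max]


def compute_iou_sketch(box1: Box, box2: Box) -> float:
    """Intersection area divided by union area; 0.0 for disjoint or empty boxes."""
    inter_w = max(0.0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    inter_h = max(0.0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = inter_w * inter_h
    area1 = max(0.0, box1[2] - box1[0]) * max(0.0, box1[3] - box1[1])
    area2 = max(0.0, box2[2] - box2[0]) * max(0.0, box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0


def compute_accuracy_sketch(iou: float, threshold: float = 0.5) -> bool:
    """A prediction counts as correct when its IoU reaches the threshold."""
    return iou >= threshold


def compute_center_accuracy_sketch(box1: Box, box2: Box) -> bool:
    """True when the center point of box2 lies inside box1 (edges inclusive)."""
    cx = (box2[0] + box2[2]) / 2
    cy = (box2[1] + box2[3]) / 2
    return box1[0] <= cx <= box1[2] and box1[1] <= cy <= box1[3]


ground_truth = [100, 150, 300, 400]
prediction = [120, 160, 310, 390]

# Intersection 180 x 230 = 41400; union 50000 + 43700 - 41400 = 52300.
iou = compute_iou_sketch(ground_truth, prediction)
print(round(iou, 4))                                             # 0.7916
print(compute_accuracy_sketch(iou, 0.75))                        # True
print(compute_accuracy_sketch(iou, 0.9))                         # False
print(compute_center_accuracy_sketch(ground_truth, prediction))  # True
```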

I/O Contract

Inputs

Name	Type	Required	Description
doc	dict	Yes	Document with image, description, bbox, subtask_l1, subtask_l2, id, height, width
result	list	Yes	List with single prediction string containing JSON bbox
input_str	str	Yes	String to parse for bounding box (JSON or raw coordinates)
box1	list	Yes	First bounding box [x_min, y_min, x_max, y_max]
box2	list	Yes	Second bounding box [x_min, y_min, x_max, y_max]
height	int	Yes	Original image height in pixels
width	int	Yes	Original image width in pixels
threshold	float	No	IoU threshold for accuracy (default 0.5)

Outputs

Name	Type	Description
doc_to_visual return	list	List with single PIL Image in RGB format
doc_to_text return	str	Formatted prompt with description and instructions
process_result return	dict	Dict with 65 keys, one per metric, each mapping to the per-sample data dict
parse_bbox return	list	Bounding box [x1, y1, x2, y2] or [0, 0, 0, 0] if not found
compute_iou return	float	IoU value between 0.0 and 1.0
compute_accuracy return	bool	True if IoU >= threshold, else False
compute_center_accuracy return	bool	True if center of box2 is within box1
convert_bbox functions return	list	Converted bounding box in original image coordinates
smart_resize functions return	tuple	(width, height) after resizing with constraints
aggregation functions return	float	Aggregated metric score across samples
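
To show how the smart_resize and convert_bbox rows fit together, here is a minimal sketch of a factor-divisible resize and the matching rescale back to original pixels. The rounding scheme is an assumption modeled on the commonly used Qwen2-VL-style resize, not taken from this module. The names `smart_resize_sketch` and `convert_bbox_sketch` are hypothetical, and the (width, height) return order follows the table above.

```python
import math
from typing import List, Sequence, Tuple


def smart_resize_sketch(
    height: int,
    width: int,
    factor: int = 28,
    min_pixels: int = 4 * 28 * 28,
    max_pixels: int = 16384 * 28 * 28,
) -> Tuple[int, int]:
    """Return (width, height) rounded to multiples of `factor` within a pixel budget."""
    # Round each side to the nearest multiple of factor (at least one factor).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by a common ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by a common ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return w_bar, h_bar


def convert_bbox_sketch(bbox: Sequence[float], width: int, height: int) -> List[float]:
    """Map a box from the resized space back to original pixel coordinates."""
    w_bar, h_bar = smart_resize_sketch(height, width)
    scale_x = width / w_bar
    scale_y = height / h_bar
    return [bbox[0] * scale_x, bbox[1] * scale_y, bbox[2] * scale_x, bbox[3] * scale_y]


print(smart_resize_sketch(800, 1200))  # (1204, 812)
print([round(v, 2) for v in convert_bbox_sketch([0, 0, 1204, 812], 1200, 800)])
# [0.0, 0.0, 1200.0, 800.0]
```

A model that predicts boxes in the resized space is off by only a small scale factor on most images, which is why an evaluator may need to try this conversion explicitly rather than rely on the raw values.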

Usage Examples

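# Example 1: Document processing
from PIL import Image

from lmms_eval.tasks.groundingme.utils import (
    groundingme_doc_to_visual,
    groundingme_doc_to_text,
)

doc = {
    "id": "groundingme_001",
    "image": Image.open("image.jpg"),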
    "description": "A red car on the left side of the street",
    "bbox": [100, 150, 300, 400],
    "subtask_l1": "Discriminative",
    "subtask_l2": "Appearance",
    "height": 800,
    "width": 1200
}

# Get visual input
visual = groundingme_doc_to_visual(doc)
# Returns: [<PIL.Image.Image in RGB>]

# Get text prompt
text = groundingme_doc_to_text(doc)
# Returns: "All spatial relationships are defined from the viewer's perspective...
#           A red car on the left side of the street
#           Provide bbox as JSON: {\"bbox_2d\": [x1, y1, x2, y2]}..."

# Example 2: Parse bounding box from model output
from lmms_eval.tasks.groundingme.utils import parse_bbox

# JSON format (preferred)
output1 = '{"bbox_2d": [120, 160, 310, 390]}'
bbox1 = parse_bbox(output1)
# Returns: [120.0, 160.0, 310.0, 390.0]

# Null/no object format
output2 = '{"bbox_2d": null}'
bbox2 = parse_bbox(output2)
# Returns: [0, 0, 0, 0]

# Raw coordinates fallback
output3 = "The object is at coordinates 120 160 310 390"
bbox3 = parse_bbox(output3)
# Returns: [120.0, 160.0, 310.0, 390.0]

# Example 3: Compute IoU
from lmms_eval.tasks.groundingme.utils import compute_iou, compute_accuracy

ground_truth = [100, 150, 300, 400]
prediction = [120, 160, 310, 390]

iou = compute_iou(ground_truth, prediction)
# Returns: ~0.79 (intersection 41400 / union 52300)

acc_50 = compute_accuracy(iou, threshold=0.5)
# Returns: True (0.79 >= 0.5)

acc_90 = compute_accuracy(iou, threshold=0.9)
# Returns: False (0.79 < 0.9)

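# Example 4: Process results with automatic format detection
# (reuses `doc` from Example 1)
from lmms_eval.tasks.groundingme.utils import groundingme_process_result
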
result = ['{"bbox_2d": [0.1, 0.2, 0.3, 0.4]}']  # Normalized format
processed = groundingme_process_result(doc, result)
# Returns dict with 65 keys (one for each metric)
# Each value contains: subtask_l1, subtask_l2, description, pred, ann_id,
#                      bbox, iou, center_acc, acc_5, acc_75, acc_9, macc

# The function tries 4 coordinate interpretations and picks best IoU:
# 1. Original values
# 2. Normalized [0,1] or [0,999] to pixels
# 3. MIMO resized space to original
# 4. Qwen resized space to original

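# Example 5: Aggregate metrics
from lmms_eval.tasks.groundingme.utils import (
    groundingme_iou,
    groundingme_acc05,
    groundingme_discriminative_macc,
    groundingme_d_appearance_acc05,
    groundingme_relationship_acc075,
)
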
results_list = [
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.85, "acc_5": True, "acc_75": True, "acc_9": False, "macc": 0.75},
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.65, "acc_5": True, "acc_75": False, "acc_9": False, "macc": 0.55},
    {"subtask_l1": "Spatial", "subtask_l2": "Relationship",
     "iou": 0.55, "acc_5": True, "acc_75": False, "acc_9": False, "macc": 0.50}
]

# Overall IoU
overall_iou = groundingme_iou(results_list)
# Returns: 0.683 (average across all samples)

# Overall accuracy at 0.5
overall_acc05 = groundingme_acc05(results_list)
# Returns: 1.0 (all samples have acc_5=True)

# Discriminative subtask mean accuracy
disc_macc = groundingme_discriminative_macc(results_list)
# Returns: 0.65 (average of 0.75 and 0.55, only Discriminative samples)

# Discriminative-Appearance accuracy at 0.5
d_app_acc05 = groundingme_d_appearance_acc05(results_list)
# Returns: 1.0 (both Discriminative-Appearance samples have acc_5=True)

# Spatial-Relationship accuracy at 0.75
rel_acc075 = groundingme_relationship_acc075(results_list)
# Returns: 0.0 (Relationship sample has acc_75=False)
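
The aggregation calls in Example 5 all follow one pattern: filter the samples by subtask, then average one metric. The sketch below reproduces that pattern with a single hypothetical helper, `aggregate_sketch`. It is not the module's groundingme_aggregation_result, and returning 0.0 for an empty selection is an assumption.

```python
from typing import Any, Dict, List, Optional


def aggregate_sketch(
    results: List[Dict[str, Any]],
    metric: str,
    subtask_l1: Optional[str] = None,
    subtask_l2: Optional[str] = None,
) -> float:
    """Average `metric` over the samples matching the optional subtask filters."""
    selected = [
        r for r in results
        if (subtask_l1 is None or r.get("subtask_l1") == subtask_l1)
        and (subtask_l2 is None or r.get("subtask_l2") == subtask_l2)
    ]
    if not selected:
        return 0.0  # assumption: an empty selection scores 0.0
    # Booleans such as acc_5 average as 1.0 / 0.0.
    return sum(float(r[metric]) for r in selected) / len(selected)


results_list = [
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.85, "acc_5": True, "acc_75": True, "acc_9": False, "macc": 0.75},
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.65, "acc_5": True, "acc_75": False, "acc_9": False, "macc": 0.55},
    {"subtask_l1": "Spatial", "subtask_l2": "Relationship",
     "iou": 0.55, "acc_5": True, "acc_75": False, "acc_9": False, "macc": 0.50},
]

print(round(aggregate_sketch(results_list, "iou"), 3))                                # 0.683
print(round(aggregate_sketch(results_list, "macc", subtask_l1="Discriminative"), 2))  # 0.65
print(aggregate_sketch(results_list, "acc_75", "Spatial", "Relationship"))            # 0.0
```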

Related Pages

Principle:EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions
