Implementation:EvolvingLMMs Lab Lmms eval GroundingMe Utils
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Object Grounding, Bounding Box Detection |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Bounding box evaluation utilities for the GroundingMe benchmark that assesses visual grounding across multiple subtasks.
Description
This module provides the evaluation infrastructure for GroundingMe, a benchmark that tests a model's ability to ground object descriptions to bounding boxes in images. It supports four main subtasks (Discriminative, Spatial, Limited visibility, and Rejection), which are further broken down into 13 categories (for example appearance, component, text, and state attributes; spatial relationships; counting; occlusion; small objects). The module handles document-to-input conversion, bounding box parsing from multiple coordinate formats (normalized [0,1], integer [0,999], MIMO resized space, Qwen resized space), IoU computation, accuracy at three IoU thresholds (0.5, 0.75, 0.9), and metric aggregation across all subtasks and categories. Format detection is automatic: each prediction is scored under several coordinate interpretations and the one with the highest IoU is kept.
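The format detection described above can be pictured as building one candidate box per coordinate interpretation and keeping the candidate that overlaps the ground truth best. The following is a minimal sketch under stated assumptions, not the module's code: the helper names `iou_xyxy`, `candidate_boxes`, and `best_interpretation` are hypothetical, the divisor used for the [0,999] format is an assumption, and the MIMO and Qwen resized-space candidates are left out for brevity.

```python
from typing import List, Tuple

Box = List[float]  # [x_min, y_min, x_max, y_max]


def iou_xyxy(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def candidate_boxes(pred: Box, width: int, height: int) -> List[Tuple[str, Box]]:
    """Return plausible pixel-space readings of a predicted box."""
    x1, y1, x2, y2 = pred
    return [
        # Reading 1: values are already pixels in the original image.
        ("pixels", [x1, y1, x2, y2]),
        # Reading 2: values are fractions of the image size.
        ("normalized_0_1", [x1 * width, y1 * height, x2 * width, y2 * height]),
        # Reading 3: values are integers on a 0..999 grid (divisor is an assumption).
        ("integer_0_999", [x1 / 999 * width, y1 / 999 * height,
                           x2 / 999 * width, y2 / 999 * height]),
    ]


def best_interpretation(pred: Box, gt: Box, width: int, height: int) -> Tuple[float, str, Box]:
    """Score every reading against the ground truth and keep the best one."""
    scored = [(iou_xyxy(box, gt), name, box)
              for name, box in candidate_boxes(pred, width, height)]
    return max(scored, key=lambda item: item[0])
```

Picking the best-scoring reading favors the model whenever its output format is ambiguous, which is worth keeping in mind when comparing scores against evaluations that fix a single coordinate convention.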
Usage
Use this module when evaluating visual grounding models on the GroundingMe benchmark. The workflow has four steps: (1) call groundingme_doc_to_visual and groundingme_doc_to_text to prepare the image and the description prompt, (2) collect model predictions as JSON with a bbox_2d field, (3) call groundingme_process_result to parse the boxes and compute IoU and accuracy metrics, and (4) call the aggregation functions (groundingme_iou, groundingme_acc05, groundingme_discriminative_macc, etc.) to compute final scores, both overall and per subtask or category.
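Step (3) depends on pulling a box out of free-form model output. One way this could work is sketched below, assuming the model was asked to reply with a JSON object carrying a `bbox_2d` field. The function `extract_bbox` is a hypothetical stand-in for illustration; the real `parse_bbox` may handle more cases or behave differently.

```python
import json
import re
from typing import List

# A standalone number: not glued to a word character (so the "2" in "bbox_2d" is skipped).
NUMBER = re.compile(r"(?<![\w.])-?\d+(?:\.\d+)?(?!\w)")
EMPTY_BOX = [0.0, 0.0, 0.0, 0.0]


def extract_bbox(text: str) -> List[float]:
    """Return [x1, y1, x2, y2] from model output, or an all-zero box if none is found."""
    # First choice: a flat JSON object that contains a "bbox_2d" field.
    for candidate in re.findall(r"\{[^{}]*\}", text):
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if "bbox_2d" in obj:
            value = obj["bbox_2d"]
            is_box = (
                isinstance(value, list)
                and len(value) == 4
                and all(isinstance(v, (int, float)) for v in value)
            )
            # A null or malformed field is treated as "no object predicted".
            return [float(v) for v in value] if is_box else list(EMPTY_BOX)
    # Fallback: take the first four standalone numbers in the text.
    numbers = NUMBER.findall(text)
    if len(numbers) >= 4:
        return [float(n) for n in numbers[:4]]
    return list(EMPTY_BOX)
```

Returning an all-zero box rather than raising keeps a single unparseable prediction from aborting the whole evaluation run; the sample then simply scores an IoU of 0 against any real ground-truth box.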
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/tasks/groundingme/utils.py
- Lines: 1-531
Signature
# Document conversion functions
def groundingme_doc_to_visual(doc: Dict[str, Any]) -> List[Any]:
    """Convert document to visual input for model evaluation."""

def groundingme_doc_to_text(doc: Dict[str, Any]) -> str:
    """Convert document to text prompt for model evaluation."""

# Result processing functions
def groundingme_process_result(doc, result):
    """Process prediction result and compute evaluation metrics."""

def parse_bbox(input_str: str) -> List[float]:
    """Extract bounding box from JSON format."""

# Coordinate conversion functions
def smart_resize_mimo(height: int, width: int, factor: int = 28,
                      min_pixels: int = 28*28*8, max_pixels: int = 28*28*4096):
    """Resize image for MIMO models with factor-divisible dimensions."""

def smart_resize_qwen(height: int, width: int, factor: int = 28,
                      min_pixels: int = 4*28*28, max_pixels: int = 16384*28*28):
    """Resize image for Qwen models with factor-divisible dimensions."""

def convert_bbox_from_mimo(bbox: List[float], width: int, height: int) -> List[float]:
    """Convert bbox coordinates from MIMO resized space to original image space."""

def convert_bbox_from_qwen(bbox: List[float], width: int, height: int) -> List[float]:
    """Convert bbox coordinates from Qwen resized space to original image space."""

# Metric computation functions
def compute_iou(box1, box2):
    """Compute Intersection over Union (IoU) of two bounding boxes."""

def compute_accuracy(iou, threshold=0.5):
    """Check if IoU meets the specified threshold."""

def compute_center_accuracy(box1, box2):
    """Check if the center point of box2 is within box1."""

# Aggregation functions
def groundingme_aggregation_result(results, metric):
    """Aggregate evaluation results by specified metric and subtask categories."""

def groundingme_iou(results):
    """Aggregate IoU scores."""

def groundingme_acc05(results):
    """Aggregate accuracy at 0.5 IoU threshold."""

def groundingme_acc075(results):
    """Aggregate accuracy at 0.75 IoU threshold."""

def groundingme_acc09(results):
    """Aggregate accuracy at 0.9 IoU threshold."""

def groundingme_center_acc(results):
    """Aggregate center accuracy."""

def groundingme_macc(results):
    """Aggregate mean accuracy across all thresholds."""

# Subtask-specific aggregations (L1)
def groundingme_discriminative_acc05(results):
    """Aggregate Discriminative subtask accuracy at 0.5."""

def groundingme_spatial_acc05(results):
    """Aggregate Spatial subtask accuracy at 0.5."""

def groundingme_limited_acc05(results):
    """Aggregate Limited subtask accuracy at 0.5."""

def groundingme_rejection_acc(results):
    """Aggregate Rejection subtask accuracy."""

# Category-specific aggregations (L2) - examples
def groundingme_d_appearance_acc05(results):
    """Aggregate Discriminative-Appearance accuracy at 0.5."""

def groundingme_d_component_macc(results):
    """Aggregate Discriminative-Component mean accuracy."""

def groundingme_relationship_acc075(results):
    """Aggregate Spatial-Relationship accuracy at 0.75."""

def groundingme_counting_macc(results):
    """Aggregate Spatial-Counting mean accuracy."""

def groundingme_occlusion_acc09(results):
    """Aggregate Limited-Occlusion accuracy at 0.9."""

def groundingme_small_macc(results):
    """Aggregate Limited-Small mean accuracy."""

def groundingme_r_appearance_acc(results):
    """Aggregate Rejection-Appearance accuracy."""
Import
from lmms_eval.tasks.groundingme.utils import (
    groundingme_doc_to_visual,
    groundingme_doc_to_text,
    groundingme_process_result,
    parse_bbox,
    compute_iou,
    groundingme_iou,
    groundingme_acc05,
    groundingme_discriminative_macc,
    groundingme_spatial_acc075,
    groundingme_d_appearance_acc05,
)
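The coordinate conversion functions in the signature list (smart_resize_qwen, convert_bbox_from_qwen, and their MIMO counterparts) map a box predicted in a model's resized image space back to the original image. The sketch below shows the general idea, assuming the rounding scheme used in the publicly released Qwen2-VL preprocessing: snap each side to a multiple of `factor`, then rescale if the pixel budget is exceeded or not met. The names `resize_to_factor` and `rescale_bbox` are hypothetical, and the (height, width) return order is an assumption rather than a statement about this module.

```python
import math
from typing import List, Tuple


def resize_to_factor(height: int, width: int, factor: int = 28,
                     min_pixels: int = 4 * 28 * 28,
                     max_pixels: int = 16384 * 28 * 28) -> Tuple[int, int]:
    """Return (new_height, new_width), both divisible by `factor`."""
    # Snap each side to the nearest multiple of `factor` (at least one tile).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar


def rescale_bbox(bbox: List[float], width: int, height: int) -> List[float]:
    """Map [x1, y1, x2, y2] from the resized space back to original pixels."""
    new_h, new_w = resize_to_factor(height, width)
    scale_x = width / new_w
    scale_y = height / new_h
    x1, y1, x2, y2 = bbox
    return [x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y]
```

For an 800 x 1200 image this sketch resizes to 812 x 1204, so a box predicted in the resized space shrinks only slightly when mapped back. The correction matters more for images that hit the pixel budget and are scaled by a larger ratio.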
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document with image, description, bbox, subtask_l1, subtask_l2, id, height, width |
| result | list | Yes | List with single prediction string containing JSON bbox |
| input_str | str | Yes | String to parse for bounding box (JSON or raw coordinates) |
| box1 | list | Yes | First bounding box [x_min, y_min, x_max, y_max] |
| box2 | list | Yes | Second bounding box [x_min, y_min, x_max, y_max] |
| height | int | Yes | Original image height in pixels |
| width | int | Yes | Original image width in pixels |
| threshold | float | No | IoU threshold for accuracy (default 0.5) |
Outputs
| Name | Type | Description |
|---|---|---|
| doc_to_visual return | list | List with single PIL Image in RGB format |
| doc_to_text return | str | Formatted prompt with description and instructions |
| process_result return | dict | Dict with 65 metric keys mapping to data_dict |
| parse_bbox return | list | Bounding box [x1, y1, x2, y2] or [0, 0, 0, 0] if not found |
| compute_iou return | float | IoU value between 0.0 and 1.0 |
| compute_accuracy return | bool | True if IoU >= threshold, else False |
| compute_center_accuracy return | bool | True if center of box2 is within box1 |
| convert_bbox functions return | list | Converted bounding box in original image coordinates |
| smart_resize functions return | tuple | (width, height) after resizing with constraints |
| aggregation functions return | float | Aggregated metric score across samples |
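The boolean outputs in the table above follow directly from their descriptions. The sketch below restates them as code, assuming inclusive box boundaries for the center test and using the per-sample key names `acc_5`, `acc_75`, and `acc_9`. The helpers `center_in_box` and `accuracy_flags` are hypothetical illustrations, not the module's functions.

```python
from typing import Dict, List


def center_in_box(box1: List[float], box2: List[float]) -> bool:
    """True if the center point of box2 lies inside box1 (edges count as inside)."""
    center_x = (box2[0] + box2[2]) / 2
    center_y = (box2[1] + box2[3]) / 2
    return box1[0] <= center_x <= box1[2] and box1[1] <= center_y <= box1[3]


def accuracy_flags(iou: float) -> Dict[str, bool]:
    """Threshold a single IoU value at 0.5, 0.75, and 0.9."""
    return {
        "acc_5": iou >= 0.5,
        "acc_75": iou >= 0.75,
        "acc_9": iou >= 0.9,
    }
```

Center accuracy is the more forgiving of the two checks: a prediction with the wrong size can still pass it as long as it is centered on the target.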
Usage Examples
# Example 1: Document processing
from PIL import Image
from lmms_eval.tasks.groundingme.utils import (
    groundingme_doc_to_visual,
    groundingme_doc_to_text,
)
doc = {
    "id": "groundingme_001",
    "image": Image.open("image.jpg"),
    "description": "A red car on the left side of the street",
    "bbox": [100, 150, 300, 400],
    "subtask_l1": "Discriminative",
    "subtask_l2": "Appearance",
    "height": 800,
    "width": 1200,
}
# Get visual input
visual = groundingme_doc_to_visual(doc)
# Returns: [<PIL.Image.Image in RGB>]
# Get text prompt
text = groundingme_doc_to_text(doc)
# Returns: "All spatial relationships are defined from the viewer's perspective...
# A red car on the left side of the street
# Provide bbox as JSON: {\"bbox_2d\": [x1, y1, x2, y2]}..."
# Example 2: Parse bounding box from model output
from lmms_eval.tasks.groundingme.utils import parse_bbox
# JSON format (preferred)
output1 = '{"bbox_2d": [120, 160, 310, 390]}'
bbox1 = parse_bbox(output1)
# Returns: [120.0, 160.0, 310.0, 390.0]
# Null/no object format
output2 = '{"bbox_2d": null}'
bbox2 = parse_bbox(output2)
# Returns: [0, 0, 0, 0]
# Raw coordinates fallback
output3 = "The object is at coordinates 120 160 310 390"
bbox3 = parse_bbox(output3)
# Returns: [120.0, 160.0, 310.0, 390.0]
# Example 3: Compute IoU
from lmms_eval.tasks.groundingme.utils import compute_iou, compute_accuracy
ground_truth = [100, 150, 300, 400]
prediction = [120, 160, 310, 390]
iou = compute_iou(ground_truth, prediction)
# Returns: ~0.79 (intersection 41400 / union 52300)
acc_50 = compute_accuracy(iou, threshold=0.5)
# Returns: True (0.79 >= 0.5)
acc_90 = compute_accuracy(iou, threshold=0.9)
# Returns: False (0.79 < 0.9)
# Example 4: Process results with automatic format detection
result = ['{"bbox_2d": [0.1, 0.2, 0.3, 0.4]}'] # Normalized format
processed = groundingme_process_result(doc, result)
# Returns dict with 65 keys (one for each metric)
# Each value contains: subtask_l1, subtask_l2, description, pred, ann_id,
# bbox, iou, center_acc, acc_5, acc_75, acc_9, macc
# The function tries 4 coordinate interpretations and picks best IoU:
# 1. Original values
# 2. Normalized [0,1] or [0,999] to pixels
# 3. MIMO resized space to original
# 4. Qwen resized space to original
# Example 5: Aggregate metrics
results_list = [
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.85, "acc_5": True, "acc_75": True, "acc_9": False, "macc": 0.75},
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.65, "acc_5": True, "acc_75": False, "acc_9": False, "macc": 0.55},
    {"subtask_l1": "Spatial", "subtask_l2": "Relationship",
     "iou": 0.55, "acc_5": True, "acc_75": False, "acc_9": False, "macc": 0.50},
]
# Overall IoU
overall_iou = groundingme_iou(results_list)
# Returns: 0.683 (average across all samples)
# Overall accuracy at 0.5
overall_acc05 = groundingme_acc05(results_list)
# Returns: 1.0 (all samples have acc_5=True)
# Discriminative subtask mean accuracy
disc_macc = groundingme_discriminative_macc(results_list)
# Returns: 0.65 (average of 0.75 and 0.55, only Discriminative samples)
# Discriminative-Appearance accuracy at 0.5
d_app_acc05 = groundingme_d_appearance_acc05(results_list)
# Returns: 1.0 (both Discriminative-Appearance samples have acc_5=True)
# Spatial-Relationship accuracy at 0.75
rel_acc075 = groundingme_relationship_acc075(results_list)
# Returns: 0.0 (Relationship sample has acc_75=False)
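The aggregation results in Example 5 are consistent with a simple rule: keep the samples that match the requested subtask or category, then average the chosen metric. A minimal sketch of that rule follows; `mean_metric` is a hypothetical helper, and returning 0.0 when no sample matches is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List, Optional


def mean_metric(results: List[Dict[str, Any]], metric: str,
                subtask_l1: Optional[str] = None,
                subtask_l2: Optional[str] = None) -> float:
    """Average `metric` over the samples matching the given subtask filters."""
    selected = [
        r for r in results
        if (subtask_l1 is None or r["subtask_l1"] == subtask_l1)
        and (subtask_l2 is None or r["subtask_l2"] == subtask_l2)
    ]
    if not selected:
        return 0.0  # Assumption: an empty group scores zero instead of raising.
    # Boolean metrics such as acc_5 count as 1.0 or 0.0.
    return sum(float(r[metric]) for r in selected) / len(selected)


samples = [
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.85, "acc_5": True, "acc_75": True, "macc": 0.75},
    {"subtask_l1": "Discriminative", "subtask_l2": "Appearance",
     "iou": 0.65, "acc_5": True, "acc_75": False, "macc": 0.55},
    {"subtask_l1": "Spatial", "subtask_l2": "Relationship",
     "iou": 0.55, "acc_5": True, "acc_75": False, "macc": 0.50},
]

overall_iou = mean_metric(samples, "iou")
discriminative_macc = mean_metric(samples, "macc", subtask_l1="Discriminative")
relationship_acc075 = mean_metric(samples, "acc_75", subtask_l2="Relationship")
```

Because each group is averaged over its own samples, a category with very few samples can swing its score sharply, so per-category numbers are best read alongside the sample counts.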