Implementation:EvolvingLMMs Lab Lmms eval ScreenSpot Utils Rec

Source File: `lmms_eval/tasks/screenspot/utils_rec.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The ScreenSpot Utils Rec module implements referring expression comprehension (REC) evaluation for GUI elements. Given a natural language instruction and a screenshot, the model must predict the bounding box coordinates of the target UI element. Predictions are scored with three kinds of spatial metric: mean IoU, accuracy at IoU thresholds from 0.1 to 0.9, and center point accuracy.

Key Functions

Document Processing

screenspot_rec_doc_to_visual(doc)

Prepares image without annotations

Converts image to RGB format
Returns clean image (no bounding box drawn)
Model must locate element based on text instruction alone

screenspot_rec_doc_to_text(doc)

Generates the question prompt with format specification

Provides bounding box format instructions
Specifies coordinate system (0-1 normalized, 2 decimal places)
Appends the user instruction for the target element
Returns formatted prompt string
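
The prompt assembly described above can be sketched as follows. This is a minimal illustration, not the module's actual template: the wording of FORMAT_HINT, the helper name build_rec_prompt, and the "instruction" field name are all assumptions made for the example.

```python
# Hypothetical prompt wording; the real template is defined in utils_rec.py.
FORMAT_HINT = (
    "Return the bounding box of the described element as [x1, y1, x2, y2], "
    "with each value normalized to the 0-1 range and rounded to 2 decimal places."
)


def build_rec_prompt(doc: dict) -> str:
    """Combine the format instructions with the per-sample instruction."""
    # "instruction" is an assumed field name for the user instruction.
    return f"{FORMAT_HINT} Instruction: {doc['instruction']}"


prompt = build_rec_prompt({"instruction": "open the settings menu"})
print(prompt)
```

Stating the output format in the prompt is what lets a single regex recover the box from the response later on.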

Prediction Parsing

parse_float_sequence_within(input_str)

Extracts bounding box coordinates from model response

Uses regex to find pattern: [float, float, float, float]
Pattern: \[x1, y1, x2, y2\] with optional whitespace
Handles negative numbers and decimal values
Returns list of 4 floats if found, otherwise [0, 0, 0, 0]
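
A minimal sketch of this kind of extraction is shown below, assuming a pattern of four comma-separated numbers inside square brackets. The helper name parse_bbox and the exact regex are illustrative and may differ from the one in utils_rec.py.

```python
import re

# One signed integer or decimal number, captured as a group.
_NUM = r"(-?\d+(?:\.\d+)?)"
# Four such numbers, comma-separated, inside square brackets.
_BBOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def parse_bbox(text: str) -> list:
    """Return the first [x1, y1, x2, y2] found in text, else a zero box."""
    match = _BBOX_PATTERN.search(text)
    if match:
        return [float(group) for group in match.groups()]
    # Fallback when no bracketed four-number sequence is present.
    return [0, 0, 0, 0]


print(parse_bbox("The box is [0.12, 0.30, 0.45, 0.52]."))
print(parse_bbox("no box here"))
```

In this sketch a number written without a leading digit (such as .5) does not match, so such a response falls through to the zero box.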

Results Processing

screenspot_rec_process_result(doc, result)

Processes model prediction for a single instance

Extracts prediction string and parses to bbox coordinates
Collects metadata: instruction, annotation ID, ground truth bbox, data type, data source
Returns a dictionary mapping each metric name to the same data dictionary
This lets every metric be aggregated independently, overall and per category
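
The shape of the returned value can be sketched like this. The record keys ("instruction", "ann_id", "bbox", "data_type", "data_source", "pred"), the use of bare metric names as output keys, and the helper name build_metric_records are assumptions for illustration; the real function also parses the raw response string first.

```python
REC_METRICS = ["IoU", "ACC@0.1", "ACC@0.3", "ACC@0.5",
               "ACC@0.7", "ACC@0.9", "Center_ACC"]


def build_metric_records(doc: dict, pred_bbox: list) -> dict:
    """Attach one shared record to every metric name."""
    record = {
        "instruction": doc["instruction"],  # assumed field names throughout
        "ann_id": doc["ann_id"],
        "bbox": doc["bbox"],                # ground truth box
        "data_type": doc["data_type"],      # e.g. text or icon
        "data_source": doc["data_source"],  # e.g. ios, windows
        "pred": pred_bbox,                  # parsed prediction
    }
    # Every metric receives the same record; scoring happens at aggregation.
    return {metric: record for metric in REC_METRICS}


doc = {"instruction": "open settings", "ann_id": "sample-1",
       "bbox": [0.1, 0.1, 0.3, 0.2], "data_type": "icon", "data_source": "ios"}
out = build_metric_records(doc, [0.1, 0.2, 0.3, 0.4])
```

Deferring the actual scoring to the aggregation step keeps the per-sample function cheap and lets each metric reuse the same parsed prediction.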

Geometric Metrics

compute_iou(box1, box2)

Computes Intersection over Union of two bounding boxes

Input format: [x_min, y_min, x_max, y_max]
Calculates intersection area using max/min operations
Computes union as: area1 + area2 - intersection
Returns IoU ratio (0.0 to 1.0)
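
The computation above can be sketched as follows. The function name iou and the zero-union guard are choices made for this example and are not confirmed details of compute_iou.

```python
def iou(box1, box2):
    """Intersection over Union for boxes given as [x_min, y_min, x_max, y_max]."""
    # Corners of the overlapping rectangle (if any).
    x_left = max(box1[0], box2[0])
    y_top = max(box1[1], box2[1])
    x_right = min(box1[2], box2[2])
    y_bottom = min(box1[3], box2[3])

    # Clamp to zero so disjoint boxes give an empty intersection.
    intersection = max(0.0, x_right - x_left) * max(0.0, y_bottom - y_top)

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    # Guard added for this sketch: two degenerate boxes would give union == 0.
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 4 + 4 - 1 = 7.
print(iou([0, 0, 2, 2], [1, 1, 3, 3]))
```

As a worked case, the two boxes in the example overlap in a unit square, so the ratio is 1/7.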

compute_accuracy(box1, box2, threshold=0.5)

Binary accuracy based on IoU threshold

Computes IoU between boxes
Returns True if IoU ≥ threshold, False otherwise
Used for threshold-based accuracy metrics
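
To illustrate how one prediction fares at several thresholds, here is a small sketch. The names iou and accuracy_at are illustrative, and the compact IoU helper is included only so the example is self-contained.

```python
def iou(a, b):
    """Compact IoU for [x_min, y_min, x_max, y_max] boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, w) * max(0.0, h)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(gt, pred, threshold=0.5):
    """True when the prediction overlaps the ground truth with IoU >= threshold."""
    return iou(gt, pred) >= threshold


gt = [0.0, 0.0, 1.0, 1.0]
pred = [0.0, 0.0, 1.0, 0.5]  # covers the top half of gt, so IoU is 0.5

hits = {t: accuracy_at(gt, pred, t) for t in (0.1, 0.3, 0.5, 0.7, 0.9)}
print(hits)
```

Because the comparison is inclusive, a prediction whose IoU equals the threshold counts as a hit, which is why this example passes at 0.5 but fails at 0.7 and 0.9.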

compute_center_accuracy(box1, box2)

Checks if predicted box's center falls within ground truth box

Calculates center point of box2 (the predicted box): ((x_min + x_max) / 2, (y_min + y_max) / 2)
Returns True if that center point lies within the boundaries of box1 (the ground truth box)
More lenient than full box overlap
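
A minimal sketch of the center check, assuming the argument order implied above (ground truth first, prediction second). The name center_in_box is illustrative.

```python
def center_in_box(gt, pred):
    """True when the center of pred lies inside gt (boundaries inclusive)."""
    center_x = (pred[0] + pred[2]) / 2
    center_y = (pred[1] + pred[3]) / 2
    return gt[0] <= center_x <= gt[2] and gt[1] <= center_y <= gt[3]


gt = [0.2, 0.2, 0.4, 0.4]
wide_pred = [0.1, 0.1, 0.5, 0.5]  # too large, but centered on the target
far_pred = [0.5, 0.5, 0.9, 0.9]   # centered outside the target

print(center_in_box(gt, wide_pred))
print(center_in_box(gt, far_pred))
```

The oversized prediction in this example has an IoU of roughly 0.25 with the ground truth, so it would miss at a 0.5 threshold while still passing the center check. That is the sense in which this metric is more lenient.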

Metrics Aggregation

screenspot_rec_aggregation_result(results, metric)

Aggregates predictions and computes specified metric

Creates scorer functions for each metric type
Tracks overall scores and category-specific scores:
- Mobile (iOS, Android) vs Desktop (macOS, Windows) vs Web
- Text elements vs Icon elements
Iterates through results, applying metric scorer
Computes mean scores for each category
Prints detailed per-category results
Returns overall mean score
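
The aggregation flow can be sketched as below. The record keys, the PLATFORM mapping, the rule that any unlisted source counts as web, and the toy exact-match scorer are all assumptions made to keep the example short; they are not taken from utils_rec.py.

```python
from collections import defaultdict

# Assumed mapping from data source to platform group; anything else is
# treated as web in this sketch.
PLATFORM = {"ios": "mobile", "android": "mobile",
            "macos": "desktop", "windows": "desktop"}


def aggregate(results, scorer):
    """Score every record, print per-category means, return the overall mean."""
    buckets = defaultdict(list)
    for record in results:
        score = float(scorer(record["bbox"], record["pred"]))
        platform = PLATFORM.get(record["data_source"].lower(), "web")
        buckets["overall"].append(score)
        buckets[f"{platform}_{record['data_type']}"].append(score)

    means = {name: sum(vals) / len(vals) for name, vals in buckets.items()}
    for name, value in sorted(means.items()):
        print(f"{name}: {value:.4f}")
    return means.get("overall", 0.0)


results = [
    {"bbox": [0, 0, 1, 1], "pred": [0, 0, 1, 1],
     "data_type": "text", "data_source": "ios"},
    {"bbox": [0, 0, 1, 1], "pred": [2, 2, 3, 3],
     "data_type": "icon", "data_source": "windows"},
]

# Toy scorer for the demo: 1 for an exact match, 0 otherwise.
overall = aggregate(results, lambda gt, pred: gt == pred)
```

Passing the scorer in as a function is one way to let a single aggregation routine serve every metric, with thin per-metric wrappers choosing which scorer to apply.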

Metric-Specific Functions

screenspot_rec_iou(results)

Average Intersection over Union score

Measures continuous overlap quality
Range: 0.0 (no overlap) to 1.0 (perfect match)

screenspot_rec_acc01(results), screenspot_rec_acc03(results), screenspot_rec_acc05(results), screenspot_rec_acc07(results), screenspot_rec_acc09(results)

Accuracy at different IoU thresholds

ACC@0.1: Very lenient (IoU ≥ 0.1)
ACC@0.3: Lenient (IoU ≥ 0.3)
ACC@0.5: Standard (IoU ≥ 0.5)
ACC@0.7: Strict (IoU ≥ 0.7)
ACC@0.9: Very strict (IoU ≥ 0.9)

screenspot_rec_center_acc(results)

Center point accuracy

Measures if predicted box's center is within ground truth box
Useful for click-based interactions
More forgiving than full box IoU

Configuration

Active Metrics

REC_METRICS = ["IoU", "ACC@0.1", "ACC@0.3", "ACC@0.5",
               "ACC@0.7", "ACC@0.9", "Center_ACC"]

Category Breakdown

For each metric, scores are computed for:

Overall performance
Mobile text / Mobile icon
Web text / Web icon
Desktop text / Desktop icon

Design Characteristics

Multi-Threshold Evaluation: Tests model performance at various precision levels
Category-Specific Analysis: Breaks down performance by platform and element type
Robust Parsing: Regex-based extraction finds a bracketed four-number sequence even when it is embedded in surrounding response text
Lenient Fallback: Returns zero bbox when parsing fails (rather than crashing)
Center-Based Metric: Includes click-oriented metric useful for GUI automation
Comprehensive Reporting: Prints per-category scores during aggregation

Dependencies

re - Python standard-library regular expression module, used for coordinate extraction

Usage Context

This module supports the ScreenSpot REC task where models must locate GUI elements given natural language instructions. It evaluates spatial understanding across different platforms (mobile, web, desktop) and element types (text, icons), using multiple metrics to assess localization accuracy at various precision levels.
