Implementation:EvolvingLMMs Lab Lmms eval ScreenSpot Utils Rec
Source File: `lmms_eval/tasks/screenspot/utils_rec.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The ScreenSpot Utils Rec module implements referring expression comprehension (REC) evaluation for GUI elements. Given an instruction, a model must predict the bounding box coordinates of the target UI element. Evaluation uses spatial metrics: mean IoU, accuracy at IoU thresholds of 0.1, 0.3, 0.5, 0.7, and 0.9, and center point accuracy.
Key Functions
Document Processing
`screenspot_rec_doc_to_visual(doc)` - Prepares image without annotations
- Converts image to RGB format
- Returns clean image (no bounding box drawn)
- Model must locate element based on text instruction alone
`screenspot_rec_doc_to_text(doc)` - Generates the question prompt with format specification
- Provides bounding box format instructions
- Specifies coordinate system (0-1 normalized, 2 decimal places)
- Appends the user instruction for the target element
- Returns formatted prompt string
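As a rough illustration of the prompt construction described above, the sketch below joins a format specification and the user instruction. The function name `build_rec_prompt`, the exact wording of the prompt, and the `instruction` field name are assumptions made for this example, not the module's actual text.

```python
def build_rec_prompt(doc: dict) -> str:
    """Hypothetical stand-in for screenspot_rec_doc_to_text."""
    # Assumed wording: states the box format and the normalized coordinate system.
    format_spec = (
        "Output the bounding box of the region described by the instruction "
        "as [x1, y1, x2, y2], where every value is a float between 0 and 1 "
        "with two decimal places (e.g., 0.15). "
    )
    # The user instruction for the target element is appended last.
    return format_spec + "The instruction is: " + doc["instruction"]


prompt = build_rec_prompt({"instruction": "open the settings menu"})
```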
Prediction Parsing
`parse_float_sequence_within(input_str)` - Extracts bounding box coordinates from model response
- Uses regex to find pattern: `[float, float, float, float]`
- Pattern: `\[x1, y1, x2, y2\]` with optional whitespace
- Handles negative numbers and decimal values
- Returns list of 4 floats if found, otherwise `[0, 0, 0, 0]`
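The parsing behavior described above could be sketched as follows. The regular expression and the name `parse_bbox` are assumptions for illustration; the module's actual pattern may differ in detail.

```python
import re
from typing import List

# Assumed pattern: four comma-separated numbers inside square brackets,
# each optionally negative and optionally with a decimal part.
_NUM = r"(-?\d+(?:\.\d+)?)"
_BBOX_PATTERN = re.compile(
    r"\[\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*,\s*" + _NUM + r"\s*\]"
)


def parse_bbox(text: str) -> List[float]:
    """Return the first [x1, y1, x2, y2] found in text, or zeros if none."""
    match = _BBOX_PATTERN.search(text)
    if match is None:
        # Lenient fallback: a zero box instead of raising an error.
        return [0.0, 0.0, 0.0, 0.0]
    return [float(group) for group in match.groups()]
```

For example, `parse_bbox("The box is [0.12, 0.34, 0.56, 0.78].")` yields the four floats, while a response with no bracketed sequence yields the zero box.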
Results Processing
`screenspot_rec_process_result(doc, result)` - Processes model prediction for a single instance
- Extracts prediction string and parses to bbox coordinates
- Collects metadata: instruction, annotation ID, ground truth bbox, data type, data source
- Returns dictionary mapping each metric to the data dictionary
- Enables per-metric and per-category evaluation
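A minimal sketch of this step, under stated assumptions: the field names (`instruction`, `ann_id`, `bbox`, `data_type`, `data_source`, `pred`), the use of the bare metric name as the dictionary key, and the simplified `_parse_bbox` helper are all hypothetical and chosen only to make the example self-contained.

```python
import re
from typing import Dict, List

REC_METRICS = ["IoU", "ACC@0.1", "ACC@0.3", "ACC@0.5", "ACC@0.7", "ACC@0.9", "Center_ACC"]


def _parse_bbox(text: str) -> List[float]:
    # Simplified stand-in parser: first four numbers in the text, else zeros.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return [float(n) for n in nums[:4]] if len(nums) >= 4 else [0.0] * 4


def process_result(doc: dict, result: List[str]) -> Dict[str, dict]:
    """Hypothetical stand-in for screenspot_rec_process_result."""
    raw = result[0] if result else ""  # prediction string, empty if missing
    data = {
        "instruction": doc["instruction"],
        "ann_id": doc["ann_id"],
        "bbox": doc["bbox"],  # ground truth box
        "data_type": doc["data_type"],
        "data_source": doc["data_source"],
        "pred": _parse_bbox(raw),
    }
    # Every metric receives the same record, so each aggregation function
    # can score it independently and split it by category.
    return {metric: data for metric in REC_METRICS}
```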
Geometric Metrics
`compute_iou(box1, box2)` - Computes Intersection over Union of two bounding boxes
- Input format: `[x_min, y_min, x_max, y_max]`
- Calculates intersection area using max/min operations
- Computes union as: area1 + area2 - intersection
- Returns IoU ratio (0.0 to 1.0)
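The IoU computation described above can be sketched as below. The function name `iou` is illustrative, and the guard against a zero-area union is an addition for this example that the module itself may not include.

```python
from typing import Sequence


def iou(box1: Sequence[float], box2: Sequence[float]) -> float:
    """Intersection over Union for boxes given as [x_min, y_min, x_max, y_max]."""
    # Corners of the intersection rectangle via max/min.
    x_left = max(box1[0], box2[0])
    y_top = max(box1[1], box2[1])
    x_right = min(box1[2], box2[2])
    y_bottom = min(box1[3], box2[3])

    # Clamp to zero so disjoint boxes produce no intersection area.
    intersection = max(0.0, x_right - x_left) * max(0.0, y_bottom - y_top)

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection

    # Assumed guard: two degenerate (zero-area) boxes score 0.0.
    return intersection / union if union > 0 else 0.0
```

As a worked case, `[0, 0, 2, 2]` and `[1, 1, 3, 3]` overlap in a 1 x 1 square, so the IoU is 1 / (4 + 4 - 1) = 1/7, about 0.143.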
`compute_accuracy(box1, box2, threshold=0.5)` - Binary accuracy based on IoU threshold
- Computes IoU between boxes
- Returns True if IoU ≥ threshold, False otherwise
- Used for threshold-based accuracy metrics
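A compact sketch of the thresholding step follows. The names `_iou` and `accuracy_at` are hypothetical, and the IoU helper is repeated here only so the example runs on its own.

```python
from typing import Sequence


def _iou(a: Sequence[float], b: Sequence[float]) -> float:
    # Minimal IoU helper for [x_min, y_min, x_max, y_max] boxes.
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(box1: Sequence[float], box2: Sequence[float], threshold: float = 0.5) -> bool:
    """True when the IoU of the two boxes meets or exceeds the threshold."""
    return _iou(box1, box2) >= threshold


# The same pair of boxes (IoU = 1/7) passes a lenient threshold
# and fails the standard one.
lenient = accuracy_at([0, 0, 2, 2], [1, 1, 3, 3], threshold=0.1)
standard = accuracy_at([0, 0, 2, 2], [1, 1, 3, 3], threshold=0.5)
```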
`compute_center_accuracy(box1, box2)` - Checks if predicted box's center falls within ground truth box
- Calculates center point of box2 (the prediction): `((x_min + x_max) / 2, (y_min + y_max) / 2)`
- Returns True if center point is within box1 (the ground truth) boundaries
- More lenient than full box overlap
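The center check described above might look like the sketch below. The name `center_in_box` and the choice to treat boundary points as inside (inclusive comparisons) are assumptions for this example.

```python
from typing import Sequence


def center_in_box(gt_box: Sequence[float], pred_box: Sequence[float]) -> bool:
    """True when the center of pred_box lies inside gt_box (boundaries included)."""
    center_x = (pred_box[0] + pred_box[2]) / 2
    center_y = (pred_box[1] + pred_box[3]) / 2
    return gt_box[0] <= center_x <= gt_box[2] and gt_box[1] <= center_y <= gt_box[3]


gt = [0.4, 0.4, 0.6, 0.6]

# A prediction covering the whole screen has low IoU with gt,
# yet its center (0.5, 0.5) still lands inside the target.
oversized_hit = center_in_box(gt, [0.0, 0.0, 1.0, 1.0])

# A tight box placed elsewhere misses.
offset_miss = center_in_box(gt, [0.7, 0.7, 0.9, 0.9])
```

Under this sketch, the zero box returned on a parse failure has its center at the origin, so it only counts as a hit when the ground truth box contains the point (0, 0).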
Metrics Aggregation
`screenspot_rec_aggregation_result(results, metric)` - Aggregates predictions and computes specified metric
- Creates scorer functions for each metric type
- Tracks overall scores and category-specific scores:
- Mobile (iOS, Android) vs Desktop (macOS, Windows) vs Web
- Text elements vs Icon elements
- Iterates through results, applying metric scorer
- Computes mean scores for each category
- Prints detailed per-category results
- Returns overall mean score
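The aggregation flow above can be approximated with the following sketch. The record field names, the lowercase `data_source` values, the fallback of unknown sources to "web", and the helper names `aggregate` and `center_hit` are all assumptions for illustration rather than the module's actual code.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Assumed mapping from data source to platform group.
PLATFORM_OF = {"ios": "mobile", "android": "mobile", "macos": "desktop", "windows": "desktop"}


def aggregate(results: List[dict], scorer: Callable[[Sequence[float], Sequence[float]], float]) -> float:
    """Score each record, print per-category means, return the overall mean."""
    if not results:
        return 0.0
    buckets: Dict[str, List[float]] = defaultdict(list)
    for record in results:
        score = float(scorer(record["bbox"], record["pred"]))
        buckets["overall"].append(score)
        platform = PLATFORM_OF.get(record["data_source"], "web")
        buckets[platform + "_" + record["data_type"]].append(score)
    for name in sorted(buckets):
        print(f"{name}: {mean(buckets[name]):.4f}")
    return mean(buckets["overall"])


def center_hit(gt: Sequence[float], pred: Sequence[float]) -> bool:
    # Example scorer: is the prediction's center inside the ground truth box?
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]


demo = [
    {"bbox": [0.4, 0.4, 0.6, 0.6], "pred": [0.45, 0.45, 0.55, 0.55], "data_source": "ios", "data_type": "text"},
    {"bbox": [0.4, 0.4, 0.6, 0.6], "pred": [0.0, 0.0, 0.0, 0.0], "data_source": "windows", "data_type": "icon"},
]
overall = aggregate(demo, center_hit)
```

Passing the scorer as an argument keeps one aggregation routine usable for every metric; only the scoring function changes.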
Metric-Specific Functions
`screenspot_rec_iou(results)` - Average Intersection over Union score
- Measures continuous overlap quality
- Range: 0.0 (no overlap) to 1.0 (perfect match)
`screenspot_rec_acc01(results)` through `screenspot_rec_acc09(results)` - Accuracy at different IoU thresholds
- ACC@0.1: Very lenient (10% overlap required)
- ACC@0.3: Lenient (30% overlap required)
- ACC@0.5: Standard (50% overlap required)
- ACC@0.7: Strict (70% overlap required)
- ACC@0.9: Very strict (90% overlap required)
`screenspot_rec_center_acc(results)` - Center point accuracy
- Measures if predicted box's center is within ground truth box
- Useful for click-based interactions
- More forgiving than full box IoU
Configuration
Active Metrics
REC_METRICS = ["IoU", "ACC@0.1", "ACC@0.3", "ACC@0.5",
"ACC@0.7", "ACC@0.9", "Center_ACC"]
Category Breakdown
For each metric, scores are computed for:
- Overall performance
- Mobile text / Mobile icon
- Web text / Web icon
- Desktop text / Desktop icon
Design Characteristics
- Multi-Threshold Evaluation: Tests model performance at various precision levels
- Category-Specific Analysis: Breaks down performance by platform and element type
- Robust Parsing: Regex-based extraction tolerates optional whitespace, negative numbers, and decimal values in the response
- Lenient Fallback: Returns zero bbox when parsing fails (rather than crashing)
- Center-Based Metric: Includes click-oriented metric useful for GUI automation
- Comprehensive Reporting: Prints per-category scores during aggregation
Dependencies
`re` - Regular expressions for coordinate extraction
Usage Context
This module supports the ScreenSpot REC task where models must locate GUI elements given natural language instructions. It evaluates spatial understanding across different platforms (mobile, web, desktop) and element types (text, icons), using multiple metrics to assess localization accuracy at various precision levels.