Implementation:EvolvingLMMs Lab Lmms eval ScreenSpot Utils Rec

From Leeroopedia

Source File: `lmms_eval/tasks/screenspot/utils_rec.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The ScreenSpot Utils Rec module implements referring expression comprehension (REC) evaluation for GUI elements. Given an instruction, a model must predict the bounding box coordinates of the target UI element. Evaluation uses spatial metrics: mean IoU, accuracy at IoU thresholds from 0.1 to 0.9, and center-point accuracy.

Key Functions

Document Processing

screenspot_rec_doc_to_visual(doc)
Prepares image without annotations
  • Converts image to RGB format
  • Returns clean image (no bounding box drawn)
  • Model must locate element based on text instruction alone
screenspot_rec_doc_to_text(doc)
Generates the question prompt with format specification
  • Provides bounding box format instructions
  • Specifies coordinate system (0-1 normalized, 2 decimal places)
  • Appends the user instruction for the target element
  • Returns formatted prompt string
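The prompt construction described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name `build_rec_prompt`, the exact prompt wording, and the `"instruction"` field name are assumptions made for the example.

```python
def build_rec_prompt(doc):
    """Hypothetical stand-in for screenspot_rec_doc_to_text."""
    # Assumed wording: the real prompt text may differ.
    format_spec = (
        "Output the bounding box for the following instruction "
        "in the form [x1, y1, x2, y2], with all values normalized "
        "between 0 and 1 and rounded to two decimal places. "
    )
    # Assumed field name: the user instruction lives under "instruction".
    return format_spec + doc["instruction"]


example_doc = {"instruction": "click the search button"}
prompt = build_rec_prompt(example_doc)
```

Placing the format specification before the instruction keeps the expected output shape identical across every sample, which is what lets a single regex handle parsing later.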

Prediction Parsing

parse_float_sequence_within(input_str)
Extracts bounding box coordinates from model response
  • Uses regex to find pattern: [float, float, float, float]
  • Pattern: \[x1, y1, x2, y2\] with optional whitespace
  • Handles negative numbers and decimal values
  • Returns list of 4 floats if found, otherwise [0, 0, 0, 0]
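A parser with the behavior listed above might look like the sketch below. The regex and the name `parse_bbox` are assumptions for illustration; the module's actual pattern may differ in detail.

```python
import re

# One signed integer or decimal number, captured as a group.
NUMBER = r"(-?\d+(?:\.\d+)?)"
# Four such numbers, comma-separated, inside square brackets,
# with optional whitespace around each number.
BBOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([NUMBER] * 4) + r"\s*\]")


def parse_bbox(response):
    """Return [x1, y1, x2, y2] from the response, or zeros if none is found."""
    match = BBOX_PATTERN.search(response)
    if match is None:
        # Fallback described in the text: a zero box instead of an error.
        return [0.0, 0.0, 0.0, 0.0]
    return [float(group) for group in match.groups()]


parsed = parse_bbox("The element is at [0.12, 0.30, 0.45, 0.60].")
```

Because the pattern is searched rather than matched against the whole string, text before or after the bracketed list does not prevent extraction.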

Results Processing

screenspot_rec_process_result(doc, result)
Processes model prediction for a single instance
  • Extracts prediction string and parses to bbox coordinates
  • Collects metadata: instruction, annotation ID, ground truth bbox, data type, data source
  • Returns dictionary mapping each metric to the data dictionary
  • Enables per-metric and per-category evaluation
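The per-instance processing step can be sketched as below. The dataset field names (`"file_name"`, `"bbox"`, `"data_type"`, `"data_source"`), the output keys, and the helper names are assumptions for illustration; the real module may prefix its metric keys with the task name.

```python
import re

REC_METRICS = ["IoU", "ACC@0.1", "ACC@0.3", "ACC@0.5",
               "ACC@0.7", "ACC@0.9", "Center_ACC"]

_NUMBER = r"(-?\d+(?:\.\d+)?)"
_BBOX = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUMBER] * 4) + r"\s*\]")


def parse_bbox(response):
    """Hypothetical parser: four bracketed floats, else a zero box."""
    match = _BBOX.search(response)
    return [float(g) for g in match.groups()] if match else [0.0] * 4


def process_result(doc, result):
    """Hypothetical stand-in for screenspot_rec_process_result."""
    # Assumption: `result` is a list whose first item is the model response.
    response = result[0] if result else ""
    data = {
        "instruction": doc["instruction"],
        "pred": parse_bbox(response),
        "ann_id": doc["file_name"],       # assumed annotation ID field
        "bbox": doc["bbox"],              # ground truth box
        "data_type": doc["data_type"],    # e.g. "text" or "icon"
        "data_source": doc["data_source"],
    }
    # Every metric receives the same record; scoring happens at aggregation.
    return {metric: data for metric in REC_METRICS}


sample_doc = {
    "instruction": "open settings",
    "file_name": "sample_001.png",
    "bbox": [0.1, 0.1, 0.3, 0.2],
    "data_type": "icon",
    "data_source": "android",
}
processed = process_result(sample_doc, ["[0.1, 0.2, 0.3, 0.4]"])
```

Deferring the scoring to aggregation means each metric's aggregator sees the full record, including the category metadata it needs for per-platform breakdowns.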

Geometric Metrics

compute_iou(box1, box2)
Computes Intersection over Union of two bounding boxes
  • Input format: [x_min, y_min, x_max, y_max]
  • Calculates intersection area using max/min operations
  • Computes union as: area1 + area2 - intersection
  • Returns IoU ratio (0.0 to 1.0)
compute_accuracy(box1, box2, threshold=0.5)
Binary accuracy based on IoU threshold
  • Computes IoU between boxes
  • Returns True if IoU ≥ threshold, False otherwise
  • Used for threshold-based accuracy metrics
compute_center_accuracy(box1, box2)
Checks whether the center of the predicted box (box2) falls within the ground truth box (box1)
  • Calculates center point of box2: ((x_min + x_max) / 2, (y_min + y_max) / 2)
  • Returns True if that center point lies within box1 boundaries
  • More lenient than full box overlap
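The three geometric metrics above can be sketched as follows, assuming boxes in [x_min, y_min, x_max, y_max] format. The guard against a zero-area union is an addition for this example and is not confirmed by the source.

```python
def compute_iou(box1, box2):
    """Intersection over Union of two [x_min, y_min, x_max, y_max] boxes."""
    # Intersection rectangle: max of the mins, min of the maxes.
    x_left = max(box1[0], box2[0])
    y_top = max(box1[1], box2[1])
    x_right = min(box1[2], box2[2])
    y_bottom = min(box1[3], box2[3])
    # Clamp at zero so disjoint boxes yield no intersection.
    intersection = max(0.0, x_right - x_left) * max(0.0, y_bottom - y_top)

    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - intersection
    # Assumed guard: two degenerate boxes would otherwise divide by zero.
    return intersection / union if union > 0 else 0.0


def compute_accuracy(box1, box2, threshold=0.5):
    """True when the IoU reaches the threshold."""
    return compute_iou(box1, box2) >= threshold


def compute_center_accuracy(box1, box2):
    """True when the center of box2 lies inside box1."""
    center_x = (box2[0] + box2[2]) / 2
    center_y = (box2[1] + box2[3]) / 2
    return box1[0] <= center_x <= box1[2] and box1[1] <= center_y <= box1[3]


ground_truth = [0.0, 0.0, 1.0, 1.0]
prediction = [0.25, 0.25, 0.75, 0.75]
iou_value = compute_iou(ground_truth, prediction)
```

In this example the prediction covers only a quarter of the ground truth, so it fails ACC@0.5 while still passing the center check, which illustrates why the center metric is the more forgiving of the two.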

Metrics Aggregation

screenspot_rec_aggregation_result(results, metric)
Aggregates predictions and computes specified metric
  • Creates scorer functions for each metric type
  • Tracks overall scores and category-specific scores:
    • Mobile (iOS, Android) vs Desktop (macOS, Windows) vs Web
    • Text elements vs Icon elements
  • Iterates through results, applying metric scorer
  • Computes mean scores for each category
  • Prints detailed per-category results
  • Returns overall mean score
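The aggregation flow above can be sketched as below. The data source strings, the bucket key format, and all helper names are assumptions for illustration; only the metric names and the platform grouping come from this page.

```python
def iou(gt, pred):
    """IoU of two [x_min, y_min, x_max, y_max] boxes."""
    width = max(0.0, min(gt[2], pred[2]) - max(gt[0], pred[0]))
    height = max(0.0, min(gt[3], pred[3]) - max(gt[1], pred[1]))
    inter = width * height
    union = ((gt[2] - gt[0]) * (gt[3] - gt[1])
             + (pred[2] - pred[0]) * (pred[3] - pred[1]) - inter)
    return inter / union if union > 0 else 0.0


def center_hit(gt, pred):
    """True when the center of pred lies inside gt."""
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    return gt[0] <= cx <= gt[2] and gt[1] <= cy <= gt[3]


def make_scorers():
    """One scorer per metric name, each taking (ground_truth, prediction)."""
    scorers = {"IoU": iou, "Center_ACC": center_hit}
    for t in (0.1, 0.3, 0.5, 0.7, 0.9):
        # Bind t as a default argument so each lambda keeps its own threshold.
        scorers[f"ACC@{t}"] = lambda gt, pred, t=t: iou(gt, pred) >= t
    return scorers


def platform_of(data_source):
    """Assumed mapping from a data source string to a platform group."""
    source = data_source.lower()
    if "ios" in source or "android" in source:
        return "mobile"
    if "macos" in source or "windows" in source:
        return "desktop"
    return "web"


def aggregate(results, metric):
    """Hypothetical stand-in for screenspot_rec_aggregation_result."""
    scorer = make_scorers()[metric]
    buckets = {metric: []}
    for record in results:
        score = float(scorer(record["bbox"], record["pred"]))
        key = f"{metric}-{platform_of(record['data_source'])}_{record['data_type']}"
        buckets[metric].append(score)
        buckets.setdefault(key, []).append(score)
    means = {k: (sum(v) / len(v) if v else 0.0) for k, v in buckets.items()}
    for name, value in means.items():
        print(f"{name}: {value:.4f}")
    return means[metric]


sample_results = [
    {"bbox": [0.0, 0.0, 1.0, 1.0], "pred": [0.25, 0.25, 0.75, 0.75],
     "data_type": "text", "data_source": "ios"},
    {"bbox": [0.0, 0.0, 1.0, 1.0], "pred": [0.0, 0.0, 1.0, 1.0],
     "data_type": "icon", "data_source": "windows"},
]
mean_iou = aggregate(sample_results, "IoU")
```

Boolean scores are cast to floats before averaging, so accuracy metrics and the continuous IoU metric share one code path.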

Metric-Specific Functions

screenspot_rec_iou(results)
Average Intersection over Union score
  • Measures continuous overlap quality
  • Range: 0.0 (no overlap) to 1.0 (perfect match)
screenspot_rec_acc01(results) through screenspot_rec_acc09(results)
Accuracy at different IoU thresholds
  • ACC@0.1: Very lenient (IoU ≥ 0.1)
  • ACC@0.3: Lenient (IoU ≥ 0.3)
  • ACC@0.5: Standard (IoU ≥ 0.5)
  • ACC@0.7: Strict (IoU ≥ 0.7)
  • ACC@0.9: Very strict (IoU ≥ 0.9)
screenspot_rec_center_acc(results)
Center point accuracy
  • Measures if predicted box's center is within ground truth box
  • Useful for click-based interactions
  • More forgiving than full box IoU

Configuration

Active Metrics

REC_METRICS = ["IoU", "ACC@0.1", "ACC@0.3", "ACC@0.5",
               "ACC@0.7", "ACC@0.9", "Center_ACC"]

Category Breakdown

For each metric, scores are computed for:

  • Overall performance
  • Mobile text / Mobile icon
  • Web text / Web icon
  • Desktop text / Desktop icon

Design Characteristics

  • Multi-Threshold Evaluation: Tests model performance at various precision levels
  • Category-Specific Analysis: Breaks down performance by platform and element type
  • Robust Parsing: Regex-based extraction finds a bracketed list of four numbers in the response, tolerating surrounding text and optional whitespace
  • Lenient Fallback: Returns zero bbox when parsing fails (rather than crashing)
  • Center-Based Metric: Includes click-oriented metric useful for GUI automation
  • Comprehensive Reporting: Prints per-category scores during aggregation

Dependencies

  • re - Regular expressions for coordinate extraction

Usage Context

This module supports the ScreenSpot REC task where models must locate GUI elements given natural language instructions. It evaluates spatial understanding across different platforms (mobile, web, desktop) and element types (text, icons), using multiple metrics to assess localization accuracy at various precision levels.
