Implementation:EvolvingLMMs Lab Lmms eval ScreenSpot Utils
Source File: `lmms_eval/tasks/screenspot/utils.py`
Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]
Overview
The ScreenSpot Utils module provides functions for evaluating GUI element captioning tasks. Given an image with a highlighted bounding box, models generate descriptions of the UI element. Evaluation uses COCO captioning metrics (primarily CIDEr) to compare generated descriptions against reference instructions.
Key Functions
Document Processing
`screenspot_bbox_doc_to_visual(doc)` - Prepares the image with a highlighted bounding box
- Extracts bounding box coordinates from document
- Converts image to RGB format
- Draws red rectangle (width=3) around the target UI element
- Returns list containing the annotated image
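The steps above can be sketched roughly as follows. This is a minimal illustration and not the module's actual code: the function name `bbox_doc_to_visual_sketch`, the `doc["image"]` and `doc["bbox"]` field names, and the pixel-space `[x_min, y_min, x_max, y_max]` layout are all assumptions.

```python
from PIL import Image, ImageDraw


def bbox_doc_to_visual_sketch(doc):
    """Return a one-element list holding the image with the target box outlined."""
    # Assumed layout: bbox is [x_min, y_min, x_max, y_max] in pixel coordinates.
    x0, y0, x1, y1 = doc["bbox"]
    image = doc["image"].convert("RGB")  # work on an RGB version of the image
    draw = ImageDraw.Draw(image)
    draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
    return [image]


# Hypothetical document: a blank grayscale screenshot and one target box.
example_doc = {"image": Image.new("L", (100, 80), 255), "bbox": [10, 20, 60, 50]}
visuals = bbox_doc_to_visual_sketch(example_doc)
```

Returning a list keeps the output shape consistent with tasks that supply more than one image per document.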
`screenspot_doc_to_text(doc)` - Generates the question prompt
- Formats bounding box coordinates to 2 decimal places
- Creates instruction to describe the highlighted region
- Returns formatted string with bbox coordinates
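A possible shape for this prompt builder is sketched here. The prompt wording, the function name `doc_to_text_sketch`, and the `doc["bbox"]` field are assumptions made for illustration; only the two-decimal formatting comes from the description above.

```python
def doc_to_text_sketch(doc):
    """Build a prompt that embeds the bounding box rounded to two decimals."""
    # Assumed field: doc["bbox"] holds four numeric coordinates.
    bbox_str = ", ".join(f"{value:.2f}" for value in doc["bbox"])
    # Hypothetical wording; the real prompt text may differ.
    return f"Describe the UI element inside the highlighted region [{bbox_str}]."


prompt = doc_to_text_sketch({"bbox": [0.1234, 0.5, 0.6789, 0.9]})
```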
Results Processing
`screenspot_process_result(doc, result)` - Processes the model prediction for a single instance
- Extracts prediction from result list (empty string if no result)
- Collects metadata: annotation ID, instruction, data type, data source
- Returns dictionary mapping each metric name to the data dictionary
- Enables per-metric and per-category analysis
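The per-instance processing described above might look like the following sketch. The document field names (`file_name`, `instruction`, `data_type`, `data_source`), the `screenspot_<metric>` key pattern, and the function name are assumptions, not confirmed details of the module.

```python
COCO_METRICS = ["CIDEr"]


def process_result_sketch(doc, result):
    """Package one prediction with its metadata, keyed by metric name."""
    # Fall back to an empty string when the model returned nothing.
    pred = result[0] if len(result) > 0 else ""
    data = {
        "ann_id": doc["file_name"],        # assumed source of the annotation ID
        "instruction": doc["instruction"],  # reference text
        "pred": pred,
        "data_type": doc["data_type"],
        "data_source": doc["data_source"],
    }
    # Every active metric receives the same data dictionary.
    return {f"screenspot_{metric}": data for metric in COCO_METRICS}


example_doc = {
    "file_name": "sample_001.png",
    "instruction": "open the settings menu",
    "data_type": "icon",
    "data_source": "android",
}
empty_case = process_result_sketch(example_doc, [])
filled_case = process_result_sketch(example_doc, ["tap the gear icon"])
```

Sharing one dictionary across metrics keeps the metadata available to whichever aggregation function later receives it.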
Metrics Aggregation
`screenspot_aggregation_result(results, metric)` - Aggregates predictions and computes the specified metric
- Creates COCO-format dataset structure:
  - "annotations" list with ground truth instructions
  - "images" list with image IDs
- Builds results list with predictions
- Uses COCO evaluation tools:
  - `COCO()` for ground truth
  - `coco.loadRes()` for predictions
  - `COCOEvalCap` for metric computation
- Tokenizes using PTBTokenizer
- Handles Bleu metrics (returns specific n-gram score)
- Returns scalar score for the metric
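The data preparation half of the aggregation can be illustrated without the COCO tooling. The sketch below only builds the two structures; the helper name `build_coco_inputs` and the `instruction`/`pred` keys on each result are assumptions. In the module, structures like these would then be handed to `COCO()`, `coco.loadRes()`, and `COCOEvalCap`.

```python
def build_coco_inputs(results):
    """Turn per-instance results into a COCO-style dataset and prediction list."""
    dataset = {"annotations": [], "images": []}
    predictions = []
    for idx, item in enumerate(results):
        # One image entry and one ground-truth caption per instance.
        dataset["images"].append({"id": idx})
        dataset["annotations"].append(
            {"image_id": idx, "id": idx, "caption": item["instruction"]}
        )
        # Predictions reference the same image ID as their ground truth.
        predictions.append({"image_id": idx, "caption": item["pred"]})
    return dataset, predictions


sample_results = [
    {"instruction": "open the settings menu", "pred": "tap the gear icon"},
    {"instruction": "close the dialog window", "pred": "close the dialog"},
]
dataset, predictions = build_coco_inputs(sample_results)
```

Using the running index as both image ID and annotation ID is a simplification that works here because each instance has exactly one reference.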
Metric-Specific Functions
`screenspot_cider(results)` - Computes the CIDEr (Consensus-based Image Description Evaluation) score
- Primary metric for ScreenSpot evaluation
- Measures consensus with human references using TF-IDF weighting
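To give an intuition for the TF-IDF weighting, here is a deliberately simplified, unigram-only sketch. It is not the `pycocoevalcap` implementation: the real metric combines n-gram orders 1 to 4 and handles multiple references, and all names below are illustrative.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Weight each token by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {
        tok: (count / total) * math.log(num_docs / doc_freq.get(tok, 1))
        for tok, count in counts.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm > 0 else 0.0


# Toy reference corpus; document frequency counts references containing a token.
references = ["click the search button", "open the settings menu", "close the dialog window"]
ref_tokens = [ref.split() for ref in references]
doc_freq = Counter(tok for toks in ref_tokens for tok in set(toks))
num_docs = len(references)


def unigram_consensus(candidate, ref_index):
    cand_vec = tfidf_vector(candidate.split(), doc_freq, num_docs)
    ref_vec = tfidf_vector(ref_tokens[ref_index], doc_freq, num_docs)
    return cosine(cand_vec, ref_vec)


exact = unigram_consensus("click the search button", 0)
partial = unigram_consensus("click the button", 0)
only_common = unigram_consensus("open the settings menu", 0)
```

In this toy corpus "the" occurs in every reference, so its weight is zero and a candidate that shares only that word with the reference scores 0.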
`screenspot_bleu1(results)` through `screenspot_bleu4(results)` - Compute BLEU scores at different n-gram levels
- BLEU-1: unigram precision
- BLEU-2: combines unigram and bigram precision
- BLEU-3: combines n-gram precision up to trigrams
- BLEU-4: combines n-gram precision up to 4-grams
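The building block behind each BLEU level is clipped n-gram precision, sketched below for a single reference. This is a simplified illustration with hypothetical helper names; it omits the brevity penalty and the combination across n-gram orders that a full BLEU score applies.

```python
from collections import Counter


def ngrams(tokens, n):
    """List all contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the reference.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


p1 = clipped_precision("click the red button", "click the blue button", 1)
p2 = clipped_precision("click the red button", "click the blue button", 2)
```

Here three of four unigrams match, while only one of three bigrams does, which shows why higher-order scores drop quickly for short UI descriptions.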
`screenspot_meteor(results)` - Computes METEOR (Metric for Evaluation of Translation with Explicit ORdering)
- Considers synonyms and stemming
- Balances precision and recall
`screenspot_rougel(results)` - Computes ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation)
- Based on longest common subsequence
- Focuses on recall of reference content
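The longest common subsequence idea can be sketched in a few lines. This illustration uses hypothetical helper names and reports plain LCS-based recall and precision; it does not reproduce the weighted F-measure that a full ROUGE-L implementation returns.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall_precision(candidate, reference):
    """Return (recall, precision) based on the LCS of whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0, 0.0
    lcs = lcs_length(cand, ref)
    return lcs / len(ref), lcs / len(cand)


recall, precision = lcs_recall_precision(
    "click the big red button", "click the blue button"
)
```

Because a subsequence need not be contiguous, "click the ... button" still counts as a three-token match even though the words in between differ.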
`screenspot_spice(results)` - Computes SPICE (Semantic Propositional Image Caption Evaluation)
- Uses scene graph matching
- Currently commented out in COCO_METRICS
Configuration
Active Metrics
COCO_METRICS = ["CIDEr"]
The module defines CIDEr as the primary evaluation metric. Other metrics (BLEU, METEOR, ROUGE-L, SPICE) are implemented but not actively used in the default configuration.
Design Characteristics
- Visual Annotation: Draws bounding boxes directly on images to highlight target elements
- Standard Metrics: Uses established COCO captioning evaluation framework
- Flexible Evaluation: Supports multiple metrics while focusing on CIDEr
- Metadata Preservation: Tracks data source and type for detailed analysis
- Dual Index System: Creates proper COCO index structure for evaluation tools
Dependencies
- `PIL.ImageDraw` - Drawing bounding boxes on images
- `pycocoevalcap.eval` - COCO evaluation metrics (Bleu, Cider, Meteor, Rouge)
- `pycocoevalcap.tokenizer.ptbtokenizer.PTBTokenizer` - Text tokenization
- `pycocotools.coco.COCO` - COCO dataset handling
- `loguru.logger` - Logging progress information
Usage Context
This module supports the ScreenSpot benchmark for GUI understanding. It evaluates models' ability to generate natural language descriptions of highlighted UI elements, measuring how well the descriptions match human-written instructions for interacting with those elements.