Implementation:EvolvingLMMs Lab Lmms eval ScreenSpot Utils

Source File: `lmms_eval/tasks/screenspot/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The ScreenSpot Utils module provides functions for evaluating GUI element captioning tasks. Given an image with a highlighted bounding box, models generate descriptions of the UI element. Evaluation uses COCO captioning metrics (primarily CIDEr) to compare generated descriptions against reference instructions.

Key Functions

Document Processing

screenspot_bbox_doc_to_visual(doc)

Prepares image with highlighted bounding box

Extracts bounding box coordinates from document
Converts image to RGB format
Draws red rectangle (width=3) around the target UI element
Returns list containing the annotated image
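The steps above can be sketched as follows. This is a minimal illustration, not the module's code: the helper name `draw_bbox`, the field names `image` and `bbox`, and the pixel-coordinate layout `[x0, y0, x1, y1]` are assumptions made for the example.

```python
from PIL import Image, ImageDraw


def draw_bbox(doc):
    """Return a one-element list holding the image with the bbox outlined in red."""
    # Assumed layout: bbox = [x0, y0, x1, y1] in pixel coordinates.
    x0, y0, x1, y1 = doc["bbox"]
    # convert() returns a new image, so the original in doc is left untouched.
    image = doc["image"].convert("RGB")
    draw = ImageDraw.Draw(image)
    draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
    return [image]


# Hypothetical sample: a white grayscale screenshot stand-in with one target region.
sample_doc = {"image": Image.new("L", (100, 80), color=255), "bbox": [10, 10, 50, 40]}
annotated = draw_bbox(sample_doc)[0]
```

Drawing on a converted copy keeps the dataset's image unmodified, which matters if the same document is visited more than once.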

screenspot_doc_to_text(doc)

Generates the question prompt

Formats bounding box coordinates to 2 decimal places
Creates instruction to describe the highlighted region
Returns formatted string with bbox coordinates
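A minimal sketch of the formatting step, assuming the bounding box is stored under a `bbox` key. The helper name `bbox_prompt` and the prompt wording are hypothetical; the actual text in `utils.py` may differ.

```python
def bbox_prompt(doc):
    """Build a captioning prompt that embeds the bbox rounded to 2 decimals."""
    # Hypothetical wording, used only to show the coordinate formatting.
    coords = ", ".join(f"{value:.2f}" for value in doc["bbox"])
    return f"Describe the UI element highlighted by the bounding box [{coords}]."


prompt = bbox_prompt({"bbox": [0.1234, 0.5, 0.98765, 0.75]})
```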

Results Processing

screenspot_process_result(doc, result)

Processes model prediction for a single instance

Extracts prediction from result list (empty string if no result)
Collects metadata: annotation ID, instruction, data type, data source
Returns dictionary mapping each metric name to the data dictionary
Enables per-metric and per-category analysis
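The behavior described above might look like the sketch below. The document field names (`file_name`, `instruction`, `data_type`, `data_source`) and the `screenspot_<metric>` key pattern are assumptions for illustration, as is the helper name `process_result`.

```python
COCO_METRICS = ["CIDEr"]


def process_result(doc, result):
    """Package one prediction plus metadata under every active metric name."""
    # Fall back to an empty string when the model returned nothing.
    pred = result[0] if len(result) > 0 else ""
    data = {
        "ann_id": doc["file_name"],  # assumed field name
        "instruction": doc["instruction"],
        "pred": pred,
        "data_type": doc["data_type"],
        "data_source": doc["data_source"],
    }
    # Every metric receives the same payload; aggregation happens later.
    return {f"screenspot_{metric}": data for metric in COCO_METRICS}


sample_doc = {
    "file_name": "shot_001.png",
    "instruction": "open the settings menu",
    "data_type": "icon",
    "data_source": "android",
}
scored = process_result(sample_doc, ["Settings gear icon"])
empty = process_result(sample_doc, [])
```

Sharing one dictionary across metrics keeps the per-instance record small while still letting each metric's aggregation function see the full metadata.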

Metrics Aggregation

screenspot_aggregation_result(results, metric)

Aggregates predictions and computes specified metric

Creates COCO-format dataset structure:
- "annotations" list with ground truth instructions
- "images" list with image IDs
Builds results list with predictions
Uses COCO evaluation tools:
- COCO() for ground truth
- coco.loadRes() for predictions
- COCOEvalCap for metric computation
Tokenizes using PTBTokenizer
Handles Bleu metrics: the Bleu scorer returns a list of scores (one per n-gram order), and the function selects the entry for the requested n
Returns scalar score for the metric
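The data-preparation part of this flow can be sketched without the COCO libraries. The helpers `build_coco_inputs` and `select_score`, and metric names of the form `Bleu_4`, are assumptions for illustration. In the real flow the `dataset` dictionary would be handed to a `COCO()` object and the prediction list to `coco.loadRes()`; those calls are omitted here so the sketch stays dependency-free.

```python
def build_coco_inputs(results):
    """Turn per-instance results into a COCO-style dataset and prediction list."""
    dataset = {"annotations": [], "images": []}
    predictions = []
    for idx, item in enumerate(results):
        # One image entry and one reference caption per instance.
        dataset["images"].append({"id": idx})
        dataset["annotations"].append(
            {"image_id": idx, "caption": item["instruction"], "id": idx}
        )
        predictions.append({"image_id": idx, "caption": item["pred"]})
    return dataset, predictions


def select_score(score, metric):
    """Pick a scalar from a scorer output that may be a list (as with Bleu)."""
    if isinstance(score, list):
        n = int(metric.split("_")[-1])  # e.g. "Bleu_4" -> 4
        return score[n - 1]
    return score


dataset, predictions = build_coco_inputs(
    [
        {"instruction": "open the settings menu", "pred": "settings icon"},
        {"instruction": "close the dialog", "pred": "close button"},
    ]
)
```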

Metric-Specific Functions

screenspot_cider(results)

Computes CIDEr (Consensus-based Image Description Evaluation) score

Primary metric for ScreenSpot evaluation
Measures consensus with human references using TF-IDF weighting

screenspot_bleu1(results) through screenspot_bleu4(results)

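Compute BLEU scores at different maximum n-gram orders. Each score is cumulative: BLEU-n combines the clipped n-gram precisions for orders 1 through n and applies a brevity penalty.

BLEU-1: unigrams only
BLEU-2: unigrams and bigrams
BLEU-3: up to trigrams
BLEU-4: up to 4-grams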
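To make the n-gram levels concrete, the sketch below computes the clipped (modified) n-gram precision that BLEU is built from, for a single candidate and a single reference. It is a simplified illustration only: it omits the brevity penalty and the geometric mean over orders, and it is not the pycocoevalcap implementation. The helper names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


candidate = "click the the button".split()
reference = "click the button".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
```

Clipping is what stops a caption from scoring well by repeating a common word: the second "the" in the candidate earns no credit.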

screenspot_meteor(results)

Computes METEOR (Metric for Evaluation of Translation with Explicit ORdering)

Considers synonyms and stemming
Balances precision and recall

screenspot_rougel(results)

Computes ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation)

Based on longest common subsequence
Focuses on recall of reference content
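The longest-common-subsequence idea can be illustrated with a small pure-Python sketch for one candidate and one reference. The helper names are hypothetical, and the recall weighting `beta=1.2` is an assumption chosen for the example (any value above 1 favors recall); this is not the library's implementation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-score; beta > 1 weights recall more than precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return ((1 + beta**2) * precision * recall) / (recall + beta**2 * precision)


candidate = "click the blue button".split()
reference = "click the submit button".split()
shared = lcs_length(candidate, reference)
score = rouge_l(candidate, reference)
```

Because a subsequence need not be contiguous, "click the ... button" still matches in order even though the third word differs.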

screenspot_spice(results)

Computes SPICE (Semantic Propositional Image Caption Evaluation)

Uses scene graph matching
Currently disabled (commented out in the source) and not listed in COCO_METRICS

Configuration

Active Metrics

COCO_METRICS = ["CIDEr"]

The module defines CIDEr as the primary evaluation metric. Other metrics (BLEU, METEOR, ROUGE-L, SPICE) are implemented but not actively used in the default configuration.

Design Characteristics

Visual Annotation: Draws bounding boxes directly on images to highlight target elements
Standard Metrics: Uses established COCO captioning evaluation framework
Flexible Evaluation: Supports multiple metrics while focusing on CIDEr
Metadata Preservation: Tracks data source and type for detailed analysis
Dual Index System: Builds two COCO indexes, one for ground truth (COCO()) and one for predictions (coco.loadRes()), as the evaluation tools expect

Dependencies

PIL.ImageDraw - Drawing bounding boxes on images
pycocoevalcap.eval - COCO evaluation metrics (Bleu, Cider, Meteor, Rouge)
pycocoevalcap.tokenizer.ptbtokenizer.PTBTokenizer - Text tokenization
pycocotools.coco.COCO - COCO dataset handling
loguru.logger - Logging progress information

Usage Context

This module supports the ScreenSpot benchmark for GUI understanding. It evaluates models' ability to generate natural language descriptions of highlighted UI elements, measuring how well the descriptions match human-written instructions for interacting with those elements.
