Implementation:EvolvingLMMs Lab Lmms eval ScreenSpot Utils

From Leeroopedia

Source File: `lmms_eval/tasks/screenspot/utils.py`

Principle: [[../principles/EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions|Task_Utility_Functions]]

Overview

The ScreenSpot Utils module provides functions for evaluating GUI element captioning tasks. Given an image with a highlighted bounding box, models generate descriptions of the UI element. Evaluation uses COCO captioning metrics (primarily CIDEr) to compare generated descriptions against reference instructions.

Key Functions

Document Processing

screenspot_bbox_doc_to_visual(doc)
Prepares image with highlighted bounding box
  • Extracts bounding box coordinates from document
  • Converts image to RGB format
  • Draws red rectangle (width=3) around the target UI element
  • Returns list containing the annotated image
screenspot_doc_to_text(doc)
Generates the question prompt
  • Formats bounding box coordinates to 2 decimal places
  • Creates instruction to describe the highlighted region
  • Returns formatted string with bbox coordinates
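The prompt-formatting step can be sketched as follows. This is a minimal illustration, not the module's code: the helper name `format_bbox_prompt`, the `bbox` key layout, and the prompt wording are assumptions made for the example. Only the two-decimal formatting of the coordinates comes from the description above.

```python
def format_bbox_prompt(doc):
    """Return a prompt that embeds the bounding box, rounded to 2 decimals."""
    # Assumption: doc["bbox"] holds four numeric coordinates.
    coords = ", ".join(f"{value:.2f}" for value in doc["bbox"])
    # Hypothetical prompt wording; the module's actual text may differ.
    return f"Describe the highlighted region [{coords}]."


example_doc = {"bbox": [0.1234, 0.5, 0.98765, 1]}
prompt = format_bbox_prompt(example_doc)
print(prompt)
```

Fixing the number of decimals keeps prompts uniform across documents whose coordinates are stored with different precision.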

Results Processing

screenspot_process_result(doc, result)
Processes model prediction for a single instance
  • Extracts prediction from result list (empty string if no result)
  • Collects metadata: annotation ID, instruction, data type, data source
  • Returns a dictionary that maps each metric name to the same data dictionary
  • Carries the metadata along so scores can be broken down by metric, data type, or data source
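A rough sketch of this per-instance step is shown below. The document keys (`file_name`, `instruction`, `data_type`, `data_source`), the `screenspot_` key prefix, and the helper name `process_result` are assumptions chosen for illustration. The empty-string fallback and the one-entry-per-metric return shape follow the description above.

```python
COCO_METRICS = ["CIDEr"]


def process_result(doc, result):
    """Bundle one prediction with its metadata, keyed by metric name."""
    # Fall back to an empty string when the model returned nothing.
    pred = result[0] if len(result) > 0 else ""
    data = {
        "ann_id": doc["file_name"],        # hypothetical key for the annotation ID
        "instruction": doc["instruction"],  # reference text to compare against
        "pred": pred,
        "data_type": doc["data_type"],
        "data_source": doc["data_source"],
    }
    # Every active metric receives the same data dictionary.
    return {f"screenspot_{metric}": data for metric in COCO_METRICS}


example_doc = {
    "file_name": "img_001.png",
    "instruction": "click the search button",
    "data_type": "icon",
    "data_source": "web",
}
with_pred = process_result(example_doc, ["press the search icon"])
without_pred = process_result(example_doc, [])
```

Because each metric key points at the same dictionary, the aggregation step for any metric sees the prediction, the reference, and the metadata together.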

Metrics Aggregation

screenspot_aggregation_result(results, metric)
Aggregates predictions and computes specified metric
  • Creates COCO-format dataset structure:
    • "annotations" list with ground truth instructions
    • "images" list with image IDs
  • Builds results list with predictions
  • Uses COCO evaluation tools:
    • COCO() for ground truth
    • coco.loadRes() for predictions
    • COCOEvalCap for metric computation
  • Tokenizes using PTBTokenizer
  • Handles Bleu metrics, where the scorer returns one score per n-gram order, by selecting the score for the requested order
  • Returns scalar score for the metric
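The data preparation and score selection described above can be sketched without the COCO tooling itself. The helper names `build_coco_inputs` and `select_score`, the per-item keys (`pred`, `instruction`), and the metric naming pattern `Bleu_<n>` are assumptions for this example. The calls to COCO(), coco.loadRes(), COCOEvalCap, and PTBTokenizer are deliberately left out so the sketch runs with the standard library only.

```python
def build_coco_inputs(results):
    """Build a COCO-style ground-truth dataset and a prediction list."""
    dataset = {"annotations": [], "images": []}
    predictions = []
    for idx, item in enumerate(results):
        # One prediction per image, linked by image_id.
        predictions.append({"image_id": idx, "caption": item["pred"]})
        # One reference caption per image; the annotation id must be unique.
        dataset["annotations"].append(
            {"image_id": idx, "caption": item["instruction"], "id": idx}
        )
        # The "images" list only needs the image id.
        dataset["images"].append({"id": idx})
    return dataset, predictions


def select_score(score, metric):
    """Reduce a scorer's output to a single number."""
    # Assumption: Bleu-style scorers return a list with one score per
    # n-gram order, and the metric name ends in the order, e.g. "Bleu_3".
    if isinstance(score, list):
        n = int(metric.split("_")[-1])
        return score[n - 1]
    return score


items = [
    {"pred": "press search", "instruction": "click the search button"},
    {"pred": "open menu", "instruction": "open the main menu"},
]
dataset, predictions = build_coco_inputs(items)
```

In this sketch the `dataset` dictionary plays the role of the ground truth and `predictions` the role of the model results handed to the COCO evaluation tools.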

Metric-Specific Functions

screenspot_cider(results)
Computes CIDEr (Consensus-based Image Description Evaluation) score
  • Primary metric for ScreenSpot evaluation
  • Measures consensus with human references using TF-IDF weighting
screenspot_bleu1(results) through screenspot_bleu4(results)
Compute BLEU scores at different n-gram levels
  • BLEU-1: unigram precision
  • BLEU-2: combines unigram and bigram precision
  • BLEU-3: combines precision for n-grams up to trigrams
  • BLEU-4: combines precision for n-grams up to 4-grams
screenspot_meteor(results)
Computes METEOR (Metric for Evaluation of Translation with Explicit ORdering)
  • Considers synonyms and stemming
  • Balances precision and recall
screenspot_rougel(results)
Computes ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation)
  • Based on longest common subsequence
  • Focuses on recall of reference content
screenspot_spice(results)
Computes SPICE (Semantic Propositional Image Caption Evaluation)
  • Uses scene graph matching
  • Currently commented out in COCO_METRICS
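To make the longest-common-subsequence idea behind ROUGE-L concrete, here is a small standalone sketch. It is not the pycocoevalcap implementation: whitespace tokenization, lowercasing, and the helper names `lcs_length` and `rouge_l` are simplifications chosen for illustration, and `beta=1.2` is an assumed value that weights recall somewhat more than precision.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the subsequence ending before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the best result seen so far.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """F-measure over LCS-based precision and recall."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return ((1 + beta**2) * precision * recall) / (recall + beta**2 * precision)


partial = rouge_l("click search button", "click the search button")
```

Here the candidate drops one word of the reference, so every candidate token is matched in order (precision 1.0) while recall is 3/4, and the score falls between the two.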

Configuration

Active Metrics

COCO_METRICS = ["CIDEr"]

The module defines CIDEr as the primary evaluation metric. Other metrics (BLEU, METEOR, ROUGE-L, SPICE) are implemented but not actively used in the default configuration.

Design Characteristics

  • Visual Annotation: Draws bounding boxes directly on images to highlight target elements
  • Standard Metrics: Uses established COCO captioning evaluation framework
  • Flexible Evaluation: Supports multiple metrics while focusing on CIDEr
  • Metadata Preservation: Tracks data source and type for detailed analysis
  • Dual Index System: Builds both the "annotations" and "images" lists so that the COCO evaluation tools can create their index

Dependencies

  • PIL.ImageDraw - Drawing bounding boxes on images
  • pycocoevalcap.eval - COCO evaluation metrics (Bleu, Cider, Meteor, Rouge)
  • pycocoevalcap.tokenizer.ptbtokenizer.PTBTokenizer - Text tokenization
  • pycocotools.coco.COCO - COCO dataset handling
  • loguru.logger - Logging progress information

Usage Context

This module supports the ScreenSpot benchmark for GUI understanding. It evaluates models' ability to generate natural language descriptions of highlighted UI elements, measuring how well the descriptions match human-written instructions for interacting with those elements.
