Implementation:EvolvingLMMs Lab Lmms eval GEdit Bench Utils

Domains Computer Vision, Image Editing, Benchmark Evaluation
Last Updated 2026-02-14 00:00 GMT

Overview

Evaluation utilities for GEdit-Bench that assess image editing models using VIEScore across 11 task categories.

Description

This module provides the evaluation infrastructure for GEdit-Bench (General Editing Benchmark), which evaluates image editing models on 11 task types: background change, color alter, material alter, motion change, photoshop human, style change, subject add, subject remove, subject replace, text change, and tone transfer. It uses VIEScore (Visual Instruction-guided Explainable Score) to automatically score each edited image on three dimensions: semantics (instruction following), quality (visual fidelity), and an overall score. The module handles document-to-input conversion, image resizing for evaluation, VIEScore computation, and metric aggregation across several breakdowns (English and Chinese instructions, fullset and intersection subsets, and task types).
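As a rough illustration of how the three dimensions can relate, the VIEScore paper combines its two sub-scores into the overall score with a geometric mean. The page does not say whether this module does the same, so the sketch below is an assumption, and `viescore_overall` is a hypothetical name rather than a function from the module.

```python
import math


def viescore_overall(semantics: float, quality: float) -> float:
    """Combine two sub-scores with a geometric mean.

    Assumption: this mirrors the combination rule described in the VIEScore
    paper; the module documented on this page may compute it differently.
    """
    if semantics < 0 or quality < 0:
        raise ValueError("scores must be non-negative")
    return math.sqrt(semantics * quality)


# A strong semantics score cannot fully compensate for a weak quality score.
print(viescore_overall(8.0, 2.0))
```

With a geometric mean, a zero in either dimension forces the overall score to zero, which penalises edits that follow the instruction but look broken, and vice versa.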

Usage

Use this module when evaluating image editing models on GEdit-Bench. The workflow is: (1) call gedit_bench_doc_to_visual and gedit_bench_doc_to_text to prepare the input image and editing instruction, (2) collect each model output as a JSON string that lists the paths of the edited images, (3) call gedit_bench_process_results to compute the VIEScore metrics for each sample, and (4) call the aggregate functions to compute final scores for each breakdown. Before scoring, the module resizes images to an area equivalent to 512x512 pixels while maintaining the aspect ratio.
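Step (2) of the workflow relies on the model output being a JSON string with a "text" field and an "images" list of file paths. The sketch below shows one way to build and read such a payload; `build_model_output` and `first_image_path` are hypothetical helpers for illustration and are not part of the module.

```python
import json


def build_model_output(text, image_paths):
    """Serialise a model response into the JSON string format shown on this page."""
    return json.dumps({"text": text, "images": list(image_paths)})


def first_image_path(results):
    """Return the first edited-image path from a results list, or None.

    Assumption: `results` is a list whose first element is the JSON string;
    malformed or empty payloads are treated as "no image".
    """
    if not results:
        return None
    try:
        payload = json.loads(results[0])
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(payload, dict):
        return None
    images = payload.get("images") or []
    return images[0] if images else None


results = [build_model_output("Image edited successfully",
                              ["./output/gedit_bench/sample_001.png"])]
print(first_image_path(results))
```

Returning None for a malformed payload is a design choice of this sketch; how the real module reacts to a missing or unreadable image is not described here.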

Code Reference

Source Location

Signature

# Document conversion functions
def gedit_bench_doc_to_visual(doc):
    """Extract input image from document."""

def gedit_bench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Extract instruction text with optional pre/post prompts."""

def gedit_bench_doc_to_target(doc):
    """Extract target instruction for reference."""

# Result processing functions
def gedit_bench_process_results(doc, results, **kwargs):
    """Process model predictions and evaluate using VIEScore."""

def gedit_bench_aggregate_results(results):
    """Aggregate overall results with detailed breakdowns."""

# Language and subset specific aggregations
def gedit_bench_aggregate_en_fullset_semantics(results):
    """Aggregate English fullset semantics scores."""

def gedit_bench_aggregate_en_fullset_quality(results):
    """Aggregate English fullset quality scores."""

def gedit_bench_aggregate_en_fullset_overall(results):
    """Aggregate English fullset overall scores."""

def gedit_bench_aggregate_en_intersection_semantics(results):
    """Aggregate English intersection semantics scores."""

def gedit_bench_aggregate_en_intersection_quality(results):
    """Aggregate English intersection quality scores."""

def gedit_bench_aggregate_en_intersection_overall(results):
    """Aggregate English intersection overall scores."""

def gedit_bench_aggregate_cn_fullset_semantics(results):
    """Aggregate Chinese fullset semantics scores."""

def gedit_bench_aggregate_cn_fullset_quality(results):
    """Aggregate Chinese fullset quality scores."""

def gedit_bench_aggregate_cn_fullset_overall(results):
    """Aggregate Chinese fullset overall scores."""

def gedit_bench_aggregate_cn_intersection_semantics(results):
    """Aggregate Chinese intersection semantics scores."""

def gedit_bench_aggregate_cn_intersection_quality(results):
    """Aggregate Chinese intersection quality scores."""

def gedit_bench_aggregate_cn_intersection_overall(results):
    """Aggregate Chinese intersection overall scores."""

# Helper functions
def calculate_dimensions(target_area, ratio):
    """Calculate dimensions maintaining aspect ratio."""

def _aggregate_by_filter(results, language=None, intersection_only=None):
    """Helper to aggregate scores with language and subset filters."""

Import

from lmms_eval.tasks.gedit_bench.utils import (
    gedit_bench_doc_to_visual,
    gedit_bench_doc_to_text,
    gedit_bench_process_results,
    gedit_bench_aggregate_results,
    gedit_bench_aggregate_en_fullset_overall,
    gedit_bench_aggregate_en_intersection_semantics,
    gedit_bench_aggregate_cn_fullset_overall,
    gedit_bench_aggregate_cn_intersection_semantics,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document with input_image, instruction, key, task_type, instruction_language, Intersection_exist |
| results | list | Yes | List with a single JSON string containing {"text": "...", "images": ["path.png"]} |
| lmms_eval_specific_kwargs | dict | No | Optional pre_prompt and post_prompt for instruction formatting |
| target_area | int | Yes | Target pixel area for resizing (typically 512*512) |
| ratio | float | Yes | Aspect ratio (width/height) of the image |
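The target_area and ratio inputs above are enough to derive a width and height with the requested aspect ratio and roughly the requested area. The following is a minimal sketch of that arithmetic, not the module's code: `calculate_dimensions_sketch` is a hypothetical name, and rounding to the nearest integer is an assumption (the real function may truncate instead).

```python
import math


def calculate_dimensions_sketch(target_area, ratio):
    """Return (width, height, new_area) for a given area and width/height ratio.

    From width * height = target_area and width / height = ratio it follows
    that width = sqrt(target_area * ratio) and height = width / ratio.
    Assumption: values are rounded to the nearest integer.
    """
    if target_area <= 0 or ratio <= 0:
        raise ValueError("target_area and ratio must be positive")
    width = math.sqrt(target_area * ratio)
    height = width / ratio
    return round(width), round(height), round(width * height)


width, height, area = calculate_dimensions_sketch(512 * 512, 16 / 9)
print(width, height, area)
```

For a square image (ratio 1.0) this reduces to 512 x 512; for other ratios the rounded width and height only approximate the target area.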

Outputs

| Name | Type | Description |
|---|---|---|
| doc_to_visual return | list | List with a single PIL Image in RGB format |
| doc_to_text return | str | Formatted instruction text |
| process_results return | dict | Dict with 15 metric keys (overall + language/subset breakdowns) |
| aggregate functions return | float | Score averaged across filtered samples |
| calculate_dimensions return | tuple | (width, height, new_area) maintaining aspect ratio |
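The aggregate functions are described as averaging scores across filtered samples. A minimal sketch of such a filter-then-average step is shown below. It assumes each result carries instruction_language, intersection_exist, and score fields, that "fullset" means no intersection filter, and that an empty selection yields 0.0; `aggregate_by_filter_sketch` is a hypothetical stand-in, not the module's `_aggregate_by_filter`.

```python
def aggregate_by_filter_sketch(results, language=None, intersection_only=None):
    """Average the "score" field over results that pass the given filters.

    Assumptions: language=None keeps every language, a falsy
    intersection_only keeps the full set, and an empty selection returns 0.0.
    """
    scores = []
    for item in results:
        if language is not None and item.get("instruction_language") != language:
            continue
        if intersection_only and not item.get("intersection_exist", False):
            continue
        scores.append(float(item["score"]))
    return sum(scores) / len(scores) if scores else 0.0


sample = [
    {"instruction_language": "en", "intersection_exist": True, "score": 0.85},
    {"instruction_language": "en", "intersection_exist": True, "score": 0.90},
    {"instruction_language": "cn", "intersection_exist": False, "score": 0.75},
]
print(aggregate_by_filter_sketch(sample, language="en", intersection_only=True))
print(aggregate_by_filter_sketch(sample, language="cn"))
```

Under these assumptions, each of the twelve language and subset functions could be a thin wrapper that fixes the language and intersection_only arguments.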

Usage Examples

# Example 1: Document processing
from PIL import Image

doc = {
    "key": "sample_001",
    "input_image": Image.open("input.jpg"),  # source image to be edited
    "instruction": "Change the background to a beach scene",
    "task_type": "background_change",
    "instruction_language": "en",
    "Intersection_exist": True
}

# Get visual input
visual = gedit_bench_doc_to_visual(doc)
# Returns: [<PIL.Image.Image in RGB>]

# Get instruction text
text = gedit_bench_doc_to_text(doc, {
    "pre_prompt": "Edit the image: ",
    "post_prompt": ""
})
# Returns: "Edit the image: Change the background to a beach scene"

# Example 2: Process model predictions with VIEScore
import json

# Model output (JSON format)
model_output = json.dumps({
    "text": "Image edited successfully",
    "images": ["./output/gedit_bench/sample_001.png"]
})

results = [model_output]
processed = gedit_bench_process_results(doc, results)
# Returns dict with 15 keys:
# - gedit_bench_semantics_score
# - gedit_bench_quality_score
# - gedit_bench_overall_score
# - gedit_bench_en_fullset_semantics
# - gedit_bench_en_fullset_quality
# - gedit_bench_en_fullset_overall
# - gedit_bench_en_intersection_semantics
# - gedit_bench_en_intersection_quality
# - gedit_bench_en_intersection_overall
# - gedit_bench_cn_fullset_semantics
# - gedit_bench_cn_fullset_quality
# - gedit_bench_cn_fullset_overall
# - gedit_bench_cn_intersection_semantics
# - gedit_bench_cn_intersection_quality
# - gedit_bench_cn_intersection_overall

# Each value contains: {key, task_type, instruction_language, intersection_exist, score}

# Example 3: Calculate dimensions for resizing
from lmms_eval.tasks.gedit_bench.utils import calculate_dimensions

width, height, area = calculate_dimensions(
    target_area=512 * 512,  # 262144 pixels
    ratio=16/9  # Aspect ratio
)
# Returns a (width, height, new_area) tuple: roughly 682 x 384 for 16:9,
# with an area close to 262144; exact values depend on the function's rounding

# Example 4: Aggregate results
results_list = [
    {"key": "s1", "task_type": "background_change", "instruction_language": "en",
     "intersection_exist": True, "score": 0.85},
    {"key": "s2", "task_type": "color_alter", "instruction_language": "en",
     "intersection_exist": True, "score": 0.90},
    {"key": "s3", "task_type": "style_change", "instruction_language": "cn",
     "intersection_exist": False, "score": 0.75}
]

# Overall aggregation
overall_score = gedit_bench_aggregate_results(results_list)
# Returns: 0.8333 (average of all scores)
# Also logs breakdown by task_type, language, and intersection status

# English intersection only
en_intersection_score = gedit_bench_aggregate_en_intersection_semantics(results_list)
# Returns: 0.875 (average of s1 and s2 only)

# Chinese fullset only
cn_fullset_score = gedit_bench_aggregate_cn_fullset_overall(results_list)
# Returns: 0.75 (only s3)
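The overall aggregation is described as also logging breakdowns by task type, language, and intersection status. A grouped mean of that kind can be sketched as follows; `mean_score_by` is a hypothetical helper, and the field names are taken from the result dicts in Example 4.

```python
from collections import defaultdict


def mean_score_by(results, field):
    """Group results by `field` and return the mean score per group."""
    buckets = defaultdict(list)
    for item in results:
        buckets[item.get(field)].append(float(item["score"]))
    return {key: sum(values) / len(values) for key, values in buckets.items()}


results_list = [
    {"key": "s1", "task_type": "background_change", "instruction_language": "en",
     "intersection_exist": True, "score": 0.85},
    {"key": "s2", "task_type": "color_alter", "instruction_language": "en",
     "intersection_exist": True, "score": 0.90},
    {"key": "s3", "task_type": "style_change", "instruction_language": "cn",
     "intersection_exist": False, "score": 0.75},
]

for field in ("task_type", "instruction_language", "intersection_exist"):
    print(field, mean_score_by(results_list, field))
```

Reporting per-group means alongside the overall average makes it easier to see whether a model's score is driven by a few task types or by one instruction language.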
