Implementation:EvolvingLMMs Lab Lmms eval GEdit Bench Utils

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Computer Vision, Image Editing, Benchmark Evaluation
Last Updated	2026-02-14 00:00 GMT

Overview

Evaluation utilities for GEdit-Bench that assess image editing models using VIEScore across 11 task categories.

Description

This module provides the evaluation infrastructure for GEdit-Bench (General Editing Benchmark), which evaluates image editing models on 11 task types: background change, color alter, material alter, motion change, photoshop human, style change, subject add, subject remove, subject replace, text change, and tone transfer. It uses VIEScore (Visual Instruction-guided Explainable Score) to automatically rate each edited image on semantics (how well the edit follows the instruction) and quality (visual fidelity), and reports an overall score alongside them. The module handles document-to-input conversion, image resizing for evaluation, VIEScore computation, and metric aggregation across several breakdowns: language (English/Chinese), subset (fullset/intersection), and task type.

Usage

Use this module when evaluating image editing models on GEdit-Bench. The workflow has four steps: (1) call gedit_bench_doc_to_visual and gedit_bench_doc_to_text to prepare the input image and editing instruction, (2) collect each model output as a JSON string that lists the paths of the edited images, (3) call gedit_bench_process_results to compute VIEScore metrics, and (4) call the aggregate functions to compute final scores for each breakdown. For consistent evaluation, the module resizes images to an area approximately equal to 512x512 pixels while preserving their aspect ratio.
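
The resizing step can be illustrated with a minimal sketch. It assumes the dimensions follow from the two constraints width * height = target_area and width / height = ratio; the helper name sketch_dimensions and its use of round() are illustrative assumptions, not the code in utils.py, which may truncate instead and so differ by a pixel.

```python
import math
from typing import Tuple


def sketch_dimensions(target_area: int, ratio: float) -> Tuple[int, int, int]:
    """Hypothetical stand-in for calculate_dimensions (rounding rule assumed)."""
    # Solving width * height = target_area with width / height = ratio
    # gives width = sqrt(target_area * ratio).
    width = math.sqrt(target_area * ratio)
    height = width / ratio
    w, h = round(width), round(height)
    # Report the area of the integer-sized image, not the ideal area.
    return w, h, w * h


w, h, area = sketch_dimensions(512 * 512, 16 / 9)
print(w, h, area)  # 683 384 262272
```

The resulting area stays within a fraction of a percent of 512 * 512 while the 16:9 shape is preserved.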

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/gedit_bench/utils.py
Lines: 1-384

Signature

# Document conversion functions
def gedit_bench_doc_to_visual(doc):
    """Extract input image from document."""

def gedit_bench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Extract instruction text with optional pre/post prompts."""

def gedit_bench_doc_to_target(doc):
    """Extract target instruction for reference."""

# Result processing functions
def gedit_bench_process_results(doc, results, **kwargs):
    """Process model predictions and evaluate using VIEScore."""

def gedit_bench_aggregate_results(results):
    """Aggregate overall results with detailed breakdowns."""

# Language and subset specific aggregations
def gedit_bench_aggregate_en_fullset_semantics(results):
    """Aggregate English fullset semantics scores."""

def gedit_bench_aggregate_en_fullset_quality(results):
    """Aggregate English fullset quality scores."""

def gedit_bench_aggregate_en_fullset_overall(results):
    """Aggregate English fullset overall scores."""

def gedit_bench_aggregate_en_intersection_semantics(results):
    """Aggregate English intersection semantics scores."""

def gedit_bench_aggregate_en_intersection_quality(results):
    """Aggregate English intersection quality scores."""

def gedit_bench_aggregate_en_intersection_overall(results):
    """Aggregate English intersection overall scores."""

def gedit_bench_aggregate_cn_fullset_semantics(results):
    """Aggregate Chinese fullset semantics scores."""

def gedit_bench_aggregate_cn_fullset_quality(results):
    """Aggregate Chinese fullset quality scores."""

def gedit_bench_aggregate_cn_fullset_overall(results):
    """Aggregate Chinese fullset overall scores."""

def gedit_bench_aggregate_cn_intersection_semantics(results):
    """Aggregate Chinese intersection semantics scores."""

def gedit_bench_aggregate_cn_intersection_quality(results):
    """Aggregate Chinese intersection quality scores."""

def gedit_bench_aggregate_cn_intersection_overall(results):
    """Aggregate Chinese intersection overall scores."""

# Helper functions
def calculate_dimensions(target_area, ratio):
    """Calculate dimensions maintaining aspect ratio."""

def _aggregate_by_filter(results, language=None, intersection_only=None):
    """Helper to aggregate scores with language and subset filters."""

Import

from lmms_eval.tasks.gedit_bench.utils import (
    gedit_bench_doc_to_visual,
    gedit_bench_doc_to_text,
    gedit_bench_process_results,
    gedit_bench_aggregate_results,
    gedit_bench_aggregate_en_fullset_overall,
    gedit_bench_aggregate_cn_intersection_semantics
)

I/O Contract

Inputs

Name	Type	Required	Description
doc	dict	Yes	Document with input_image, instruction, key, task_type, instruction_language, Intersection_exist
results	list	Yes	List with single JSON string containing {"text": "...", "images": ["path.png"]}
lmms_eval_specific_kwargs	dict	No	Optional pre_prompt and post_prompt for instruction formatting
target_area	int	Yes	Target pixel area for resizing (typically 512*512)
ratio	float	Yes	Aspect ratio (width/height) of image
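
As a rough illustration of the two text-based inputs in this table, the sketch below wraps an instruction with the optional prompts and extracts an image path from the single JSON string in results. Both helper names (wrap_instruction, first_image_path) are hypothetical, and the fallback behavior for missing prompts or an empty image list is an assumption rather than the module's documented behavior.

```python
import json


def wrap_instruction(instruction, lmms_eval_specific_kwargs=None):
    """Hypothetical helper: surround the instruction with optional prompts."""
    kwargs = lmms_eval_specific_kwargs or {}
    # Missing prompts are assumed to default to empty strings.
    return f"{kwargs.get('pre_prompt', '')}{instruction}{kwargs.get('post_prompt', '')}"


def first_image_path(results):
    """Hypothetical helper: read the first image path from results[0]."""
    payload = json.loads(results[0])
    images = payload.get("images") or []
    # Returning None for an empty list is an assumption of this sketch.
    return images[0] if images else None


prompt = wrap_instruction(
    "Change the background to a beach scene",
    {"pre_prompt": "Edit the image: ", "post_prompt": ""},
)
path = first_image_path([json.dumps({"text": "...", "images": ["path.png"]})])
print(prompt)
print(path)
```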

Outputs

Name	Type	Description
doc_to_visual return	list	List with single PIL Image in RGB format
doc_to_text return	str	Formatted instruction text
process_results return	dict	Dict with 15 metric keys (overall + language/subset breakdowns)
aggregate functions return	float	Score averaged across filtered samples
calculate_dimensions return	tuple	(width, height, new_area) maintaining aspect ratio
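
The count of 15 metric keys follows from 3 overall scores plus 2 languages x 2 subsets x 3 dimensions. The sketch below enumerates the key names listed on this page; the function metric_keys is a hypothetical helper written for illustration and is not part of the module.

```python
from itertools import product

DIMENSIONS = ("semantics", "quality", "overall")
LANGUAGES = ("en", "cn")
SUBSETS = ("fullset", "intersection")


def metric_keys():
    """Hypothetical helper: list the metric key names documented here."""
    # Three overall keys, one per dimension.
    keys = [f"gedit_bench_{dim}_score" for dim in DIMENSIONS]
    # Twelve breakdown keys: language x subset x dimension.
    keys += [
        f"gedit_bench_{lang}_{subset}_{dim}"
        for lang, subset, dim in product(LANGUAGES, SUBSETS, DIMENSIONS)
    ]
    return keys


keys = metric_keys()
print(len(keys))  # 15
```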

Usage Examples

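# Example 1: Document processing
from PIL import Image
from lmms_eval.tasks.gedit_bench.utils import (
    gedit_bench_doc_to_visual,
    gedit_bench_doc_to_text,
)

doc = {
    "key": "sample_001",
    "input_image": Image.open("input.jpg"),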
    "instruction": "Change the background to a beach scene",
    "task_type": "background_change",
    "instruction_language": "en",
    "Intersection_exist": True
}

# Get visual input
visual = gedit_bench_doc_to_visual(doc)
# Returns: [<PIL.Image.Image in RGB>]

# Get instruction text
text = gedit_bench_doc_to_text(doc, {
    "pre_prompt": "Edit the image: ",
    "post_prompt": ""
})
# Returns: "Edit the image: Change the background to a beach scene"

# Example 2: Process model predictions with VIEScore
import json

# Model output (JSON format)
model_output = json.dumps({
    "text": "Image edited successfully",
    "images": ["./output/gedit_bench/sample_001.png"]
})

results = [model_output]
processed = gedit_bench_process_results(doc, results)
# Returns dict with 15 keys:
# - gedit_bench_semantics_score
# - gedit_bench_quality_score
# - gedit_bench_overall_score
# - gedit_bench_en_fullset_semantics
# - gedit_bench_en_fullset_quality
# - gedit_bench_en_fullset_overall
# - gedit_bench_en_intersection_semantics
# - gedit_bench_en_intersection_quality
# - gedit_bench_en_intersection_overall
# - gedit_bench_cn_fullset_semantics
# - gedit_bench_cn_fullset_quality
# - gedit_bench_cn_fullset_overall
# - gedit_bench_cn_intersection_semantics
# - gedit_bench_cn_intersection_quality
# - gedit_bench_cn_intersection_overall

# Each value contains: {key, task_type, instruction_language, intersection_exist, score}

# Example 3: Calculate dimensions for resizing
from lmms_eval.tasks.gedit_bench.utils import calculate_dimensions

width, height, area = calculate_dimensions(
    target_area=512 * 512,  # 262144 pixels
    ratio=16/9  # Aspect ratio
)
# Returns (width, height, new_area): width ≈ 682.67 and height ≈ 384 before integer conversion,
# so the 16:9 ratio is kept and the area stays close to 512 * 512 = 262144

# Example 4: Aggregate results
results_list = [
    {"key": "s1", "task_type": "background_change", "instruction_language": "en",
     "intersection_exist": True, "score": 0.85},
    {"key": "s2", "task_type": "color_alter", "instruction_language": "en",
     "intersection_exist": True, "score": 0.90},
    {"key": "s3", "task_type": "style_change", "instruction_language": "cn",
     "intersection_exist": False, "score": 0.75}
]

# Overall aggregation
overall_score = gedit_bench_aggregate_results(results_list)
# Returns: 0.8333 (average of all scores)
# Also logs breakdown by task_type, language, and intersection status

# English intersection only
en_intersection_score = gedit_bench_aggregate_en_intersection_semantics(results_list)
# Returns: 0.875 (average of s1 and s2 only)

# Chinese fullset only
cn_fullset_score = gedit_bench_aggregate_cn_fullset_overall(results_list)
# Returns: 0.75 (only s3)
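
The filtered averages in Example 4 can be reproduced with a small sketch of the filtering idea behind _aggregate_by_filter. The function average_by_filter below is a hypothetical stand-in: it assumes that intersection_only=True keeps only records whose intersection_exist is true, that the fullset applies no subset filter, and that an empty selection yields 0.0. The real helper may differ on any of these points.

```python
def average_by_filter(results, language=None, intersection_only=None):
    """Hypothetical stand-in for _aggregate_by_filter (behavior assumed)."""
    scores = []
    for record in results:
        # Skip records in another language when a language filter is set.
        if language is not None and record["instruction_language"] != language:
            continue
        # Skip records outside the intersection subset when requested.
        if intersection_only and not record["intersection_exist"]:
            continue
        scores.append(record["score"])
    # Returning 0.0 for an empty selection is an assumption of this sketch.
    return sum(scores) / len(scores) if scores else 0.0


sample = [
    {"key": "s1", "instruction_language": "en", "intersection_exist": True, "score": 0.85},
    {"key": "s2", "instruction_language": "en", "intersection_exist": True, "score": 0.90},
    {"key": "s3", "instruction_language": "cn", "intersection_exist": False, "score": 0.75},
]

overall = average_by_filter(sample)
en_intersection = average_by_filter(sample, language="en", intersection_only=True)
cn_fullset = average_by_filter(sample, language="cn")
print(round(overall, 4), round(en_intersection, 4), round(cn_fullset, 4))
```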

Related Pages

Principle:EvolvingLMMs_Lab_Lmms_eval_Task_Utility_Functions
