Implementation:EvolvingLMMs Lab Lmms eval GEdit Bench Utils
| Knowledge Sources | |
|---|---|
| Domains | Computer Vision, Image Editing, Benchmark Evaluation |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation utilities for GEdit-Bench that assess image editing models using VIEScore across 11 task categories.
Description
This module provides the evaluation infrastructure for GEdit-Bench (General Editing Benchmark), which evaluates image editing models on 11 task types: background change, color alter, material alter, motion change, photoshop human, style change, subject add, subject remove, subject replace, text change, and tone transfer. It uses VIEScore (Visual Instruction-guided Explainable Score) to assess edited images automatically and reports three scores: semantics (instruction following), quality (visual fidelity), and overall. The module handles document-to-input conversion, image resizing for evaluation, VIEScore computation, and metric aggregation across several breakdowns: language (English or Chinese), subset (fullset or intersection), and task type.
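The description does not say how the overall score relates to the other two. In the VIEScore paper it is the geometric mean of the semantic and perceptual-quality scores; the sketch below assumes this module follows that convention. `sketch_overall_score` is a hypothetical name used for illustration, not a function from utils.py.

```python
import math


def sketch_overall_score(semantics: float, quality: float) -> float:
    """Combine two sub-scores with a geometric mean.

    Assumption: mirrors the VIEScore paper's definition; the actual
    combination used in utils.py is not confirmed by this page.
    """
    return math.sqrt(semantics * quality)


# A zero in either sub-score drives the overall score to zero.
print(sketch_overall_score(4.0, 9.0))
print(sketch_overall_score(0.0, 10.0))
```

One consequence of a geometric mean is that an edit which ignores the instruction cannot be rescued by high visual quality, and vice versa.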
Usage
Use this module when evaluating image editing models on GEdit-Bench. The workflow has four steps: (1) call gedit_bench_doc_to_visual and gedit_bench_doc_to_text to prepare the input image and the editing instruction, (2) collect each model output as a JSON string that lists the paths of the edited images, (3) call gedit_bench_process_results to compute VIEScore metrics, and (4) call the aggregate functions to compute final scores for each breakdown. For consistent evaluation, the module resizes images to an area roughly equal to 512x512 pixels while preserving the aspect ratio.
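The resizing step can be illustrated with a short sketch that picks a width and height matching a given aspect ratio while keeping the pixel count near a target area. This is an illustration only: `sketch_calculate_dimensions` is a hypothetical name, and rounding to the nearest integer is an assumption (the real calculate_dimensions may truncate or snap to a multiple of some block size).

```python
import math
from typing import Tuple


def sketch_calculate_dimensions(target_area: int, ratio: float) -> Tuple[int, int, int]:
    """Return (width, height, area) with width / height close to `ratio`.

    Solves width * height = target_area and width / height = ratio,
    then rounds both sides to whole pixels (assumed rounding rule).
    """
    width = math.sqrt(target_area * ratio)
    height = width / ratio
    w, h = round(width), round(height)
    return w, h, w * h


# A square image maps exactly onto 512 x 512.
print(sketch_calculate_dimensions(512 * 512, 1.0))
```

Because of rounding, the resulting area is only approximately equal to the target for non-square ratios.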
Code Reference
Source Location
- Repository: EvolvingLMMs_Lab_Lmms_eval
- File: lmms_eval/tasks/gedit_bench/utils.py
- Lines: 1-384
Signature
# Document conversion functions
def gedit_bench_doc_to_visual(doc):
    """Extract input image from document."""

def gedit_bench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Extract instruction text with optional pre/post prompts."""

def gedit_bench_doc_to_target(doc):
    """Extract target instruction for reference."""

# Result processing functions
def gedit_bench_process_results(doc, results, **kwargs):
    """Process model predictions and evaluate using VIEScore."""

def gedit_bench_aggregate_results(results):
    """Aggregate overall results with detailed breakdowns."""

# Language and subset specific aggregations
def gedit_bench_aggregate_en_fullset_semantics(results):
    """Aggregate English fullset semantics scores."""

def gedit_bench_aggregate_en_fullset_quality(results):
    """Aggregate English fullset quality scores."""

def gedit_bench_aggregate_en_fullset_overall(results):
    """Aggregate English fullset overall scores."""

def gedit_bench_aggregate_en_intersection_semantics(results):
    """Aggregate English intersection semantics scores."""

def gedit_bench_aggregate_en_intersection_quality(results):
    """Aggregate English intersection quality scores."""

def gedit_bench_aggregate_en_intersection_overall(results):
    """Aggregate English intersection overall scores."""

def gedit_bench_aggregate_cn_fullset_semantics(results):
    """Aggregate Chinese fullset semantics scores."""

def gedit_bench_aggregate_cn_fullset_quality(results):
    """Aggregate Chinese fullset quality scores."""

def gedit_bench_aggregate_cn_fullset_overall(results):
    """Aggregate Chinese fullset overall scores."""

def gedit_bench_aggregate_cn_intersection_semantics(results):
    """Aggregate Chinese intersection semantics scores."""

def gedit_bench_aggregate_cn_intersection_quality(results):
    """Aggregate Chinese intersection quality scores."""

def gedit_bench_aggregate_cn_intersection_overall(results):
    """Aggregate Chinese intersection overall scores."""

# Helper functions
def calculate_dimensions(target_area, ratio):
    """Calculate dimensions maintaining aspect ratio."""

def _aggregate_by_filter(results, language=None, intersection_only=None):
    """Helper to aggregate scores with language and subset filters."""
Import
from lmms_eval.tasks.gedit_bench.utils import (
    gedit_bench_doc_to_visual,
    gedit_bench_doc_to_text,
    gedit_bench_process_results,
    gedit_bench_aggregate_results,
    gedit_bench_aggregate_en_fullset_overall,
    gedit_bench_aggregate_cn_intersection_semantics,
)
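The twelve language and subset aggregators listed under Signature presumably delegate to _aggregate_by_filter. A minimal sketch of such a helper is given here, assuming each result carries the fields instruction_language, intersection_exist, and score, that "fullset" means no subset filter, and that an empty selection yields 0.0. `sketch_aggregate_by_filter` is a hypothetical stand-in, not the code in utils.py.

```python
from typing import Dict, List, Optional


def sketch_aggregate_by_filter(
    results: List[Dict],
    language: Optional[str] = None,
    intersection_only: Optional[bool] = None,
) -> float:
    """Average `score` over the results that pass both filters."""
    scores = []
    for r in results:
        # Language filter: skip samples in another language.
        if language is not None and r.get("instruction_language") != language:
            continue
        # Subset filter: only applied when intersection_only is truthy.
        if intersection_only and not r.get("intersection_exist"):
            continue
        scores.append(r["score"])
    # Assumed convention: no matching samples gives 0.0.
    return sum(scores) / len(scores) if scores else 0.0


samples = [
    {"instruction_language": "en", "intersection_exist": True, "score": 0.85},
    {"instruction_language": "en", "intersection_exist": True, "score": 0.90},
    {"instruction_language": "cn", "intersection_exist": False, "score": 0.75},
]
print(sketch_aggregate_by_filter(samples, language="en", intersection_only=True))
```

Under these assumptions, an English intersection aggregator would amount to calling the helper with language="en" and intersection_only=True.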
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| doc | dict | Yes | Document with input_image, instruction, key, task_type, instruction_language, Intersection_exist |
| results | list | Yes | List with single JSON string containing {"text": "...", "images": ["path.png"]} |
| lmms_eval_specific_kwargs | dict | No | Optional pre_prompt and post_prompt for instruction formatting |
| target_area | int | Yes | Target pixel area for resizing (typically 512*512) |
| ratio | float | Yes | Aspect ratio (width/height) of image |
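The results entry in the table is a list holding one JSON string. A defensive way to pull the edited image path out of it might look like the following sketch; `sketch_extract_image_path` is a hypothetical helper, and returning None on malformed input is an assumption about error handling, not documented behavior of gedit_bench_process_results.

```python
import json
from typing import List, Optional


def sketch_extract_image_path(results: List[str]) -> Optional[str]:
    """Return the first path in the "images" field, or None if unavailable."""
    try:
        payload = json.loads(results[0])
    except (IndexError, TypeError, json.JSONDecodeError):
        # Empty list, non-string entry, or invalid JSON.
        return None
    if not isinstance(payload, dict):
        return None
    images = payload.get("images")
    if isinstance(images, list) and images:
        return images[0]
    return None


output = json.dumps({"text": "Image edited successfully",
                     "images": ["./output/gedit_bench/sample_001.png"]})
print(sketch_extract_image_path([output]))
```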
Outputs
| Name | Type | Description |
|---|---|---|
| doc_to_visual return | list | List with single PIL Image in RGB format |
| doc_to_text return | str | Formatted instruction text |
| process_results return | dict | Dict with 15 metric keys (overall + language/subset breakdowns) |
| aggregate functions return | float | Score averaged across filtered samples |
| calculate_dimensions return | tuple | (width, height, new_area) maintaining aspect ratio |
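The count of 15 metric keys follows from 3 overall scores plus 2 languages x 2 subsets x 3 score types. The sketch below enumerates them using the key names listed in Example 2 of the usage examples; `sketch_metric_keys` is a hypothetical helper written for illustration.

```python
from itertools import product
from typing import List

DIMENSIONS = ("semantics", "quality", "overall")
LANGUAGES = ("en", "cn")
SUBSETS = ("fullset", "intersection")


def sketch_metric_keys() -> List[str]:
    """Build the metric key names: 3 overall keys, then 12 breakdown keys."""
    keys = [f"gedit_bench_{dim}_score" for dim in DIMENSIONS]
    for lang, subset, dim in product(LANGUAGES, SUBSETS, DIMENSIONS):
        keys.append(f"gedit_bench_{lang}_{subset}_{dim}")
    return keys


for key in sketch_metric_keys():
    print(key)
```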
Usage Examples
# Example 1: Document processing
from PIL import Image

doc = {
    "key": "sample_001",
    "input_image": Image.open("input.jpg"),
    "instruction": "Change the background to a beach scene",
    "task_type": "background_change",
    "instruction_language": "en",
    "Intersection_exist": True,
}
# Get visual input
visual = gedit_bench_doc_to_visual(doc)
# Returns: [<PIL.Image.Image in RGB>]
# Get instruction text
text = gedit_bench_doc_to_text(doc, {
    "pre_prompt": "Edit the image: ",
    "post_prompt": "",
})
# Returns: "Edit the image: Change the background to a beach scene"
# Example 2: Process model predictions with VIEScore
import json
# Model output (JSON format)
model_output = json.dumps({
    "text": "Image edited successfully",
    "images": ["./output/gedit_bench/sample_001.png"],
})
results = [model_output]
processed = gedit_bench_process_results(doc, results)
# Returns dict with 15 keys:
# - gedit_bench_semantics_score
# - gedit_bench_quality_score
# - gedit_bench_overall_score
# - gedit_bench_en_fullset_semantics
# - gedit_bench_en_fullset_quality
# - gedit_bench_en_fullset_overall
# - gedit_bench_en_intersection_semantics
# - gedit_bench_en_intersection_quality
# - gedit_bench_en_intersection_overall
# - gedit_bench_cn_fullset_semantics
# - gedit_bench_cn_fullset_quality
# - gedit_bench_cn_fullset_overall
# - gedit_bench_cn_intersection_semantics
# - gedit_bench_cn_intersection_quality
# - gedit_bench_cn_intersection_overall
# Each value contains: {key, task_type, instruction_language, intersection_exist, score}
# Example 3: Calculate dimensions for resizing
from lmms_eval.tasks.gedit_bench.utils import calculate_dimensions
width, height, area = calculate_dimensions(
    target_area=512 * 512,  # 262144 pixels
    ratio=16 / 9,           # Aspect ratio (width / height)
)
# Returns (width, height, area): about 682 x 383 here, preserving 16:9 with an area close to 512 * 512
# Example 4: Aggregate results
results_list = [
    {"key": "s1", "task_type": "background_change", "instruction_language": "en",
     "intersection_exist": True, "score": 0.85},
    {"key": "s2", "task_type": "color_alter", "instruction_language": "en",
     "intersection_exist": True, "score": 0.90},
    {"key": "s3", "task_type": "style_change", "instruction_language": "cn",
     "intersection_exist": False, "score": 0.75},
]
# Overall aggregation
overall_score = gedit_bench_aggregate_results(results_list)
# Returns: 0.8333 (average of all scores)
# Also logs breakdown by task_type, language, and intersection status
# English intersection only
en_intersection_score = gedit_bench_aggregate_en_intersection_semantics(results_list)
# Returns: 0.875 (average of s1 and s2 only)
# Chinese fullset only
cn_fullset_score = gedit_bench_aggregate_cn_fullset_overall(results_list)
# Returns: 0.75 (only s3)
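Example 4 notes that the overall aggregation also logs a breakdown by task type. How that breakdown is computed is not shown here; one plausible way, assuming a plain per-group mean, is sketched below. `sketch_scores_by_task_type` is a hypothetical name, not part of utils.py.

```python
from collections import defaultdict
from typing import Dict, List


def sketch_scores_by_task_type(results: List[Dict]) -> Dict[str, float]:
    """Group result entries by task_type and average the score in each group."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["task_type"]].append(r["score"])
    return {task: sum(vals) / len(vals) for task, vals in buckets.items()}


demo = [
    {"task_type": "background_change", "score": 0.85},
    {"task_type": "background_change", "score": 0.65},
    {"task_type": "style_change", "score": 0.75},
]
for task, mean in sketch_scores_by_task_type(demo).items():
    print(f"{task}: {mean:.4f}")
```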