Implementation:EvolvingLMMs Lab Lmms eval StructEditBench 2Stage Eval

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Vision, Evaluation, Image_Editing
Last Updated	2026-02-14 00:00 GMT

Overview

Two-stage evaluation pipeline for StructEditBench that assesses structured visual editing quality through vision-QA followed by text-only LLM judging of answer correctness.

Description

This module implements the official StructEditBench evaluation protocol as a two-stage pipeline: (1) the Vision-QA stage asks verification questions about the edited image and collects the model's responses; (2) the Judge stage uses a text-only LLM to decide whether each response matches the ground truth. It reports three metrics: editing_accuracy (how well the requested edits were applied), maintain_accuracy (how well unedited elements were preserved), and weighted_accuracy (0.9 * editing + 0.1 * maintain). It supports OpenAI-compatible APIs, including local vLLM servers, and evaluates six categories: chart, math, graph, puzzle, science, and table.
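
The weighting described above can be illustrated with a small sketch. The helper name `weighted_accuracy` and the per-question record layout (`label`, `correct`) are assumptions made for illustration, not the module's internals. How the real code treats a sample with no questions of one label is not stated here, so this sketch simply scores an empty group as 0.

```python
from typing import Dict, List

EDITING_WEIGHT = 0.9   # weight stated in the description
MAINTAIN_WEIGHT = 0.1  # weight stated in the description


def _accuracy(flags: List[bool]) -> float:
    """Percentage of True entries; 0.0 for an empty list (assumption)."""
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def weighted_accuracy(qa_results: List[Dict]) -> Dict[str, float]:
    """Hypothetical helper: turn per-question verdicts into the three metrics."""
    editing = [r["correct"] for r in qa_results if r["label"] == "editing"]
    maintain = [r["correct"] for r in qa_results if r["label"] == "maintain"]
    editing_acc = _accuracy(editing)
    maintain_acc = _accuracy(maintain)
    return {
        "editing_accuracy": editing_acc,
        "maintain_accuracy": maintain_acc,
        "weighted_accuracy": EDITING_WEIGHT * editing_acc
        + MAINTAIN_WEIGHT * maintain_acc,
    }


# Two editing questions (one correct) and one maintain question (correct).
example = weighted_accuracy([
    {"label": "editing", "correct": True},
    {"label": "editing", "correct": False},
    {"label": "maintain", "correct": True},
])
print(example)
```

Because editing questions carry nine times the weight of maintain questions, a model that preserves the image perfectly but applies no edits would score at most 10 under this formula.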

Usage

Use this module when evaluating image editing models on the StructEditBench benchmark. Configure it through environment variables: STRUCTEDITBENCH_API_KEY (use "EMPTY" for local vLLM), STRUCTEDITBENCH_BASE_URL (API endpoint), STRUCTEDITBENCH_EVAL_MODEL_NAME (model for both stages), and the optional STRUCTEDITBENCH_JUDGE_MODEL_NAME (separate judge model). Both inference-time evaluation and post-hoc evaluation of saved edited images are supported.

Code Reference

Source Location

Repository: EvolvingLMMs_Lab_Lmms_eval
File: lmms_eval/tasks/structeditbench/utils.py

Signature

def structeditbench_doc_to_visual(doc: Dict) -> List[Image.Image]
def structeditbench_doc_to_text(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str

def structeditbench_process_results(
    doc: Dict,
    results: List[Any],
    **kwargs
) -> Dict[str, Dict]

def structeditbench_aggregate_score(results: List[Dict]) -> float
def structeditbench_aggregate_chart(results: List[Dict]) -> float
def structeditbench_aggregate_math(results: List[Dict]) -> float
def structeditbench_aggregate_graph(results: List[Dict]) -> float
def structeditbench_aggregate_puzzle(results: List[Dict]) -> float
def structeditbench_aggregate_science(results: List[Dict]) -> float
def structeditbench_aggregate_table(results: List[Dict]) -> float

def image_to_base64(image: Any) -> Optional[str]

Import

from lmms_eval.tasks.structeditbench.utils import (
    structeditbench_doc_to_visual,
    structeditbench_doc_to_text,
    structeditbench_process_results,
    structeditbench_aggregate_score
)

I/O Contract

Environment Variables

Variable	Type	Description
STRUCTEDITBENCH_API_KEY	str	API key (use "EMPTY" for local vLLM servers)
STRUCTEDITBENCH_BASE_URL	str	API endpoint URL (e.g., http://localhost:8000/v1)
STRUCTEDITBENCH_EVAL_MODEL_NAME	str	Model for both QA and judge stages (default: "default")
STRUCTEDITBENCH_JUDGE_MODEL_NAME	str	Optional separate judge model
STRUCTEDITBENCH_TIMEOUT	int	API timeout in seconds (default: 180)
STRUCTEDITBENCH_MAX_RETRIES	int	Retry count for transient errors (default: 3)
STRUCTEDITBENCH_CALL_DELAY	float	Delay between API calls in seconds (default: 0.5)
STRUCTEDITBENCH_QA_MAX_TOKENS	int	Max tokens for QA responses (default: 128)
STRUCTEDITBENCH_JUDGE_MAX_TOKENS	int	Max tokens for judge responses (default: 16)
STRUCTEDITBENCH_MAX_QA	int	Optional cap on qa_list length
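
As a rough illustration of how these variables and their documented defaults could be read, the sketch below collects them into one object. `EvalConfig` and `load_config` are hypothetical names, not part of the module. The fallback of the judge model to the eval model is inferred from the table's wording ("model for both QA and judge stages") and should be treated as an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

PREFIX = "STRUCTEDITBENCH_"


@dataclass
class EvalConfig:
    """Hypothetical container for the documented settings."""
    api_key: Optional[str]
    base_url: Optional[str]
    eval_model: str
    judge_model: str
    timeout: int
    max_retries: int
    call_delay: float
    qa_max_tokens: int
    judge_max_tokens: int
    max_qa: Optional[int]


def load_config(env: Mapping[str, str] = os.environ) -> EvalConfig:
    """Read settings from a mapping, applying the defaults listed in the table."""
    eval_model = env.get(PREFIX + "EVAL_MODEL_NAME", "default")
    max_qa = env.get(PREFIX + "MAX_QA")
    return EvalConfig(
        api_key=env.get(PREFIX + "API_KEY"),
        base_url=env.get(PREFIX + "BASE_URL"),
        eval_model=eval_model,
        # Assumption: without a separate judge model, reuse the eval model.
        judge_model=env.get(PREFIX + "JUDGE_MODEL_NAME") or eval_model,
        timeout=int(env.get(PREFIX + "TIMEOUT", "180")),
        max_retries=int(env.get(PREFIX + "MAX_RETRIES", "3")),
        call_delay=float(env.get(PREFIX + "CALL_DELAY", "0.5")),
        qa_max_tokens=int(env.get(PREFIX + "QA_MAX_TOKENS", "128")),
        judge_max_tokens=int(env.get(PREFIX + "JUDGE_MAX_TOKENS", "16")),
        max_qa=int(max_qa) if max_qa else None,
    )


# Passing an explicit mapping keeps the example independent of the real environment.
cfg = load_config({"STRUCTEDITBENCH_TIMEOUT": "300"})
print(cfg)
```

Environment variables are always strings, so numeric settings such as the timeout need an explicit conversion, which is why the usage example sets `STRUCTEDITBENCH_TIMEOUT` to `"300"` rather than `300`.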

structeditbench_process_results Input

Parameter	Type	Description
doc	Dict	Dataset sample with keys: instruction, source_image, qa_list, category, key/id
results	List[Any]	Model output whose "images" field contains the edited image path (layout shown in Usage Examples)
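
To make the two-stage flow over these inputs concrete, here is a minimal sketch that walks `qa_list`, asks each question about the edited image, and then has a judge compare the answer with the ground truth. It is not the module's code: `run_two_stage`, `qa_fn`, and `judge_fn` are hypothetical names, and the two stages are injected as plain callables so the example runs without an API server. The `doc` and `results` layouts follow the usage example on this page.

```python
from typing import Callable, Dict, List

QAFn = Callable[[str, str], str]           # (image_path, question) -> answer
JudgeFn = Callable[[str, str, str], bool]  # (question, ground_truth, answer) -> verdict


def run_two_stage(doc: Dict, results: List[Dict],
                  qa_fn: QAFn, judge_fn: JudgeFn) -> List[Dict]:
    """Hypothetical driver: stage 1 (vision QA), then stage 2 (text-only judge)."""
    edited_path = results[0]["images"][0]["path"]
    qa_results = []
    for qa in doc["qa_list"]:
        # Stage 1: the vision model sees the edited image and the question.
        answer = qa_fn(edited_path, qa["question"])
        # Stage 2: the judge sees only text, never the image.
        correct = judge_fn(qa["question"], qa["ground_truth_answer"], answer)
        qa_results.append({
            "question": qa["question"],
            "label": qa["label"],
            "answer": answer,
            "correct": correct,
        })
    return qa_results


# Stand-ins for the real API calls, used only to exercise the control flow.
def fake_qa(image_path: str, question: str) -> str:
    return "red" if "color" in question else "no"


def fake_judge(question: str, ground_truth: str, answer: str) -> bool:
    return answer.strip().lower() == ground_truth.strip().lower()


doc = {
    "qa_list": [
        {"question": "What color are the bars?",
         "ground_truth_answer": "red", "label": "editing"},
        {"question": "Are the axis labels preserved?",
         "ground_truth_answer": "yes", "label": "maintain"},
    ]
}
results = [{"images": [{"path": "/path/to/edited_image.png"}]}]
verdicts = run_two_stage(doc, results, fake_qa, fake_judge)
print(verdicts)
```

Keeping the judge text-only means its verdict depends solely on the question, the ground truth, and the stage-one answer, so a cheaper model can be used for it, as the optional STRUCTEDITBENCH_JUDGE_MODEL_NAME setting allows.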

structeditbench_process_results Output

Metric Key	Fields	Description
structeditbench_weighted_accuracy	key, category, score, qa_results, edited_image_path, num_qa	Weighted score (0.9 * editing + 0.1 * maintain)
structeditbench_editing_accuracy	key, category, score	Accuracy on editing verification questions
structeditbench_maintain_accuracy	key, category, score	Accuracy on preservation verification questions
structeditbench_chart_weighted_accuracy	key, category, score	Per-category weighted score (chart)
structeditbench_math_weighted_accuracy	key, category, score	Per-category weighted score (math)
structeditbench_graph_weighted_accuracy	key, category, score	Per-category weighted score (graph)
structeditbench_puzzle_weighted_accuracy	key, category, score	Per-category weighted score (puzzle)
structeditbench_science_weighted_accuracy	key, category, score	Per-category weighted score (science)
structeditbench_table_weighted_accuracy	key, category, score	Per-category weighted score (table)
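
The per-sample entries above are later reduced to a single number by the aggregate functions. The page does not say how that reduction is done, so the sketch below assumes a plain mean of `score`, optionally restricted to one category. `aggregate_mean` is a hypothetical stand-in, not the behaviour of `structeditbench_aggregate_score` or its per-category variants.

```python
from typing import Dict, List, Optional


def aggregate_mean(results: List[Dict], category: Optional[str] = None) -> float:
    """Assumed aggregation: mean score, optionally filtered by category."""
    scores = [
        r["score"]
        for r in results
        if category is None or r.get("category") == category
    ]
    # Return 0.0 when nothing matches, to avoid dividing by zero (assumption).
    return sum(scores) / len(scores) if scores else 0.0


all_results = [
    {"key": "001", "category": "chart", "score": 85.0},
    {"key": "002", "category": "chart", "score": 90.0},
    {"key": "003", "category": "math", "score": 75.0},
]
overall = aggregate_mean(all_results)
chart = aggregate_mean(all_results, category="chart")
print(f"Overall: {overall:.2f}, Chart: {chart:.2f}")
```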

Usage Examples

import os
from lmms_eval.tasks.structeditbench.utils import (
    structeditbench_process_results,
    structeditbench_aggregate_score,
    structeditbench_aggregate_chart
)

# Configure evaluation server (local vLLM example)
os.environ["STRUCTEDITBENCH_API_KEY"] = "EMPTY"
os.environ["STRUCTEDITBENCH_BASE_URL"] = "http://localhost:8000/v1"
os.environ["STRUCTEDITBENCH_EVAL_MODEL_NAME"] = "Qwen/Qwen2-VL-7B-Instruct"
os.environ["STRUCTEDITBENCH_TIMEOUT"] = "300"

# Process single sample
doc = {
    "key": "chart_001",
    "category": "chart",
    "instruction": "Change the bar color to red",
    "source_image": pil_image,  # a PIL.Image.Image loaded beforehand
    "qa_list": [
        {"question": "What color are the bars?", "ground_truth_answer": "red", "label": "editing"},
        {"question": "Are the axis labels preserved?", "ground_truth_answer": "yes", "label": "maintain"}
    ]
}

# Model output with edited image path
results = [{"images": [{"path": "/path/to/edited_image.png"}]}]

scores = structeditbench_process_results(doc, results)
print(f"Weighted Accuracy: {scores['structeditbench_weighted_accuracy']['score']:.2f}%")
print(f"Editing Accuracy: {scores['structeditbench_editing_accuracy']['score']:.2f}%")
print(f"Maintain Accuracy: {scores['structeditbench_maintain_accuracy']['score']:.2f}%")

# Aggregate across dataset
all_results = [
    {"key": "001", "category": "chart", "score": 85.0},
    {"key": "002", "category": "chart", "score": 90.0},
    {"key": "003", "category": "math", "score": 75.0}
]
overall_score = structeditbench_aggregate_score(all_results)
chart_score = structeditbench_aggregate_chart(all_results)
print(f"Overall: {overall_score:.2f}%, Chart: {chart_score:.2f}%")

# Use OpenAI API instead of local server
os.environ["STRUCTEDITBENCH_API_KEY"] = "sk-..."
os.environ["STRUCTEDITBENCH_BASE_URL"] = "https://api.openai.com/v1"
os.environ["STRUCTEDITBENCH_EVAL_MODEL_NAME"] = "gpt-4o"
os.environ["STRUCTEDITBENCH_JUDGE_MODEL_NAME"] = "gpt-4o-mini"  # Use cheaper model for judging
