Implementation:EvolvingLMMs Lab Lmms eval StructEditBench 2Stage Eval

From Leeroopedia
Knowledge Sources
Domains Vision, Evaluation, Image_Editing
Last Updated 2026-02-14 00:00 GMT

Overview

Two-stage evaluation pipeline for StructEditBench that assesses structured visual editing quality through vision-QA followed by text-only LLM judging of answer correctness.

Description

This module implements the official StructEditBench evaluation protocol as a two-stage pipeline. (1) The Vision-QA stage queries the edited image with verification questions and collects the model's responses. (2) The Judge stage uses a text-only LLM to decide whether each response matches the ground truth. The module reports three metrics: editing_accuracy (how well the requested edits were applied), maintain_accuracy (how well unedited elements were preserved), and weighted_accuracy (0.9 * editing + 0.1 * maintain). It supports OpenAI-compatible APIs, including local vLLM servers, and evaluates six categories: chart, math, graph, puzzle, science, and table.
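The two stages and the scoring rule can be sketched as follows. This is a minimal illustration, not the module's own code: `evaluate_sample`, `ask_vision`, and `ask_judge` are hypothetical names, the "verdict starts with yes" parsing is an assumption, and so are the 0-100 score scale and the choice to score an empty question group as 0.0. The `qa_list` fields mirror the usage example on this page.

```python
from typing import Callable, Dict, List


def evaluate_sample(
    qa_list: List[Dict[str, str]],
    ask_vision: Callable[[str], str],
    ask_judge: Callable[[str, str, str], str],
) -> Dict[str, float]:
    """Run both stages over one sample's questions and score the outcome."""
    correct: Dict[str, List[bool]] = {"editing": [], "maintain": []}
    for qa in qa_list:
        # Stage 1 (Vision-QA): ask the question about the edited image.
        response = ask_vision(qa["question"])
        # Stage 2 (Judge): a text-only model compares response and ground truth.
        verdict = ask_judge(qa["question"], qa["ground_truth_answer"], response)
        # Assumption: the judge answers with a short verdict beginning with "yes".
        is_correct = verdict.strip().lower().startswith("yes")
        correct.setdefault(qa["label"], []).append(is_correct)

    def pct(flags: List[bool]) -> float:
        # Assumption: a group with no questions scores 0.0.
        return 100.0 * sum(flags) / len(flags) if flags else 0.0

    editing = pct(correct["editing"])
    maintain = pct(correct["maintain"])
    return {
        "editing_accuracy": editing,
        "maintain_accuracy": maintain,
        "weighted_accuracy": 0.9 * editing + 0.1 * maintain,
    }


# Stub models so the sketch runs without any API server.
def fake_vision(question: str) -> str:
    return "red" if "color" in question else "no"


def fake_judge(question: str, truth: str, response: str) -> str:
    return "yes" if truth == response else "no"


qa_list = [
    {"question": "What color are the bars?", "ground_truth_answer": "red", "label": "editing"},
    {"question": "Are the axis labels preserved?", "ground_truth_answer": "yes", "label": "maintain"},
]
scores = evaluate_sample(qa_list, fake_vision, fake_judge)
```

Passing the two stages in as callables keeps the scoring logic testable without a running endpoint; in a real run they would wrap calls to the configured OpenAI-compatible API.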

Usage

Use this module when evaluating image editing models on the StructEditBench benchmark. Configure it through environment variables: STRUCTEDITBENCH_API_KEY (use "EMPTY" for local vLLM), STRUCTEDITBENCH_BASE_URL (API endpoint), STRUCTEDITBENCH_EVAL_MODEL_NAME (model for both stages), and the optional STRUCTEDITBENCH_JUDGE_MODEL_NAME (separate judge model). It supports both inference-time evaluation and post-hoc evaluation of saved edited images.
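As a rough sketch of how these four variables might be resolved, the snippet below reads them into a small config object. `EvalConfig` and `load_config` are hypothetical names, not part of the module. The "default" model name and the judge falling back to the evaluation model follow the descriptions on this page; the fallbacks used for the API key and base URL are placeholders chosen for this example only.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class EvalConfig:
    api_key: str
    base_url: str
    eval_model: str
    judge_model: str


def load_config(env: Mapping[str, str] = os.environ) -> EvalConfig:
    """Resolve the evaluation settings from environment-style variables."""
    eval_model = env.get("STRUCTEDITBENCH_EVAL_MODEL_NAME", "default")
    # The judge model is optional; without it, one model serves both stages.
    judge_model = env.get("STRUCTEDITBENCH_JUDGE_MODEL_NAME") or eval_model
    return EvalConfig(
        # Placeholder fallbacks: the page documents no defaults for these two.
        api_key=env.get("STRUCTEDITBENCH_API_KEY", "EMPTY"),
        base_url=env.get("STRUCTEDITBENCH_BASE_URL", ""),
        eval_model=eval_model,
        judge_model=judge_model,
    )


# Local vLLM style configuration, passed as a plain dict for illustration.
config = load_config({
    "STRUCTEDITBENCH_API_KEY": "EMPTY",
    "STRUCTEDITBENCH_BASE_URL": "http://localhost:8000/v1",
    "STRUCTEDITBENCH_EVAL_MODEL_NAME": "Qwen/Qwen2-VL-7B-Instruct",
})
```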

Code Reference

Source Location

Signature

def structeditbench_doc_to_visual(doc: Dict) -> List[Image.Image]
def structeditbench_doc_to_text(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str

def structeditbench_process_results(
    doc: Dict,
    results: List[Any],
    **kwargs
) -> Dict[str, Dict]

def structeditbench_aggregate_score(results: List[Dict]) -> float
def structeditbench_aggregate_chart(results: List[Dict]) -> float
def structeditbench_aggregate_math(results: List[Dict]) -> float
def structeditbench_aggregate_graph(results: List[Dict]) -> float
def structeditbench_aggregate_puzzle(results: List[Dict]) -> float
def structeditbench_aggregate_science(results: List[Dict]) -> float
def structeditbench_aggregate_table(results: List[Dict]) -> float

def image_to_base64(image: Any) -> Optional[str]
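For orientation, a helper with the shape of `image_to_base64` could look roughly like the sketch below. It is not the module's implementation: `to_base64` and `image_content_part` are hypothetical names, and the accepted input types (raw bytes, or any object with a PIL-style `save` method) and the PNG encoding are assumptions. The `image_url` content part with a base64 data URL is the form that OpenAI-compatible chat endpoints accept for inline images.

```python
import base64
import io
from typing import Any, Dict, Optional


def to_base64(image: Any) -> Optional[str]:
    """Encode an image as a base64 string, or return None if unsupported."""
    if image is None:
        return None
    if isinstance(image, (bytes, bytearray)):
        raw = bytes(image)
    elif hasattr(image, "save"):
        # Duck-typed PIL.Image.Image: serialize to PNG in memory.
        buffer = io.BytesIO()
        image.save(buffer, format="PNG")
        raw = buffer.getvalue()
    else:
        return None
    return base64.b64encode(raw).decode("ascii")


def image_content_part(b64: str, mime: str = "image/png") -> Dict[str, Any]:
    """Wrap a base64 string as an image part of a chat message."""
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


encoded = to_base64(b"abc")
part = image_content_part(encoded) if encoded is not None else None
```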

Import

from lmms_eval.tasks.structeditbench.utils import (
    structeditbench_doc_to_visual,
    structeditbench_doc_to_text,
    structeditbench_process_results,
    structeditbench_aggregate_score
)

I/O Contract

Environment Variables

Variable | Type | Description
STRUCTEDITBENCH_API_KEY | str | API key (use "EMPTY" for local vLLM servers)
STRUCTEDITBENCH_BASE_URL | str | API endpoint URL (e.g., http://localhost:8000/v1)
STRUCTEDITBENCH_EVAL_MODEL_NAME | str | Model for both QA and judge stages (default: "default")
STRUCTEDITBENCH_JUDGE_MODEL_NAME | str | Optional separate judge model
STRUCTEDITBENCH_TIMEOUT | int | API timeout in seconds (default: 180)
STRUCTEDITBENCH_MAX_RETRIES | int | Retry count for transient errors (default: 3)
STRUCTEDITBENCH_CALL_DELAY | float | Delay between API calls in seconds (default: 0.5)
STRUCTEDITBENCH_QA_MAX_TOKENS | int | Max tokens for QA responses (default: 128)
STRUCTEDITBENCH_JUDGE_MAX_TOKENS | int | Max tokens for judge responses (default: 16)
STRUCTEDITBENCH_MAX_QA | int | Optional cap on qa_list length
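The retry and delay settings suggest a wrapper along the lines of the sketch below. `call_with_retries` is a hypothetical helper, not a function of the module. It assumes one initial attempt plus up to STRUCTEDITBENCH_MAX_RETRIES retries, uses STRUCTEDITBENCH_CALL_DELAY as the pause between attempts, and treats every exception as transient; the real code may draw these lines differently.

```python
import os
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: Optional[int] = None,
    call_delay: Optional[float] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on failure with a fixed pause between attempts."""
    if max_retries is None:
        max_retries = int(os.environ.get("STRUCTEDITBENCH_MAX_RETRIES", "3"))
    if call_delay is None:
        call_delay = float(os.environ.get("STRUCTEDITBENCH_CALL_DELAY", "0.5"))

    attempts = max(1, max_retries + 1)  # always try at least once
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # assumption: every error is retryable
            last_error = exc
            if attempt < attempts - 1:
                sleep(call_delay)
    assert last_error is not None
    raise last_error


# Demo: a call that fails twice before succeeding.
state = {"calls": 0}


def flaky() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"


pauses = []
result = call_with_retries(flaky, max_retries=3, call_delay=0.0, sleep=pauses.append)
```

Injecting `sleep` keeps the demo instant and makes the pause behavior observable.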

structeditbench_process_results Input

Parameter | Type | Description
doc | Dict | Dataset sample with keys: instruction, source_image, qa_list, category, key/id
results | List[Any] | Model output with "images" field containing edited image path

structeditbench_process_results Output

Metric Key | Fields | Description
structeditbench_weighted_accuracy | key, category, score, qa_results, edited_image_path, num_qa | Weighted score (0.9*editing + 0.1*maintain)
structeditbench_editing_accuracy | key, category, score | Accuracy on editing verification questions
structeditbench_maintain_accuracy | key, category, score | Accuracy on preservation verification questions
structeditbench_chart_weighted_accuracy | key, category, score | Per-category weighted score (chart)
structeditbench_math_weighted_accuracy | key, category, score | Per-category weighted score (math)
structeditbench_graph_weighted_accuracy | key, category, score | Per-category weighted score (graph)
structeditbench_puzzle_weighted_accuracy | key, category, score | Per-category weighted score (puzzle)
structeditbench_science_weighted_accuracy | key, category, score | Per-category weighted score (science)
structeditbench_table_weighted_accuracy | key, category, score | Per-category weighted score (table)
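The per-sample entries above are later reduced to a single number per metric. A plausible reduction, sketched here under the assumption that aggregation is an unweighted mean of the `score` field, is shown below. `mean_score` is a hypothetical helper, and returning 0.0 for an empty selection is also an assumption; the module's `structeditbench_aggregate_*` functions may behave differently.

```python
from typing import Dict, List, Optional


def mean_score(results: List[Dict], category: Optional[str] = None) -> float:
    """Average the 'score' field, optionally restricted to one category."""
    scores = [
        r["score"]
        for r in results
        if category is None or r.get("category") == category
    ]
    # Assumption: an empty selection aggregates to 0.0 instead of raising.
    return sum(scores) / len(scores) if scores else 0.0


sample_results = [
    {"key": "001", "category": "chart", "score": 85.0},
    {"key": "002", "category": "chart", "score": 90.0},
    {"key": "003", "category": "math", "score": 75.0},
]
overall = mean_score(sample_results)
chart_only = mean_score(sample_results, category="chart")
```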

Usage Examples

import os

from PIL import Image

from lmms_eval.tasks.structeditbench.utils import (
    structeditbench_process_results,
    structeditbench_aggregate_score,
    structeditbench_aggregate_chart,
)

# Configure evaluation server (local vLLM example)
os.environ["STRUCTEDITBENCH_API_KEY"] = "EMPTY"
os.environ["STRUCTEDITBENCH_BASE_URL"] = "http://localhost:8000/v1"
os.environ["STRUCTEDITBENCH_EVAL_MODEL_NAME"] = "Qwen/Qwen2-VL-7B-Instruct"
os.environ["STRUCTEDITBENCH_TIMEOUT"] = "300"

# Load the source image (placeholder path; replace with a real file)
pil_image = Image.open("/path/to/source_image.png")

# Process single sample
doc = {
    "key": "chart_001",
    "category": "chart",
    "instruction": "Change the bar color to red",
    "source_image": pil_image,
    "qa_list": [
        {"question": "What color are the bars?", "ground_truth_answer": "red", "label": "editing"},
        {"question": "Are the axis labels preserved?", "ground_truth_answer": "yes", "label": "maintain"}
    ]
}

# Model output with edited image path
results = [{"images": [{"path": "/path/to/edited_image.png"}]}]

scores = structeditbench_process_results(doc, results)
print(f"Weighted Accuracy: {scores['structeditbench_weighted_accuracy']['score']:.2f}%")
print(f"Editing Accuracy: {scores['structeditbench_editing_accuracy']['score']:.2f}%")
print(f"Maintain Accuracy: {scores['structeditbench_maintain_accuracy']['score']:.2f}%")

# Aggregate across dataset
all_results = [
    {"key": "001", "category": "chart", "score": 85.0},
    {"key": "002", "category": "chart", "score": 90.0},
    {"key": "003", "category": "math", "score": 75.0}
]
overall_score = structeditbench_aggregate_score(all_results)
chart_score = structeditbench_aggregate_chart(all_results)
print(f"Overall: {overall_score:.2f}%, Chart: {chart_score:.2f}%")

# Use OpenAI API instead of local server
os.environ["STRUCTEDITBENCH_API_KEY"] = "sk-..."
os.environ["STRUCTEDITBENCH_BASE_URL"] = "https://api.openai.com/v1"
os.environ["STRUCTEDITBENCH_EVAL_MODEL_NAME"] = "gpt-4o"
os.environ["STRUCTEDITBENCH_JUDGE_MODEL_NAME"] = "gpt-4o-mini"  # Use cheaper model for judging
