Overview
Two-stage evaluation pipeline for StructEditBench that assesses structured visual editing quality through vision-QA followed by text-only LLM judging of answer correctness.
Description
This module implements the official StructEditBench evaluation protocol as a two-stage pipeline: (1) the Vision-QA stage asks a vision-language model each verification question about the edited image and records its response; (2) the Judge stage uses a text-only LLM to decide whether each response matches the ground-truth answer. It reports three metrics: editing_accuracy (how well the requested edits were applied), maintain_accuracy (how well unedited elements were preserved), and weighted_accuracy (0.9 * editing_accuracy + 0.1 * maintain_accuracy). It works with OpenAI-compatible APIs, including local vLLM servers, and covers six categories: chart, math, graph, puzzle, science, and table.
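The two stages and the weighting can be sketched as follows. This is a minimal illustration, not the module's actual implementation: `score_sample`, `ask_vision`, and `judge` are hypothetical names, the two callables stand in for the real API calls, and scoring an empty question group as 0.0 is an assumption. The field names `question`, `ground_truth_answer`, and `label` follow the qa_list entries shown in the usage example.

```python
from typing import Callable, Dict, List


def score_sample(
    qa_list: List[Dict[str, str]],
    ask_vision: Callable[[str], str],
    judge: Callable[[str, str, str], bool],
) -> Dict[str, float]:
    """Run both stages over one sample and return scores as percentages.

    ask_vision(question) -> response            (stage 1, hypothetical callable)
    judge(question, ground_truth, response) -> bool   (stage 2, hypothetical callable)
    """
    # Labels are assumed to be either "editing" or "maintain".
    correct = {"editing": 0, "maintain": 0}
    total = {"editing": 0, "maintain": 0}

    for qa in qa_list:
        label = qa["label"]
        # Stage 1: the vision model answers the question about the edited image.
        response = ask_vision(qa["question"])
        # Stage 2: a text-only judge compares the response with the ground truth.
        is_correct = judge(qa["question"], qa["ground_truth_answer"], response)
        total[label] += 1
        correct[label] += int(is_correct)

    def pct(label: str) -> float:
        # Assumption: a group with no questions scores 0.0.
        return 100.0 * correct[label] / total[label] if total[label] else 0.0

    editing, maintain = pct("editing"), pct("maintain")
    return {
        "editing_accuracy": editing,
        "maintain_accuracy": maintain,
        "weighted_accuracy": 0.9 * editing + 0.1 * maintain,
    }
```

Passing the two stages in as callables keeps the scoring logic testable without a running model server; in practice each callable would wrap a request to the configured OpenAI-compatible endpoint.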
Usage
Use this module when evaluating image editing models on the StructEditBench benchmark. Configure it through environment variables: STRUCTEDITBENCH_API_KEY (use "EMPTY" for local vLLM), STRUCTEDITBENCH_BASE_URL (the API endpoint), STRUCTEDITBENCH_EVAL_MODEL_NAME (the model used for both stages), and, optionally, STRUCTEDITBENCH_JUDGE_MODEL_NAME (a separate judge model). Both inference-time evaluation and post-hoc evaluation of saved edited images are supported.
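For post-hoc evaluation, the saved images have to be wrapped in the same `results` shape that the model would produce at inference time. The sketch below shows one way to do that. `build_posthoc_results` is a hypothetical helper, and the `<key>.png` naming convention for saved images is an assumption, not something the module prescribes.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional


def build_posthoc_results(
    doc: Dict[str, Any], image_dir: str
) -> Optional[List[Dict[str, Any]]]:
    """Wrap a previously saved edited image in a results-style list.

    Assumption: images were saved as <image_dir>/<key>.png, where the key
    comes from the sample's "key" field (falling back to "id").
    """
    key = doc.get("key", doc.get("id"))
    path = Path(image_dir) / f"{key}.png"
    if not path.is_file():
        # No saved image for this sample; the caller decides how to score it.
        return None
    return [{"images": [{"path": str(path)}]}]
```

The returned list can then be passed as the `results` argument of `structeditbench_process_results`, assuming the function accepts the structure shown in the usage examples.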
Code Reference
Source Location
Signature
def structeditbench_doc_to_visual(doc: Dict) -> List[Image.Image]
def structeditbench_doc_to_text(
    doc: Dict,
    lmms_eval_specific_kwargs: Optional[Dict] = None
) -> str
def structeditbench_process_results(
    doc: Dict,
    results: List[Any],
    **kwargs
) -> Dict[str, Dict]
def structeditbench_aggregate_score(results: List[Dict]) -> float
def structeditbench_aggregate_chart(results: List[Dict]) -> float
def structeditbench_aggregate_math(results: List[Dict]) -> float
def structeditbench_aggregate_graph(results: List[Dict]) -> float
def structeditbench_aggregate_puzzle(results: List[Dict]) -> float
def structeditbench_aggregate_science(results: List[Dict]) -> float
def structeditbench_aggregate_table(results: List[Dict]) -> float
def image_to_base64(image: Any) -> Optional[str]
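The aggregation functions all take a list of per-sample result dicts and return a single float. Their bodies are not shown on this page, so the following is only a plausible sketch: it assumes each aggregator averages the `score` field, that per-category aggregators filter on `category`, and that entries with a missing score are skipped. `aggregate_mean` is a hypothetical name.

```python
from typing import Dict, List, Optional


def aggregate_mean(results: List[Dict], category: Optional[str] = None) -> float:
    """Average the "score" field, optionally restricted to one category.

    Assumptions: entries whose score is None are ignored, and an empty
    selection aggregates to 0.0.
    """
    scores = [
        float(r["score"])
        for r in results
        if r.get("score") is not None
        and (category is None or r.get("category") == category)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, the overall aggregator corresponds to `aggregate_mean(results)` and the chart aggregator to `aggregate_mean(results, "chart")`.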
Import
from lmms_eval.tasks.structeditbench.utils import (
    structeditbench_doc_to_visual,
    structeditbench_doc_to_text,
    structeditbench_process_results,
    structeditbench_aggregate_score
)
I/O Contract
Environment Variables
| Variable | Type | Description |
|---|---|---|
| STRUCTEDITBENCH_API_KEY | str | API key (use "EMPTY" for local vLLM servers) |
| STRUCTEDITBENCH_BASE_URL | str | API endpoint URL (e.g., http://localhost:8000/v1) |
| STRUCTEDITBENCH_EVAL_MODEL_NAME | str | Model for both QA and judge stages (default: "default") |
| STRUCTEDITBENCH_JUDGE_MODEL_NAME | str | Optional separate judge model |
| STRUCTEDITBENCH_TIMEOUT | int | API timeout in seconds (default: 180) |
| STRUCTEDITBENCH_MAX_RETRIES | int | Retry count for transient errors (default: 3) |
| STRUCTEDITBENCH_CALL_DELAY | float | Delay between API calls in seconds (default: 0.5) |
| STRUCTEDITBENCH_QA_MAX_TOKENS | int | Max tokens for QA responses (default: 128) |
| STRUCTEDITBENCH_JUDGE_MAX_TOKENS | int | Max tokens for judge responses (default: 16) |
| STRUCTEDITBENCH_MAX_QA | int | Optional cap on qa_list length |
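A small loader makes the defaults in this table concrete. This is a sketch rather than the module's own code: `EvalConfig` and `load_config` are hypothetical names, and falling back to the evaluation model when no judge model is set is inferred from the variable descriptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class EvalConfig:
    api_key: Optional[str]
    base_url: Optional[str]
    eval_model: str
    judge_model: str
    timeout: int
    max_retries: int
    call_delay: float
    qa_max_tokens: int
    judge_max_tokens: int
    max_qa: Optional[int]


def load_config(env: Mapping[str, str] = os.environ) -> EvalConfig:
    """Read STRUCTEDITBENCH_* variables, applying the documented defaults."""
    prefix = "STRUCTEDITBENCH_"
    eval_model = env.get(prefix + "EVAL_MODEL_NAME", "default")
    max_qa = env.get(prefix + "MAX_QA")
    return EvalConfig(
        api_key=env.get(prefix + "API_KEY"),
        base_url=env.get(prefix + "BASE_URL"),
        eval_model=eval_model,
        # Assumption: the judge reuses the evaluation model unless overridden.
        judge_model=env.get(prefix + "JUDGE_MODEL_NAME") or eval_model,
        timeout=int(env.get(prefix + "TIMEOUT", "180")),
        max_retries=int(env.get(prefix + "MAX_RETRIES", "3")),
        call_delay=float(env.get(prefix + "CALL_DELAY", "0.5")),
        qa_max_tokens=int(env.get(prefix + "QA_MAX_TOKENS", "128")),
        judge_max_tokens=int(env.get(prefix + "JUDGE_MAX_TOKENS", "16")),
        # No cap on qa_list length unless the variable is set.
        max_qa=int(max_qa) if max_qa else None,
    )
```

Accepting any mapping instead of reading `os.environ` directly lets the parsing be exercised with a plain dict, for example `load_config({"STRUCTEDITBENCH_TIMEOUT": "300"})`.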
structeditbench_process_results Input
| Parameter | Type | Description |
|---|---|---|
| doc | Dict | Dataset sample with keys: instruction, source_image, qa_list, category, key/id |
| results | List[Any] | Model output with "images" field containing edited image path |
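Locating the edited image inside `results` might look like the sketch below. `extract_edited_image_path` is a hypothetical helper. The `{"path": ...}` entry shape follows the usage example on this page, while accepting a bare string entry is an added assumption.

```python
from typing import Any, List, Optional


def extract_edited_image_path(results: List[Any]) -> Optional[str]:
    """Return the first edited image path found in results, or None."""
    if not results or not isinstance(results[0], dict):
        return None
    images = results[0].get("images") or []
    if not images:
        return None
    first = images[0]
    if isinstance(first, dict):
        # Shape used in the usage example: {"path": "/path/to/edited_image.png"}
        return first.get("path")
    if isinstance(first, str):
        # Assumption: a bare string is treated as the path itself.
        return first
    return None
```

Returning `None` instead of raising lets the caller decide how a sample without an edited image should be scored.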
structeditbench_process_results Output
| Metric Key | Fields | Description |
|---|---|---|
| structeditbench_weighted_accuracy | key, category, score, qa_results, edited_image_path, num_qa | Weighted score (0.9*editing + 0.1*maintain) |
| structeditbench_editing_accuracy | key, category, score | Accuracy on editing verification questions |
| structeditbench_maintain_accuracy | key, category, score | Accuracy on preservation verification questions |
| structeditbench_chart_weighted_accuracy | key, category, score | Per-category weighted score (chart) |
| structeditbench_math_weighted_accuracy | key, category, score | Per-category weighted score (math) |
| structeditbench_graph_weighted_accuracy | key, category, score | Per-category weighted score (graph) |
| structeditbench_puzzle_weighted_accuracy | key, category, score | Per-category weighted score (puzzle) |
| structeditbench_science_weighted_accuracy | key, category, score | Per-category weighted score (science) |
| structeditbench_table_weighted_accuracy | key, category, score | Per-category weighted score (table) |
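The shape of this output can be illustrated by assembling it from already computed accuracies. The sketch is hypothetical throughout: `build_metrics` is not part of the module, and setting the score to `None` for categories the sample does not belong to is an assumption about how per-category entries are populated.

```python
from typing import Any, Dict, List, Optional

CATEGORIES = ("chart", "math", "graph", "puzzle", "science", "table")


def build_metrics(
    key: str,
    category: str,
    editing: float,
    maintain: float,
    qa_results: List[Dict[str, Any]],
    edited_image_path: Optional[str],
) -> Dict[str, Dict[str, Any]]:
    """Assemble the per-sample metric dict from editing/maintain accuracies."""
    weighted = 0.9 * editing + 0.1 * maintain
    base = {"key": key, "category": category}
    metrics: Dict[str, Dict[str, Any]] = {
        "structeditbench_weighted_accuracy": {
            **base,
            "score": weighted,
            "qa_results": qa_results,
            "edited_image_path": edited_image_path,
            "num_qa": len(qa_results),
        },
        "structeditbench_editing_accuracy": {**base, "score": editing},
        "structeditbench_maintain_accuracy": {**base, "score": maintain},
    }
    for cat in CATEGORIES:
        # Assumption: only the sample's own category carries a score.
        metrics[f"structeditbench_{cat}_weighted_accuracy"] = {
            **base,
            "score": weighted if cat == category else None,
        }
    return metrics
```

Every sample yields the same nine keys, so downstream code can look up any metric without first checking the sample's category.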
Usage Examples
import os

from PIL import Image

from lmms_eval.tasks.structeditbench.utils import (
    structeditbench_process_results,
    structeditbench_aggregate_score,
    structeditbench_aggregate_chart,
)

# Configure evaluation server (local vLLM example)
os.environ["STRUCTEDITBENCH_API_KEY"] = "EMPTY"
os.environ["STRUCTEDITBENCH_BASE_URL"] = "http://localhost:8000/v1"
os.environ["STRUCTEDITBENCH_EVAL_MODEL_NAME"] = "Qwen/Qwen2-VL-7B-Instruct"
os.environ["STRUCTEDITBENCH_TIMEOUT"] = "300"

# Load the source image (placeholder path)
pil_image = Image.open("/path/to/source_image.png")

# Process single sample
doc = {
    "key": "chart_001",
    "category": "chart",
    "instruction": "Change the bar color to red",
    "source_image": pil_image,
    "qa_list": [
        {"question": "What color are the bars?", "ground_truth_answer": "red", "label": "editing"},
        {"question": "Are the axis labels preserved?", "ground_truth_answer": "yes", "label": "maintain"},
    ],
}

# Model output with edited image path
results = [{"images": [{"path": "/path/to/edited_image.png"}]}]
scores = structeditbench_process_results(doc, results)
print(f"Weighted Accuracy: {scores['structeditbench_weighted_accuracy']['score']:.2f}%")
print(f"Editing Accuracy: {scores['structeditbench_editing_accuracy']['score']:.2f}%")
print(f"Maintain Accuracy: {scores['structeditbench_maintain_accuracy']['score']:.2f}%")

# Aggregate across dataset
all_results = [
    {"key": "001", "category": "chart", "score": 85.0},
    {"key": "002", "category": "chart", "score": 90.0},
    {"key": "003", "category": "math", "score": 75.0},
]
overall_score = structeditbench_aggregate_score(all_results)
chart_score = structeditbench_aggregate_chart(all_results)
print(f"Overall: {overall_score:.2f}%, Chart: {chart_score:.2f}%")

# Use OpenAI API instead of local server
os.environ["STRUCTEDITBENCH_API_KEY"] = "sk-..."
os.environ["STRUCTEDITBENCH_BASE_URL"] = "https://api.openai.com/v1"
os.environ["STRUCTEDITBENCH_EVAL_MODEL_NAME"] = "gpt-4o"
os.environ["STRUCTEDITBENCH_JUDGE_MODEL_NAME"] = "gpt-4o-mini"  # Use cheaper model for judging
Related Pages