Implementation:EvolvingLMMs Lab Lmms eval Regression Testing
File: `tools/regression.py`
Principle: Regression_Testing_Principle
Overview
The Regression Testing tool automates performance comparison across git branches for vision-language models. It evaluates models on specified tasks, compares results between branches, and generates markdown tables showing performance differences and runtime comparisons.
Configuration
Model and Task Setup
model_types = ["llava_onevision"]
vision_models = [
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
]
single_image_tasks = ["ocrbench", "mmmu_val", "ai2d"]
multi_image_tasks = ["muirbench"]
video_tasks = ["videomme"]
task_names = single_image_tasks + multi_image_tasks + video_tasks
Default configuration includes:
- Single model type (llava_onevision)
- One test model
- Mix of single-image, multi-image, and video tasks
Argument Parsing
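def parse_args():
    parser = argparse.ArgumentParser()
    # --branches, --models and --tasks accept comma-separated strings; main() splits them.
    parser.add_argument("--branches", default=[])
    parser.add_argument("--models", default=vision_models)
    parser.add_argument("--tasks", default=task_names)
    # NOTE: type=bool converts any non-empty string (including "False") to True.
    parser.add_argument("--acc_norm", type=bool, default=False)
    parser.add_argument("--perplexity", default=None)
    parser.add_argument("--num_fewshot", type=int, default=0)
    # type=float applies only to values given on the command line; the default stays the int 8.
    parser.add_argument("--limit", type=float, default=8)
    # Parsed but not read by eval_models, which takes the model type from model_types.
    parser.add_argument("--model", default="llava_onevision")
    parser.add_argument("--model_args", default="conv_template=qwen_1_5,model_name=llava_qwen")
    parser.add_argument("--batch_size", default="1")
    return parser.parse_args()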
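One caveat with the parser above: `type=bool` calls `bool()` on the raw argument string, so `--acc_norm False` still parses as `True`. The following is a minimal sketch of that pitfall next to a common alternative, `action="store_true"`. The parser variables are hypothetical names used only for illustration and are not part of `regression.py`.

```python
import argparse

# Parser mirroring the flag as it is defined in parse_args().
as_written = argparse.ArgumentParser()
as_written.add_argument("--acc_norm", type=bool, default=False)

# Hypothetical alternative: a presence-only switch.
as_switch = argparse.ArgumentParser()
as_switch.add_argument("--acc_norm", action="store_true")

# bool("False") is True because the string is non-empty.
parsed_written = as_written.parse_args(["--acc_norm", "False"])
parsed_switch_on = as_switch.parse_args(["--acc_norm"])
parsed_switch_off = as_switch.parse_args([])

print(parsed_written.acc_norm)     # True
print(parsed_switch_on.acc_norm)   # True
print(parsed_switch_off.acc_norm)  # False
```

With the switch form, the flag is enabled by passing `--acc_norm` alone and disabled by omitting it.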
Core Functions
Model Evaluation
def eval_models(args, branch=None):
    # Switch branches only when one is given; a failed checkout yields empty results.
    if branch is not None:
        if os.system(f"git checkout {branch}") != 0:
            return {}, 0
    branch = branch or initial_branch
    start_time = time.time()
    results = {}
    for indx, model in enumerate(args.models):
        # model_types must have an entry at the same index as each model.
        model_type = model_types[indx]
        model_args = f"pretrained={model},{args.model_args}"
        tasks = args.tasks
        batch_size = args.batch_size
        # One output directory per run and branch; slashes in branch names become underscores.
        output_path = f"logs/regression_test/{int(start_time)}-{branch.replace('/', '_')}"
        original_dir = os.getcwd()
        # Run the evaluation from the project root (the parent of tools/).
        os.chdir(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
        command = (
            f"python3 -m accelerate.commands.launch --main_process_port=12580 --num_processes=8 lmms_eval "
            f"--model {model_type} --model_args {model_args} --tasks {','.join(tasks)} "
            f"--num_fewshot {args.num_fewshot}{'' if args.limit is None else f' --limit {args.limit}'} "
            f"--batch_size {batch_size} --output_path {output_path}"
        )
        print(f"{'=' * 80}\nEvaluating {model} on {', '.join(tasks)} at {branch} with:\n\n{command}\n{'=' * 80}")
        ret = os.system(command)
        os.chdir(original_dir)
        # output_path is relative, so this lookup resolves against original_dir.
        json_file_path = find_json_file(output_path)
        if json_file_path and ret == 0:
            with open(json_file_path, encoding="utf-8") as f:
                results[model] = json.load(f)
        else:
            # A failed run or missing file is recorded as an empty result set.
            results[model] = {"results": {}}
    end_time = time.time()
    return results, end_time - start_time
The evaluation function:
1. Checks out the specified branch if provided
2. Tracks timing for runtime comparison
3. Constructs accelerate launch command with 8 processes
4. Changes to project root directory before running
5. Locates and loads result JSON files
6. Returns results dict and runtime
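The command assembled in step 3 can be reproduced in isolation to inspect what would be launched. This is a sketch only: `build_command` is a hypothetical helper that copies the string construction from `eval_models`, and the argument values are the defaults listed in the configuration above.

```python
def build_command(model_type, model, extra_model_args, tasks,
                  num_fewshot, limit, batch_size, output_path):
    """Return the launch command the way eval_models assembles it (hypothetical helper)."""
    model_args = f"pretrained={model},{extra_model_args}"
    # --limit is omitted entirely when the limit is None.
    limit_part = "" if limit is None else f" --limit {limit}"
    return (
        "python3 -m accelerate.commands.launch --main_process_port=12580 --num_processes=8 lmms_eval "
        f"--model {model_type} --model_args {model_args} --tasks {','.join(tasks)} "
        f"--num_fewshot {num_fewshot}{limit_part} "
        f"--batch_size {batch_size} --output_path {output_path}"
    )


command = build_command(
    "llava_onevision",
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    "conv_template=qwen_1_5,model_name=llava_qwen",
    ["ocrbench", "ai2d"],
    0, 8.0, "1", "logs/regression_test/0-main",
)
print(command)
```

Printing the command before running it makes quoting or flag problems visible without starting an evaluation.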
Result Extraction
def extract_value(args, results, model, task, err=False):
    if model not in results:
        return 0
    results = results[model]["results"]
    if task not in results:
        return 0
    results = results[task]
    if task == "ai2d":
        return results["exact_match,flexible-extract"]
    elif task == "mmmu_val":
        return results["mmmu_acc,none"]
    elif task == "ocrbench":
        return results["ocrbench_accuracy,none"]
    elif task == "videomme":
        return results["videomme_perception_score,none"]
    elif task == "muirbench":
        return results["muirbench_score_overall,flexible-extract"]
    return 0
Task-specific metric extraction maps each supported task to its primary metric key. A missing model, a missing task, or a task without a mapping yields 0.
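The same mapping can be expressed as a lookup table, which keeps the task-to-metric pairs in one place. This is a sketch under stated assumptions: `PRIMARY_METRIC` and `extract_primary` are hypothetical names, the metric keys are copied from `extract_value`, and the sample result below is made-up data for illustration.

```python
# Task name -> primary metric key, copied from extract_value.
PRIMARY_METRIC = {
    "ai2d": "exact_match,flexible-extract",
    "mmmu_val": "mmmu_acc,none",
    "ocrbench": "ocrbench_accuracy,none",
    "videomme": "videomme_perception_score,none",
    "muirbench": "muirbench_score_overall,flexible-extract",
}


def extract_primary(results, model, task):
    """Return the primary metric for a task, or 0 when it is unavailable (hypothetical helper)."""
    task_results = results.get(model, {}).get("results", {}).get(task)
    metric = PRIMARY_METRIC.get(task)
    if task_results is None or metric is None:
        return 0
    # Like the original, a missing metric key raises KeyError.
    return task_results[metric]


# Made-up result structure for illustration only.
sample = {"some-model": {"results": {"ai2d": {"exact_match,flexible-extract": 0.5}}}}
print(extract_primary(sample, "some-model", "ai2d"))      # 0.5
print(extract_primary(sample, "some-model", "videomme"))  # 0
print(extract_primary(sample, "other-model", "ai2d"))     # 0
```

Supporting a new task then means adding one dictionary entry instead of another `elif` branch.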
Formatting Functions
def format_value(args, results, model, task):
    val = 100 * extract_value(args, results, model, task)
    err = 100 * extract_value(args, results, model, task, err=True)
    return f"{val:.2f}{f' ± {err:.2f}' if err != 0 else ''}"


def format_diff(args, results1, results2, model, task):
    val1 = 100 * extract_value(args, results1, model, task)
    val2 = 100 * extract_value(args, results2, model, task)
    diff = val2 - val1
    return f"**+{diff:.2f}**" if diff > 0 else f"{diff:.2f}"
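Formatting utilities:
- Convert fractional scores to percentages
- Append an error margin when it is nonzero (as written, `extract_value` does not read its `err` argument, so the margin repeats the score)
- Highlight positive differences with bold markdown; zero and negative differences are printed plain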
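A worked example of the difference formatting, using plain numbers instead of result dictionaries. `format_delta` is a hypothetical stand-in that applies the same arithmetic and formatting as `format_diff`; the input scores are arbitrary.

```python
def format_delta(baseline, candidate):
    """Percentage-point difference, bolded when positive (mirrors format_diff)."""
    diff = 100 * candidate - 100 * baseline
    return f"**+{diff:.2f}**" if diff > 0 else f"{diff:.2f}"


print(format_delta(0.25, 0.5))  # **+25.00**
print(format_delta(0.5, 0.25))  # -25.00
print(format_delta(0.5, 0.5))   # 0.00
```

Only improvements are bolded, so regressions appear as plain negative numbers in the table.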
JSON File Discovery
def find_json_file(base_path):
    pattern = os.path.join(base_path, "**", "*_results.json")
    json_files = glob.glob(pattern, recursive=True)
    return json_files[0] if json_files else None
Recursively searches the output directory for files matching `*_results.json` and returns the first match, or `None` if there is none.
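`glob.glob` returns matches in arbitrary order, so `json_files[0]` is not guaranteed to be the newest file when several match (for example, when several models write into the same output directory). One possible variant, sketched here under the assumption that the most recently modified file is the one wanted, selects by modification time. `find_latest_json_file` is a hypothetical name, and the temporary files exist only to exercise it.

```python
import glob
import os
import tempfile


def find_latest_json_file(base_path):
    """Hypothetical variant of find_json_file: newest match instead of the first."""
    pattern = os.path.join(base_path, "**", "*_results.json")
    json_files = glob.glob(pattern, recursive=True)
    return max(json_files, key=os.path.getmtime) if json_files else None


with tempfile.TemporaryDirectory() as base:
    nested = os.path.join(base, "run_a", "model")
    os.makedirs(nested)
    older = os.path.join(nested, "old_results.json")
    newer = os.path.join(base, "new_results.json")
    for path, mtime in ((older, 1_000), (newer, 2_000)):
        with open(path, "w", encoding="utf-8") as f:
            f.write("{}")
        # Set explicit timestamps so the ordering is deterministic.
        os.utime(path, (mtime, mtime))
    latest_name = os.path.basename(find_latest_json_file(base))
    missing = find_latest_json_file(os.path.join(base, "does_not_exist"))

print(latest_name)  # new_results.json
print(missing)      # None
```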
Main Workflow
def main():
    args = parse_args()
    args.branches = args.branches.split(",") if isinstance(args.branches, str) else args.branches
    args.models = args.models.split(",") if isinstance(args.models, str) else args.models
    args.tasks = ALL_TASKS if args.tasks == "all_tasks" else utils.pattern_match(args.tasks.split(","), ALL_TASKS) if isinstance(args.tasks, str) else args.tasks
    global initial_branch
    initial_branch = subprocess.check_output("git branch --show-current", shell=True).decode("ascii").strip()
    results, runtime = eval_models(args)
    print(results, runtime)
    runs = []
    for branch in args.branches:
        runs.append((branch, *eval_models(args, branch)))
    os.system(f"git checkout {initial_branch}")
    print("")
    print(f"|task|{'|'.join(map(lambda model: Path(model).name, args.models))}|")
    print(f"|--|{'--|' * len(args.models)}")
    for task in args.tasks:
        print(f"|{task} ({initial_branch})|{'|'.join(map(lambda model: format_value(args, results, model, task), args.models))}|")
        for branch, branch_results, branch_runtime in runs:
            print(f"|{task} ({branch})|{'|'.join(map(lambda model: format_value(args, branch_results, model, task), args.models))}|")
            print(f"|{task} (diff)|{'|'.join(map(lambda model: format_diff(args, results, branch_results, model, task), args.models))}|")
    print("")
    print("|branch|runtime|%|")
    print("|--|--|--|")
    print(f"|{initial_branch}|{runtime:.1f}s|100%|")
    for branch, _, branch_runtime in runs:
        print(f"|{branch}|{branch_runtime:.1f}s|{100 * branch_runtime / runtime:.2f}%|")


if __name__ == "__main__":
    main()
The main workflow:
1. Parses command-line arguments
2. Normalizes branches, models, and tasks (supports comma-separated lists and "all_tasks")
3. Captures the initial git branch
4. Runs evaluation on current branch (baseline)
5. Iterates through comparison branches, running evaluations on each
6. Returns to initial branch
7. Generates two markdown tables:
   * Performance table: Shows scores for each task on each branch and differences
   * Runtime table: Compares execution time across branches
Output Format
Performance Table
```
|task|model_name|
|--|--|
|ai2d (main)|45.23|
|ai2d (feature-branch)|46.15|
|ai2d (diff)|**+0.92**|
|mmmu_val (main)|38.50|
|mmmu_val (feature-branch)|38.50|
|mmmu_val (diff)|0.00|
```
Runtime Table
```
|branch|runtime|%|
|--|--|--|
|main|1234.5s|100%|
|feature-branch|1189.2s|96.33%|
```
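The percentage column is each branch's runtime relative to the baseline run. A minimal sketch of that arithmetic follows; `runtime_table` is a hypothetical helper that returns the rows instead of printing them, and the runtimes are arbitrary illustration values.

```python
def runtime_table(initial_branch, runtime, runs):
    """Build the runtime table rows as main() prints them (hypothetical helper)."""
    lines = [
        "|branch|runtime|%|",
        "|--|--|--|",
        f"|{initial_branch}|{runtime:.1f}s|100%|",
    ]
    for branch, branch_runtime in runs:
        # Runtime relative to the baseline, as a percentage.
        lines.append(f"|{branch}|{branch_runtime:.1f}s|{100 * branch_runtime / runtime:.2f}%|")
    return lines


table = runtime_table("main", 200.0, [("feature-branch", 150.0)])
print("\n".join(table))
```

A value below 100% means the branch ran faster than the baseline, and a value above 100% means it ran slower.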
Design Patterns
Git Branch Testing
The tool automatically switches between branches to evaluate code changes, making it suitable for pre-merge regression testing.
Multi-Process Evaluation
Launches `lmms_eval` through accelerate with a hard-coded 8 processes (`--num_processes=8`), so each evaluation is spread across processes instead of running serially.
Baseline Comparison
Always evaluates the current branch first as the baseline, then compares other branches against it.
Markdown Output
Generates markdown tables suitable for GitHub issues/PRs, with bold formatting for positive differences.
Task-Specific Metrics
The extraction logic maps each task to its own metric key, so the formatting and table code can treat every task the same way.
Automatic Cleanup
Checks out the initial branch after all comparison runs complete. The checkout is not guarded, so an exception during a run leaves the repository on whichever branch was checked out last.
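Because the final `git checkout` in `main()` runs only after every evaluation has finished, a `try`/`finally` guard is one way to make the return to the initial branch unconditional. The sketch below is not part of `regression.py`: `on_branch` is a hypothetical context manager, and the checkout function is injected so the example runs without a git repository.

```python
from contextlib import contextmanager


@contextmanager
def on_branch(branch, initial_branch, checkout):
    """Check out `branch`, then always return to `initial_branch` (hypothetical helper)."""
    checkout(branch)
    try:
        yield
    finally:
        # Runs even when the body raises.
        checkout(initial_branch)


# Record checkouts instead of calling git, so the sketch runs anywhere.
visited = []
try:
    with on_branch("feature-branch", "main", visited.append):
        raise RuntimeError("evaluation failed")
except RuntimeError:
    pass

print(visited)  # ['feature-branch', 'main']
```

In real use, the injected function could wrap a call such as `subprocess.run(["git", "checkout", branch], check=True)`.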
Usage Example
# Compare current branch against feature-branch
python tools/regression.py --branches feature-branch
# Test multiple branches
python tools/regression.py --branches feature-1,feature-2,feature-3
# Custom model and tasks
python tools/regression.py \
    --branches optimize-inference \
    --models lmms-lab/llava-onevision-qwen2-7b-ov \
    --tasks ocrbench,mmmu_val \
    --limit 100
Evaluation Command
The constructed command uses:
- accelerate: Distributed evaluation framework
- 8 processes: Parallel inference
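- Port 12580: Fixed main-process port (`--main_process_port=12580`); because it is hard-coded, two regression runs on the same host at the same time would compete for it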
- Output path: Timestamped logs in `logs/regression_test/`
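The output directory name combines the run's start time with the branch name, with slashes replaced so the branch maps to a single path component. A small sketch of that construction: `output_path_for` is a hypothetical helper, and the timestamp and branch name are arbitrary.

```python
def output_path_for(start_time, branch):
    """Build the log directory the way eval_models does (hypothetical helper)."""
    return f"logs/regression_test/{int(start_time)}-{branch.replace('/', '_')}"


path = output_path_for(1700000000.9, "feature/fast-path")
print(path)  # logs/regression_test/1700000000-feature_fast-path
```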
Limitations
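- Model indexing: Assumes `model_types[indx]` aligns with `args.models[indx]`; with the single default entry in `model_types`, passing more than one model raises an `IndexError`
- Single metric per task: Only extracts one primary metric per task
- No error handling: A failed checkout or evaluation yields empty results, which the tables report as a score of 0.00 instead of an error
- In-place branch switching: Runs `git checkout` in the working repository, so the working tree changes while the test runs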
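The model indexing limitation can be surfaced early by checking the two lists before any evaluation starts. This is one possible guard, not something the script does: `pair_models_with_types` is a hypothetical helper, and the second model name in the failing case is a placeholder.

```python
def pair_models_with_types(models, model_types):
    """Pair each model with its type, failing fast on a length mismatch (hypothetical helper)."""
    if len(models) != len(model_types):
        raise ValueError(
            f"expected one model type per model, got {len(models)} models "
            f"and {len(model_types)} model types"
        )
    return list(zip(model_types, models))


pairs = pair_models_with_types(
    ["lmms-lab/llava-onevision-qwen2-0.5b-ov"], ["llava_onevision"]
)
print(pairs)

try:
    pair_models_with_types(["model-a", "model-b"], ["llava_onevision"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```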
Dependencies
- argparse: Command-line parsing
- glob: File pattern matching
- json: Result file parsing
- os: System commands and file operations
- subprocess: Git command execution
- time: Runtime measurement
- pathlib: Path manipulation
- lmms_eval: Task registry and utilities
Related Components
- Principle: Regression_Testing_Principle
- Evaluation entrypoint: `lmms_eval/__main__.py`
- Task registry: `lmms_eval/api/registry.py`
Attribution
The code is adapted from EleutherAI's LM Evaluation Harness regression testing script.