Implementation:EvolvingLMMs Lab Lmms eval Regression Testing

File: `tools/regression.py`

Principle: Regression_Testing_Principle

Overview

The Regression Testing tool automates performance comparison across git branches for vision-language models. It evaluates models on specified tasks, compares results between branches, and generates markdown tables showing performance differences and runtime comparisons.

Configuration

Model and Task Setup

model_types = ["llava_onevision"]
vision_models = [
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
]

single_image_tasks = ["ocrbench", "mmmu_val", "ai2d"]
multi_image_tasks = ["muirbench"]
video_tasks = ["videomme"]
task_names = single_image_tasks + multi_image_tasks + video_tasks

Default configuration includes:

Single model type (llava_onevision)
One test model
Mix of single-image, multi-image, and video tasks

Argument Parsing

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--branches", default=[])
    parser.add_argument("--models", default=vision_models)
    parser.add_argument("--tasks", default=task_names)
    parser.add_argument("--acc_norm", type=bool, default=False)
    parser.add_argument("--perplexity", default=None)
    parser.add_argument("--num_fewshot", type=int, default=0)
    parser.add_argument("--limit", type=float, default=8)
    parser.add_argument("--model", default="llava_onevision")
    parser.add_argument("--model_args", default="conv_template=qwen_1_5,model_name=llava_qwen")
    parser.add_argument("--batch_size", default="1")
    return parser.parse_args()
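One detail in the parser above deserves a check: `type=bool` applies Python's `bool()` to the raw string, so any non-empty value, including `"False"`, parses as `True`. The sketch below demonstrates that behavior and shows one common alternative, `action="store_true"`. The second parser is illustrative only and is not part of `regression.py`.

```python
import argparse

# Parser mirroring the original flag definition.
original = argparse.ArgumentParser()
original.add_argument("--acc_norm", type=bool, default=False)

# bool("False") is True because the string is non-empty.
parsed_original = original.parse_args(["--acc_norm", "False"])

# Illustrative alternative: a presence flag that takes no value.
alternative = argparse.ArgumentParser()
alternative.add_argument("--acc_norm", action="store_true")

parsed_absent = alternative.parse_args([])
parsed_present = alternative.parse_args(["--acc_norm"])

print(parsed_original.acc_norm, parsed_absent.acc_norm, parsed_present.acc_norm)
```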

Core Functions

Model Evaluation

def eval_models(args, branch=None):
    if branch is not None:
        if os.system(f"git checkout {branch}") != 0:
            return {}, 0

    branch = branch or initial_branch

    start_time = time.time()

    results = {}

    for indx, model in enumerate(args.models):
        model_type = model_types[indx]
        model_args = f"pretrained={model},{args.model_args}"
        tasks = args.tasks
        batch_size = args.batch_size
        output_path = f"logs/regression_test/{int(start_time)}-{branch.replace('/', '_')}"

        original_dir = os.getcwd()
        os.chdir(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

        command = (
            f"python3 -m accelerate.commands.launch --main_process_port=12580 --num_processes=8 lmms_eval "
            f"--model {model_type} --model_args {model_args} --tasks {','.join(tasks)} "
            f"--num_fewshot {args.num_fewshot}{'' if args.limit is None else f' --limit {args.limit}'} "
            f"--batch_size {batch_size} --output_path {output_path}"
        )

        print(f"{'=' * 80}\nEvaluating {model} on {', '.join(tasks)} at {branch} with:\n\n{command}\n{'=' * 80}")

        ret = os.system(command)
        os.chdir(original_dir)

        json_file_path = find_json_file(output_path)

        if json_file_path and ret == 0:
            with open(json_file_path, encoding="utf-8") as f:
                results[model] = json.load(f)
        else:
            results[model] = {"results": {}}

    end_time = time.time()

    return results, end_time - start_time

The evaluation function:

1. Checks out the specified branch if provided
2. Tracks timing for runtime comparison
3. Constructs the accelerate launch command with 8 processes
4. Changes to the project root directory before running
5. Locates and loads result JSON files
6. Returns the results dict and runtime
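The command assembly in `eval_models` could also be expressed as an argument list, which sidesteps shell quoting when a model name or output path contains unusual characters. This is a sketch under assumptions: `build_command` is a hypothetical helper, not a function in `regression.py`, and it reuses the flags from the command string shown above.

```python
def build_command(model_type, model_args, tasks, num_fewshot, limit, batch_size, output_path):
    """Return the evaluation command as an argument list (hypothetical helper)."""
    command = [
        "python3", "-m", "accelerate.commands.launch",
        "--main_process_port=12580", "--num_processes=8",
        "lmms_eval",
        "--model", model_type,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
    ]
    if limit is not None:
        # Mirror the original: --limit is omitted entirely when unset.
        command += ["--limit", str(limit)]
    command += ["--batch_size", str(batch_size), "--output_path", output_path]
    return command


example = build_command(
    model_type="llava_onevision",
    model_args="pretrained=lmms-lab/llava-onevision-qwen2-0.5b-ov,conv_template=qwen_1_5,model_name=llava_qwen",
    tasks=["ocrbench", "ai2d"],
    num_fewshot=0,
    limit=8.0,
    batch_size="1",
    output_path="logs/regression_test/example",
)
print(" ".join(example))
```

A list like this could be passed to `subprocess.run(command, cwd=project_root)`, where `project_root` stands for the directory the script currently reaches with `os.chdir`.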

Result Extraction

def extract_value(args, results, model, task, err=False):
    if model not in results:
        return 0
    results = results[model]["results"]
    if task not in results:
        return 0
    results = results[task]
    if task == "ai2d":
        return results["exact_match,flexible-extract"]
    elif task == "mmmu_val":
        return results["mmmu_acc,none"]
    elif task == "ocrbench":
        return results["ocrbench_accuracy,none"]
    elif task == "videomme":
        return results["videomme_perception_score,none"]
    elif task == "muirbench":
        return results["muirbench_score_overall,flexible-extract"]
    return 0

Task-specific metric extraction maps each task to its primary metric key. A missing model, a missing task, or a task without a mapping yields 0.
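The same mapping can be written as a lookup table, which keeps the task-to-metric pairs in one place. A minimal sketch: `PRIMARY_METRIC` and `extract_primary` are hypothetical names, the metric keys are copied from `extract_value` above, and the sample results dict is invented for illustration.

```python
# Metric keys copied from extract_value above.
PRIMARY_METRIC = {
    "ai2d": "exact_match,flexible-extract",
    "mmmu_val": "mmmu_acc,none",
    "ocrbench": "ocrbench_accuracy,none",
    "videomme": "videomme_perception_score,none",
    "muirbench": "muirbench_score_overall,flexible-extract",
}


def extract_primary(results, model, task):
    """Return the primary metric for (model, task), or 0 when unavailable."""
    task_results = results.get(model, {}).get("results", {}).get(task)
    metric_key = PRIMARY_METRIC.get(task)
    if task_results is None or metric_key is None:
        return 0
    return task_results[metric_key]


# Invented sample data in the shape the script loads from *_results.json.
sample = {"demo-model": {"results": {"ai2d": {"exact_match,flexible-extract": 0.4523}}}}
print(extract_primary(sample, "demo-model", "ai2d"))
```

Adding a task then means adding one dictionary entry instead of another `elif` branch.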

Formatting Functions

def format_value(args, results, model, task):
    val = 100 * extract_value(args, results, model, task)
    err = 100 * extract_value(args, results, model, task, err=True)
    return f"{val:.2f}{f' ± {err:.2f}' if err != 0 else ''}"

def format_diff(args, results1, results2, model, task):
    val1 = 100 * extract_value(args, results1, model, task)
    val2 = 100 * extract_value(args, results2, model, task)
    diff = val2 - val1
    return f"**+{diff:.2f}**" if diff > 0 else f"{diff:.2f}"

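Formatting utilities:

Convert values to percentages
Append a ± margin when the error term is non-zero; as written, `extract_value` ignores its `err` argument, so the error term equals the score itself
Bold positive differences in markdown; zero and negative differences are printed plain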
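To make the formatting rules concrete, here is a small sketch that applies the same f-string logic to plain numbers. `format_score` and `format_delta` are hypothetical stand-ins that take values directly instead of the `args`/`results` arguments used by the real functions.

```python
def format_score(value, err=0.0):
    """Format a fraction as a percentage, adding a margin when err is non-zero."""
    val = 100 * value
    margin = 100 * err
    return f"{val:.2f}{f' ± {margin:.2f}' if margin != 0 else ''}"


def format_delta(baseline, candidate):
    """Format the percentage-point difference; bold it only when positive."""
    diff = 100 * candidate - 100 * baseline
    return f"**+{diff:.2f}**" if diff > 0 else f"{diff:.2f}"


print(format_score(0.4523))
print(format_score(0.385, err=0.012))
print(format_delta(0.4523, 0.4615))
print(format_delta(0.5, 0.4))
```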

JSON File Discovery

def find_json_file(base_path):
    pattern = os.path.join(base_path, "**", "*_results.json")
    json_files = glob.glob(pattern, recursive=True)
    return json_files[0] if json_files else None

Recursively searches for result JSON files in the output directory.
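`glob.glob` returns matches in arbitrary order, so `json_files[0]` is only well defined when the directory holds a single result file. Because the output path includes a timestamp, that is probably the usual case. If several files can exist, one option is to pick the most recently modified one. In this sketch, `find_latest_json_file` is a hypothetical variant, not part of the script.

```python
import glob
import os


def find_latest_json_file(base_path):
    """Return the most recently modified *_results.json under base_path, or None."""
    pattern = os.path.join(base_path, "**", "*_results.json")
    json_files = glob.glob(pattern, recursive=True)
    # max() with getmtime picks the newest file regardless of glob ordering.
    return max(json_files, key=os.path.getmtime) if json_files else None
```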

Main Workflow

def main():
    args = parse_args()

    args.branches = args.branches.split(",") if isinstance(args.branches, str) else args.branches
    args.models = args.models.split(",") if isinstance(args.models, str) else args.models
    args.tasks = ALL_TASKS if args.tasks == "all_tasks" else utils.pattern_match(args.tasks.split(","), ALL_TASKS) if isinstance(args.tasks, str) else args.tasks

    global initial_branch
    initial_branch = subprocess.check_output("git branch --show-current", shell=True).decode("ascii").strip()

    results, runtime = eval_models(args)
    print(results, runtime)

    runs = []
    for branch in args.branches:
        runs.append((branch, *eval_models(args, branch)))

    os.system(f"git checkout {initial_branch}")

    print("")
    print(f"|task|{'|'.join(map(lambda model: Path(model).name, args.models))}|")
    print(f"|--|{'--|' * len(args.models)}")
    for task in args.tasks:
        print(f"|{task} ({initial_branch})|{'|'.join(map(lambda model: format_value(args, results, model, task), args.models))}|")
        for branch, branch_results, branch_runtime in runs:
            print(f"|{task} ({branch})|{'|'.join(map(lambda model: format_value(args, branch_results, model, task), args.models))}|")
            print(f"|{task} (diff)|{'|'.join(map(lambda model: format_diff(args, results, branch_results, model, task), args.models))}|")

    print("")
    print("|branch|runtime|%|")
    print("|--|--|--|")
    print(f"|{initial_branch}|{runtime:.1f}s|100%|")
    for branch, _, branch_runtime in runs:
        print(f"|{branch}|{branch_runtime:.1f}s|{100 * branch_runtime / runtime:.2f}%|")


if __name__ == "__main__":
    main()

The main workflow:

1. Parses command-line arguments
2. Normalizes branches, models, and tasks (supports comma-separated lists and "all_tasks")
3. Captures the initial git branch
4. Runs evaluation on the current branch (baseline)
5. Iterates through comparison branches, running evaluations on each
6. Returns to the initial branch
7. Generates two markdown tables:

  * Performance table: Shows scores for each task on each branch and differences
  * Runtime table: Compares execution time across branches
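The normalization step of the workflow can be sketched in isolation. The helpers `normalize_list` and `select_tasks` are hypothetical, and `fnmatch.filter` is used as a stand-in for `utils.pattern_match`, on the assumption that the latter performs shell-style wildcard matching against the task registry.

```python
import fnmatch


def normalize_list(value):
    """Split a comma-separated string into a list; pass lists through unchanged."""
    return value.split(",") if isinstance(value, str) else value


def select_tasks(value, all_tasks):
    """Resolve a --tasks value against the known task names."""
    if value == "all_tasks":
        return list(all_tasks)
    if isinstance(value, str):
        matched = set()
        for pattern in value.split(","):
            # Assumed wildcard semantics; the real matcher may differ.
            matched.update(fnmatch.filter(all_tasks, pattern))
        return sorted(matched)
    return value


known = ["ai2d", "mmmu_val", "ocrbench"]
print(normalize_list("feature-1,feature-2"))
print(select_tasks("mm*,ai2d", known))
```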

Output Format

Performance Table

```
|task|model_name|
|--|--|
|ai2d (main)|45.23|
|ai2d (feature-branch)|46.15|
|ai2d (diff)|**+0.92**|
|mmmu_val (main)|38.50|
|mmmu_val (feature-branch)|38.50|
|mmmu_val (diff)|0.00|
```

Runtime Table

```
|branch|runtime|%|
|--|--|--|
|main|1234.5s|100%|
|feature-branch|1189.2s|96.33%|
```
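The percentage column is each branch's runtime relative to the baseline, `100 * branch_runtime / runtime`. The sketch below reproduces the runtime table from plain numbers. `render_runtime_table` is a hypothetical helper that returns the table as a string, where the script itself prints row by row.

```python
def render_runtime_table(baseline_name, baseline_runtime, runs):
    """Build the runtime table; runs is a list of (branch, runtime_seconds)."""
    lines = [
        "|branch|runtime|%|",
        "|--|--|--|",
        f"|{baseline_name}|{baseline_runtime:.1f}s|100%|",
    ]
    for branch, runtime in runs:
        # Percentage of the baseline runtime, as in the script's final loop.
        lines.append(f"|{branch}|{runtime:.1f}s|{100 * runtime / baseline_runtime:.2f}%|")
    return "\n".join(lines)


table = render_runtime_table("main", 1234.5, [("feature-branch", 1189.2)])
print(table)
```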

Design Patterns

Git Branch Testing

The tool automatically switches between branches to evaluate code changes, making it suitable for pre-merge regression testing.

Multi-Process Evaluation

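Launches the evaluation through accelerate with `--num_processes=8`, so the work is spread across eight processes. The process count is hard-coded in the command string and cannot be changed from the command line.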

Baseline Comparison

Always evaluates the current branch first as the baseline, then compares other branches against it.

Markdown Output

Generates markdown tables suitable for GitHub issues/PRs, with bold formatting for positive differences.

Task-Specific Metrics

The extraction logic handles different metric names for different tasks, abstracting away the heterogeneity.

Automatic Cleanup

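Checks out the initial branch once all evaluations have finished. The checkout is not wrapped in `try`/`finally`, so an exception raised mid-run leaves the repository on whichever branch was being tested.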
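One way to make the return to the starting branch robust against exceptions is a context manager. This is a sketch, not the script's approach: `git_current_branch`, `git_checkout`, and `restore_branch` are hypothetical names, and the two git helpers are injectable so the logic can be exercised without a repository.

```python
import contextlib
import subprocess


def git_current_branch():
    """Return the name of the checked-out branch."""
    return subprocess.check_output(["git", "branch", "--show-current"], text=True).strip()


def git_checkout(branch):
    """Check out the given branch, raising CalledProcessError on failure."""
    subprocess.run(["git", "checkout", branch], check=True)


@contextlib.contextmanager
def restore_branch(current=git_current_branch, checkout=git_checkout):
    """Restore the starting branch on exit, even if the body raises."""
    start = current()
    try:
        yield start
    finally:
        checkout(start)
```

Wrapping the branch loop in `with restore_branch():` would then return to the starting branch whether or not an evaluation raises.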

Usage Example

# Compare current branch against feature-branch
python tools/regression.py --branches feature-branch

# Test multiple branches
python tools/regression.py --branches feature-1,feature-2,feature-3

# Custom model and tasks
python tools/regression.py \
    --branches optimize-inference \
    --models lmms-lab/llava-onevision-qwen2-7b-ov \
    --tasks ocrbench,mmmu_val \
    --limit 100

Evaluation Command

The constructed command uses:

accelerate: Distributed evaluation framework
8 processes: Parallel inference
Port 12580: Fixed `--main_process_port` value; two regression runs started at the same time on one machine would contend for it
Output path: Timestamped logs in `logs/regression_test/`

Limitations

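Model indexing: Assumes `model_types[indx]` aligns with `args.models[indx]`; `model_types` is hard-coded with one entry, so passing more than one model raises `IndexError`
Single metric per task: Only extracts one primary metric per task
Silent failures: A failed checkout or evaluation yields empty results, which the tables report as 0.00 instead of raising an error
In-place branch switching: `git checkout` changes the working tree, so uncommitted changes can block the switch or carry over between branches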
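The model indexing limitation could be handled by validating the pairing up front. A minimal sketch, assuming that a single model type should apply to every model; both that policy and the helper name `pair_models` are assumptions and not behavior of `regression.py`.

```python
def pair_models(models, model_types):
    """Pair each model with a model type, reusing a single type for all models."""
    if len(model_types) == 1:
        # Assumed policy: one type applies to every model.
        model_types = model_types * len(models)
    if len(model_types) != len(models):
        raise ValueError(
            f"expected {len(models)} model types, got {len(model_types)}"
        )
    return list(zip(model_types, models))


pairs = pair_models(
    ["lmms-lab/llava-onevision-qwen2-0.5b-ov", "lmms-lab/llava-onevision-qwen2-7b-ov"],
    ["llava_onevision"],
)
print(pairs)
```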

Dependencies

argparse: Command-line parsing
glob: File pattern matching
json: Result file parsing
os: System commands and file operations
subprocess: Git command execution
time: Runtime measurement
pathlib: Path manipulation
lmms_eval: Task registry and utilities

Related Components

Principle: Regression_Testing_Principle
Evaluation entrypoint: `lmms_eval/__main__.py`
Task registry: `lmms_eval/api/registry.py`

Attribution

The code is adapted from EleutherAI's LM Evaluation Harness regression testing script.
