Principle:EvolvingLMMs Lab Lmms eval Regression Testing Principle

Overview

The Regression Testing principle defines the methodology for automated performance comparison across code changes in lmms_eval. It enables developers to validate that changes don't degrade model performance while potentially improving speed or capabilities.

Core Concepts

Baseline Comparison

All regression tests compare against a baseline:

Current branch serves as baseline
Comparison branches are evaluated in sequence
Differences are computed and highlighted
Both improvements and regressions are detected

Multi-Dimensional Testing

Regression tests evaluate multiple aspects:

Accuracy: Model performance on tasks
Runtime: Execution speed
Memory: Resource utilization (a potential future dimension)
Consistency: Result reproducibility

Branch-Based Testing

Tests operate on git branches:

Automatic branch switching
Clean state between tests
Return to original branch after completion
Suitable for CI/CD integration

Statistical Reporting

Results are presented in structured format:

Markdown tables for readability
Percentage values for interpretability
Difference highlighting (bold for improvements)
Runtime comparison as percentage of baseline

Design Principles

Automated Workflow

The testing process should:

Be fully automated from command invocation to result reporting
Require minimal manual intervention
Handle branch switching automatically
Clean up state after completion

Comprehensive Coverage

Tests should cover:

Representative task selection (single-image, multi-image, video)
Multiple models when appropriate
Key performance metrics for each task
Both accuracy and speed

Clear Communication

Results should:

Be easy to interpret at a glance
Be suitable for GitHub issues/PRs
Highlight significant changes
Include both absolute values and differences

Reproducible Execution

Test runs should:

Be deterministic when possible (temperature=0)
Use consistent hardware configuration
Document all parameters
Save full results for analysis

Implementation Guidelines

Baseline Establishment

The baseline should:

Run first before any branch switching
Use current branch state
Capture full result metadata
Measure runtime accurately

Branch Evaluation

For each comparison branch:

1. Switch to the branch
2. Run identical evaluation
3. Capture results and runtime
4. Return to baseline branch
5. Compare against baseline
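The per-branch loop can be sketched as follows. This is a minimal illustration, not the project's script: `evaluate_branches`, `checkout`, and `run_eval` are hypothetical names, and the checkout and evaluation steps are passed in as callables so the control flow is visible without touching git.

```python
import time
from typing import Callable, Dict, List


def evaluate_branches(
    branches: List[str],
    baseline_branch: str,
    checkout: Callable[[str], None],
    run_eval: Callable[[], Dict[str, float]],
) -> Dict[str, dict]:
    """Run the same evaluation on the baseline and on each comparison branch."""
    report: Dict[str, dict] = {}

    # The baseline runs first, before any branch switching.
    start = time.perf_counter()
    scores = run_eval()
    report[baseline_branch] = {"scores": scores, "runtime": time.perf_counter() - start}

    for branch in branches:
        checkout(branch)
        try:
            start = time.perf_counter()
            scores = run_eval()
            runtime = time.perf_counter() - start
        finally:
            # Always return to the baseline branch, even if the evaluation raised.
            checkout(baseline_branch)
        report[branch] = {"scores": scores, "runtime": runtime}
    return report
```

Comparison against the baseline then reads `report[baseline_branch]` and each `report[branch]` side by side.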

Metric Selection

Choose metrics that:

Represent task performance accurately
Are consistent across runs
Align with published benchmarks
Can be extracted automatically

Result Formatting

Format results as:

Markdown tables for GitHub
Percentage values for clarity
Bold formatting for positive changes
Clear branch identification

Key Operations

Model Evaluation

Run evaluation with:

Distributed inference (accelerate)
Fixed random seed for reproducibility
Consistent generation parameters
Timestamped output paths

Metric Extraction

Extract task-specific metrics:

Map tasks to primary metrics
Handle missing results gracefully
Support multiple metric types
Return zero for unavailable metrics
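A minimal sketch of this lookup is shown below. The metric names in `PRIMARY_METRIC` and the assumed JSON layout (`results -> task -> "metric,filter"`) are illustrative assumptions, not values confirmed by the source; check them against the actual result files.

```python
from typing import Any, Dict

# Hypothetical task-to-metric map; real metric keys depend on each task's config.
PRIMARY_METRIC = {
    "ocrbench": "ocrbench_accuracy",
    "mmmu_val": "mmmu_acc",
    "ai2d": "exact_match",
}


def extract_metric(results: Dict[str, Any], task: str) -> float:
    """Return the primary metric for a task, or 0.0 when it is unavailable."""
    metric = PRIMARY_METRIC.get(task)
    if metric is None:
        return 0.0
    task_results = results.get("results", {}).get(task, {})
    for key, value in task_results.items():
        # Assumed key format: "metric" or "metric,filter_name".
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key.split(",")[0] == metric and is_number:
            return float(value)
    return 0.0
```

Returning zero keeps the summary table complete, but it also makes a missing result indistinguishable from a true score of zero, so logging the miss separately is worth considering.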

Difference Calculation

Compare results:

Compute absolute differences
Convert to percentages
Determine improvement vs. regression
Format with appropriate precision
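One way to express this comparison, assuming scores are stored as fractions in [0, 1] and that a higher score is better (both are assumptions; `compare_scores` is a hypothetical helper):

```python
from typing import Tuple


def compare_scores(baseline: float, branch: float, precision: int = 2) -> Tuple[float, str]:
    """Return the difference in percentage points and a verdict label."""
    # Round first so that floating-point noise is not reported as a change.
    diff = round((branch - baseline) * 100, precision)
    if diff > 0:
        verdict = "improvement"
    elif diff < 0:
        verdict = "regression"
    else:
        verdict = "unchanged"
    return diff, verdict
```

For metrics where lower is better (for example an error rate), the verdict logic would need to be inverted.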

Runtime Analysis

Analyze execution time:

Measure total runtime per branch
Compute percentage relative to baseline
Identify performance optimizations
Detect performance regressions
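The runtime comparison reduces to `branch_seconds / baseline_seconds * 100`. The sketch below renders it as a Markdown table; `runtime_table` is a hypothetical name, and the exact formatting (for example `100.00%` for the baseline row) is an assumption rather than the script's confirmed output.

```python
from typing import Dict


def runtime_table(runtimes: Dict[str, float], baseline: str) -> str:
    """Render runtimes as a Markdown table with percentages relative to the baseline."""
    base = runtimes[baseline]
    if base <= 0:
        raise ValueError("baseline runtime must be positive")
    lines = ["|branch|runtime|%|", "|--|--|--|"]
    # Dicts preserve insertion order, so rows follow execution order.
    for branch, seconds in runtimes.items():
        pct = seconds / base * 100
        lines.append(f"|{branch}|{seconds:.1f}s|{pct:.2f}%|")
    return "\n".join(lines)
```

A value below 100% indicates the branch ran faster than the baseline; a value above 100% indicates a slowdown.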

Usage Patterns

Pre-Merge Regression Testing

# Test feature branch before merging
git checkout main
python tools/regression.py --branches feature-branch

# Review output tables in terminal
# Post to PR for review

Multi-Branch Comparison

# Compare several optimization approaches
python tools/regression.py \
    --branches opt-v1,opt-v2,opt-v3 \
    --tasks ocrbench,mmmu_val \
    --limit 100

Quick Smoke Test

# Fast test with small limit
python tools/regression.py \
    --branches experimental \
    --tasks ai2d \
    --limit 10

Evaluation Configuration

Distributed Inference

Use accelerate for parallelization:

Multiple processes (typically 8)
Fixed port to avoid conflicts
Proper GPU allocation
Synchronization between processes
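As a rough sketch, the launch command can be assembled as an argument list before being handed to a subprocess. `build_eval_command` is a hypothetical helper, the default port is an arbitrary placeholder, and the flags shown are assumed from common `accelerate launch` and `lmms_eval` usage; verify them against the installed versions.

```python
from typing import List, Optional


def build_eval_command(
    model: str,
    model_args: str,
    tasks: List[str],
    output_path: str,
    num_processes: int = 8,
    port: int = 12345,  # placeholder; any free fixed port works
    batch_size: int = 1,
    limit: Optional[int] = None,
) -> List[str]:
    """Assemble an accelerate launch command for one evaluation run."""
    cmd = [
        "accelerate", "launch",
        f"--num_processes={num_processes}",
        f"--main_process_port={port}",
        "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", str(batch_size),
        "--log_samples",
        "--output_path", output_path,
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd
```

Building the command as a list (rather than a shell string) avoids quoting problems when it is passed to `subprocess.run`.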

Task Selection

Include diverse tasks:

Single-image: OCRBench, MMMU, AI2D
Multi-image: MUIRBench
Video: VideoMME
The selection should be representative of real use cases

Generation Parameters

Use consistent settings:

temperature=0 for deterministic output
Fixed batch size
Consistent max_new_tokens
Same model arguments

Output Format

Performance Table

Structure:

```
|task|model1|model2|...|
|--|--|--|...|
|task_name (baseline)|score1|score2|...|
|task_name (branch)|score1|score2|...|
|task_name (diff)|diff1|diff2|...|
```

Characteristics:

One row per task per branch
Difference row after each branch
Bold for positive differences
Clear branch identification
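A table with this shape could be generated along the following lines. This is a sketch under assumptions: `performance_table` is a hypothetical function, scores are fractions in [0, 1], and each score list is ordered the same way as `models`.

```python
from typing import Dict, List


def performance_table(
    task: str,
    models: List[str],
    baseline: List[float],
    branches: Dict[str, List[float]],
) -> str:
    """Render baseline, branch, and diff rows for one task as a Markdown table."""

    def row(label: str, cells: List[str]) -> str:
        return f"|{label}|" + "|".join(cells) + "|"

    lines = [row("task", models), "|--|" + "--|" * len(models)]
    lines.append(row(f"{task} (baseline)", [f"{s * 100:.2f}%" for s in baseline]))
    for name, scores in branches.items():
        lines.append(row(f"{task} ({name})", [f"{s * 100:.2f}%" for s in scores]))
        diffs = []
        for base, score in zip(baseline, scores):
            diff = round((score - base) * 100, 2)
            cell = f"{diff:+.2f}%"
            # Bold marks a positive difference, matching the convention above.
            diffs.append(f"**{cell}**" if diff > 0 else cell)
        lines.append(row(f"{task} (diff)", diffs))
    return "\n".join(lines)
```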

Runtime Table

Structure:

```
|branch|runtime|%|
|--|--|--|
|baseline|1234.5s|100%|
|branch1|1189.2s|96.33%|
```

Characteristics:

Absolute runtime in seconds
Percentage relative to baseline
Sorted by execution order

Best Practices

For Test Design

Include representative task mix
Use meaningful sample sizes
Balance speed and coverage
Document expected behaviors

For Result Interpretation

Consider statistical significance
Look for patterns across tasks
Investigate unexpected regressions
Verify improvements are real
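As a rough guide to whether a difference exceeds sampling noise, a two-proportion z-test can be applied to accuracy-style metrics. The sketch below is an approximation under stated assumptions: both runs use the same number of samples `n`, the metric is a simple proportion, and the runs are treated as independent samples (they actually share the same examples, so a paired test would be more precise). `looks_significant` is a hypothetical helper.

```python
import math


def looks_significant(p_base: float, p_branch: float, n: int, z_crit: float = 1.96) -> bool:
    """Approximate two-proportion z-test with equal sample sizes."""
    pooled = (p_base + p_branch) / 2
    std_err = math.sqrt(2 * pooled * (1 - pooled) / n)
    if std_err == 0:
        # Both scores are exactly 0 or exactly 1; any difference is trivially real.
        return p_base != p_branch
    return abs(p_branch - p_base) / std_err > z_crit
```

With `--limit 100`, a two-point change near 50% accuracy is well inside the noise band under this approximation, which is one reason small smoke tests should not be read as accuracy verdicts.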

For CI Integration

Set appropriate thresholds
Allow for small variations
Cache models/datasets
Fail on significant regressions
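A CI gate built on these points might look like the sketch below. The tolerance value and the name `find_regressions` are illustrative assumptions; scores are taken to be fractions in [0, 1] where higher is better.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    branch: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return tasks whose score dropped by more than the allowed tolerance."""
    # A task missing from the branch results counts as a score of 0.0.
    return sorted(
        task
        for task, base_score in baseline.items()
        if base_score - branch.get(task, 0.0) > tolerance
    )


# Example CI usage (not executed here):
#   failed = find_regressions(baseline_scores, branch_scores)
#   if failed:
#       raise SystemExit(f"regressions detected: {failed}")
```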

For Performance Testing

Run on consistent hardware
Avoid interference from other processes
Warm up before timing
Repeat runs for stability
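Warm-up and repeated timing can be sketched with the standard library. `time_runs` is a hypothetical helper; using the median as the summary statistic is a design choice made here, not something the source prescribes.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_runs(fn: Callable[[], object], warmup: int = 1, repeats: int = 3) -> Tuple[float, List[float]]:
    """Time fn several times after warm-up runs; return the median and all samples."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples), samples
```

The median is less sensitive than the mean to a single slow run caused by interference from other processes.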

Error Handling

Branch Switching Failures

Detect failed checkouts
Return empty results for failed branches
Don't abort entire run
Report which branches failed

Evaluation Failures

Catch subprocess errors
Log failure details
Continue with remaining branches
Return zero for failed tasks

Result Parsing Failures

Handle missing JSON files
Gracefully handle malformed results
Report parsing errors
Use zero for missing metrics
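Defensive loading of a results file might be sketched as follows. `load_results` is a hypothetical helper; it returns an empty dict on any parsing failure so that downstream metric lookups fall back to zero.

```python
import json
import logging
from typing import Any, Dict


def load_results(path: str) -> Dict[str, Any]:
    """Load a results JSON file; return an empty dict if it is missing or malformed."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except FileNotFoundError:
        logging.warning("results file not found: %s", path)
        return {}
    except json.JSONDecodeError as err:
        logging.warning("malformed results file %s: %s", path, err)
        return {}
    # Guard against valid JSON that is not an object (e.g. a list or a number).
    return data if isinstance(data, dict) else {}
```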

Integration Points

Git Integration

Query current branch
Checkout comparison branches
Restore original branch
Handle dirty working trees
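These git operations fit naturally into a context manager that restores the original branch on exit. The sketch below is one possible shape, not the script's actual code: `on_branch` and `_git` are hypothetical names, and refusing to run on a dirty tree is one of several reasonable policies (stashing is another). The `run` parameter exists only so the function can be exercised without a real repository.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def _git(args: List[str], run: Callable = subprocess.run) -> str:
    """Run a git command and return its trimmed standard output."""
    result = run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()


@contextmanager
def on_branch(branch: str, run: Callable = subprocess.run) -> Iterator[str]:
    """Check out a branch for the duration of the block, then restore the original."""
    # Note: on a detached HEAD this returns the literal string "HEAD".
    original = _git(["rev-parse", "--abbrev-ref", "HEAD"], run)
    if _git(["status", "--porcelain"], run):
        raise RuntimeError("working tree has uncommitted changes")
    _git(["checkout", branch], run)
    try:
        yield original
    finally:
        _git(["checkout", original], run)
```

Usage would be `with on_branch("feature-branch"): ...`, with the evaluation call placed inside the block.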

Accelerate Integration

Launch with correct parameters
Manage process lifecycle
Handle port allocation
Collect distributed results

Task Registry

Support "all_tasks" keyword
Pattern matching for task selection
Validate task existence
Map to evaluation commands

Limitations and Considerations

Model Alignment

Assumes model_types align with model list
Index-based mapping can be fragile
Consider more robust configuration
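One more robust configuration, offered only as a suggestion, is to pair each model with its type explicitly and fail fast when the two lists disagree in length. `ModelSpec` and `pair_models` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelSpec:
    pretrained: str  # model identifier or checkpoint path
    model_type: str  # evaluator model type used to load it


def pair_models(models: List[str], model_types: List[str]) -> List[ModelSpec]:
    """Pair models with their types, rejecting mismatched list lengths."""
    if len(models) != len(model_types):
        raise ValueError(
            f"{len(models)} models but {len(model_types)} model types were given"
        )
    return [ModelSpec(m, t) for m, t in zip(models, model_types)]
```

Carrying the pair as a single object removes the need to keep two parallel lists in the same order throughout the script.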

Single Metric per Task

Only primary metric is compared
Secondary metrics ignored in summary
Full results available in JSON

Sequential Evaluation

Branches tested sequentially
Long runtime for many branches
Consider parallel branch testing

Working Directory State

Modifies git state
May conflict with local changes
Best run on clean working tree

Related Principles

Environment_Setup: Setting up evaluation environment
Model_Inference: Running model inference
Results_Output: Result file formats
Metric_Aggregation: Computing metrics

Implementations

Regression_Testing: Main regression testing script
Implementation:EvolvingLMMs_Lab_Lmms_eval_Regression_Testing
