Principle:EvolvingLMMs Lab Lmms eval Regression Testing Principle

From Leeroopedia
Revision as of 18:26, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/EvolvingLMMs_Lab_Lmms_eval_Regression_Testing_Principle.md)

Overview

The Regression Testing principle defines the methodology for automated performance comparison across code changes in lmms_eval. It lets developers verify that a change does not degrade model performance, including changes intended to improve speed or add capabilities.

Core Concepts

Baseline Comparison

All regression tests compare against a baseline:

  • Current branch serves as baseline
  • Comparison branches are evaluated in sequence
  • Differences are computed and highlighted
  • Both improvements and regressions are detected

Multi-Dimensional Testing

Regression tests evaluate multiple aspects:

  • Accuracy: Model performance on tasks
  • Runtime: Execution speed
  • Memory: Resource utilization (a possible additional dimension)
  • Consistency: Result reproducibility

Branch-Based Testing

Tests operate on git branches:

  • Automatic branch switching
  • Clean state between tests
  • Return to original branch after completion
  • Suitable for CI/CD integration

Statistical Reporting

Results are presented in structured format:

  • Markdown tables for readability
  • Percentage values for interpretability
  • Difference highlighting (bold for improvements)
  • Runtime comparison as percentage of baseline

Design Principles

Automated Workflow

The testing process should:

  • Run fully automated from command invocation to result reporting
  • Require minimal manual intervention
  • Handle branch switching automatically
  • Clean up state after completion

Comprehensive Coverage

Tests should cover:

  • Representative task selection (single-image, multi-image, video)
  • Multiple models when appropriate
  • Key performance metrics for each task
  • Both accuracy and speed

Clear Communication

Results should:

  • Be easy to interpret at a glance
  • Be suitable for posting in GitHub issues and PRs
  • Highlight significant changes
  • Include both absolute values and differences

Reproducible Execution

Test runs should:

  • Be deterministic when possible (temperature=0)
  • Use a consistent hardware configuration
  • Document all parameters
  • Save full results for analysis

Implementation Guidelines

Baseline Establishment

The baseline should:

  • Run first before any branch switching
  • Use current branch state
  • Capture full result metadata
  • Measure runtime accurately

Branch Evaluation

For each comparison branch:

  1. Switch to the branch
  2. Run the identical evaluation
  3. Capture results and runtime
  4. Return to the baseline branch
  5. Compare against the baseline
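
The per-branch steps can be sketched as a small loop. This is a minimal illustration, not the actual `tools/regression.py` code: `compare_branches`, `git_checkout`, and the `evaluate` callback are hypothetical names, and failure handling for bad checkouts is left out for brevity.

```python
import subprocess
import time
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]


def git_checkout(branch: str) -> None:
    """Switch the working tree to `branch`; raises CalledProcessError on failure."""
    subprocess.run(["git", "checkout", branch], check=True)


def compare_branches(
    baseline_branch: str,
    branches: List[str],
    evaluate: Callable[[], Scores],
    checkout: Callable[[str], None] = git_checkout,
) -> Dict[str, Tuple[Scores, float]]:
    """Return {branch: (scores, runtime_seconds)} with the baseline entry first."""

    def timed_eval() -> Tuple[Scores, float]:
        start = time.perf_counter()
        scores = evaluate()
        return scores, time.perf_counter() - start

    results: Dict[str, Tuple[Scores, float]] = {}
    # The baseline runs first, before any branch switching.
    results[baseline_branch] = timed_eval()

    for branch in branches:
        checkout(branch)                    # step 1: switch to the branch
        try:
            results[branch] = timed_eval()  # steps 2-3: same evaluation, capture results and runtime
        finally:
            checkout(baseline_branch)       # step 4: always return to the baseline branch
    return results                          # step 5: the caller compares entries against the baseline
```

Passing `checkout` as a parameter keeps the loop testable without touching a real repository.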

Metric Selection

Choose metrics that:

  • Represent task performance accurately
  • Are consistent across runs
  • Align with published benchmarks
  • Can be extracted automatically

Result Formatting

Format results with:

  • Markdown tables for GitHub
  • Percentage values for clarity
  • Bold formatting for positive changes
  • Clear branch identification

Key Operations

Model Evaluation

Run evaluation with:

  • Distributed inference (accelerate)
  • Fixed random seed for reproducibility
  • Consistent generation parameters
  • Timestamped output paths
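
One way to assemble such an invocation is sketched below. The helper `build_eval_command` and its defaults (port, output root, batch size) are assumptions for illustration. The flags follow commonly documented `accelerate launch` and `lmms_eval` usage and should be checked against the installed versions. Seed and generation settings are omitted and would be added the same way.

```python
import time
from typing import List


def build_eval_command(
    model: str,
    model_args: str,
    tasks: List[str],
    limit: int,
    num_processes: int = 8,      # the document mentions 8 processes as typical
    port: int = 12399,           # hypothetical fixed port, chosen to avoid conflicts
    output_root: str = "./logs",  # hypothetical output directory
) -> List[str]:
    """Build an argument list suitable for subprocess.run (no shell quoting needed)."""
    timestamp = time.strftime("%Y%m%d_%H%M%S")  # gives each run its own output path
    return [
        "accelerate", "launch",
        f"--num_processes={num_processes}",
        f"--main_process_port={port}",
        "-m", "lmms_eval",
        "--model", model,
        "--model_args", model_args,
        "--tasks", ",".join(tasks),
        "--batch_size", "1",
        "--limit", str(limit),
        "--output_path", f"{output_root}/{timestamp}",
    ]
```

Returning a list rather than a single string avoids shell interpolation issues when model arguments contain special characters.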

Metric Extraction

Extract task-specific metrics:

  • Map tasks to primary metrics
  • Handle missing results gracefully
  • Support multiple metric types
  • Return zero for unavailable metrics
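
A minimal sketch of this lookup follows. The `PRIMARY_METRIC` table and the `results["results"][task][metric]` layout are assumptions; the metric keys are placeholders, not confirmed names from lmms_eval output.

```python
from typing import Any, Dict

# Hypothetical task -> primary metric mapping; the metric keys are placeholders.
PRIMARY_METRIC: Dict[str, str] = {
    "ai2d": "accuracy",
    "ocrbench": "accuracy",
}


def extract_metric(results: Dict[str, Any], task: str) -> float:
    """Return the primary metric for `task`, or 0.0 when it is unavailable."""
    metric = PRIMARY_METRIC.get(task)
    task_results = results.get("results", {}).get(task, {})
    value = task_results.get(metric) if metric else None
    # Accept only real numbers; anything else counts as unavailable.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    return 0.0
```

Note that returning zero makes a missing metric indistinguishable from a genuine score of zero, so a report may want to flag such entries separately.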

Difference Calculation

Compare results:

  • Compute absolute differences
  • Convert to percentages
  • Determine improvement vs. regression
  • Format with appropriate precision
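
The calculation can be sketched as follows, assuming scores are fractions in [0, 1] and that the difference is reported in percentage points. The function name `format_diff` is illustrative.

```python
def format_diff(baseline: float, branch: float, precision: int = 2) -> str:
    """Format (branch - baseline) in percentage points; bold marks an improvement."""
    # Round first so that float noise cannot turn an unchanged score into an "improvement".
    # Adding 0.0 normalizes a possible -0.0 to 0.0.
    diff_pct = round((branch - baseline) * 100, precision) + 0.0
    text = f"{diff_pct:+.{precision}f}%"
    return f"**{text}**" if diff_pct > 0 else text
```

For example, `format_diff(0.50, 0.525)` yields `**+2.50%**`, while `format_diff(0.50, 0.48)` yields `-2.00%`.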

Runtime Analysis

Analyze execution time:

  • Measure total runtime per branch
  • Compute percentage relative to baseline
  • Identify performance optimizations
  • Detect performance regressions
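
As a sketch, the relative runtime and a coarse classification could look like this. The `tolerance` band of 2 percentage points is an arbitrary assumption, not a value from the tool.

```python
def runtime_percent(baseline_seconds: float, branch_seconds: float) -> float:
    """Branch runtime as a percentage of the baseline runtime."""
    if baseline_seconds <= 0:
        raise ValueError("baseline runtime must be positive")
    return round(branch_seconds / baseline_seconds * 100, 2)


def classify_runtime(percent: float, tolerance: float = 2.0) -> str:
    """Label a run as faster, slower, or unchanged within a tolerance band."""
    if percent < 100 - tolerance:
        return "faster"
    if percent > 100 + tolerance:
        return "slower"
    return "unchanged"
```

With a baseline of 1234.5 s and a branch at 1189.2 s, `runtime_percent` gives 96.33, which this sketch classifies as `faster`.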

Usage Patterns

Pre-Merge Regression Testing

# Test feature branch before merging
git checkout main
python tools/regression.py --branches feature-branch

# Review output tables in terminal
# Post to PR for review

Multi-Branch Comparison

# Compare several optimization approaches
python tools/regression.py \
    --branches opt-v1,opt-v2,opt-v3 \
    --tasks ocrbench,mmmu_val \
    --limit 100

Quick Smoke Test

# Fast test with small limit
python tools/regression.py \
    --branches experimental \
    --tasks ai2d \
    --limit 10

Evaluation Configuration

Distributed Inference

Use accelerate for parallelization:

  • Multiple processes (typically 8)
  • Fixed port to avoid conflicts
  • Proper GPU allocation
  • Synchronization between processes

Task Selection

Include diverse tasks:

  • Single-image: OCRBench, MMMU, AI2D
  • Multi-image: MUIRBench
  • Video: VideoMME
  • Representative of use cases

Generation Parameters

Use consistent settings:

  • temperature=0 for deterministic output
  • Fixed batch size
  • Consistent max_new_tokens
  • Same model arguments

Output Format

Performance Table

Structure:

```
|task|model1|model2|...|
|--|--|--|...|
|task_name (baseline)|score1|score2|...|
|task_name (branch)|score1|score2|...|
|task_name (diff)|diff1|diff2|...|
```

Characteristics:

  • One row per task per branch
  • Difference row after each branch
  • Bold for positive differences
  • Clear branch identification
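
A table with this structure can be rendered with a short helper. The sketch below assumes scores are fractions in [0, 1] listed in the same order as the models; `performance_table` is a hypothetical name.

```python
from typing import List


def performance_table(
    task: str,
    models: List[str],
    baseline: List[float],
    branch_name: str,
    branch: List[float],
) -> str:
    """Render baseline, branch, and diff rows for one task as a Markdown table."""

    def pct(score: float) -> str:
        return f"{score * 100:.2f}%"

    def diff(base: float, new: float) -> str:
        delta = round((new - base) * 100, 2) + 0.0  # percentage points, -0.0 normalized
        text = f"{delta:+.2f}%"
        return f"**{text}**" if delta > 0 else text  # bold only for improvements

    rows = [
        ["task", *models],
        ["--"] * (len(models) + 1),
        [f"{task} (baseline)", *map(pct, baseline)],
        [f"{task} ({branch_name})", *map(pct, branch)],
        [f"{task} (diff)", *(diff(b, n) for b, n in zip(baseline, branch))],
    ]
    return "\n".join("|" + "|".join(row) + "|" for row in rows)
```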

Runtime Table

Structure:

```
|branch|runtime|%|
|--|--|--|
|baseline|1234.5s|100%|
|branch1|1189.2s|96.33%|
```

Characteristics:

  • Absolute runtime in seconds
  • Percentage relative to baseline
  • Sorted by execution order
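
The runtime table can be produced the same way. This sketch assumes the first entry is the baseline and that entries are already in execution order; `runtime_table` is an illustrative name.

```python
from typing import List, Tuple


def runtime_table(runtimes: List[Tuple[str, float]]) -> str:
    """Render (branch, seconds) pairs as a Markdown table; the first pair is the baseline."""
    if not runtimes or runtimes[0][1] <= 0:
        raise ValueError("a positive baseline runtime is required")
    baseline_seconds = runtimes[0][1]
    lines = ["|branch|runtime|%|", "|--|--|--|"]
    for name, seconds in runtimes:
        percent = round(seconds / baseline_seconds * 100, 2)
        # The 'g' format drops trailing zeros, so the baseline prints as 100%.
        lines.append(f"|{name}|{seconds:.1f}s|{percent:g}%|")
    return "\n".join(lines)
```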

Best Practices

For Test Design

  • Include representative task mix
  • Use meaningful sample sizes
  • Balance speed and coverage
  • Document expected behaviors

For Result Interpretation

  • Consider statistical significance
  • Look for patterns across tasks
  • Investigate unexpected regressions
  • Verify improvements are real
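
As a rough aid for the significance check, accuracy can be treated as a binomial proportion. The sketch below is one possible approximation, not something the tool is known to implement. It assumes both scores are fractions measured on `n` samples each (for example the `--limit` value) and treats the two runs as independent. Because the same samples are normally evaluated on both branches, a paired test would be more precise.

```python
import math


def is_significant(p_base: float, p_branch: float, n: int, z: float = 1.96) -> bool:
    """True if the accuracy difference exceeds z standard errors (about 95% for z=1.96)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_base * (1 - p_base) / n + p_branch * (1 - p_branch) / n)
    return abs(p_branch - p_base) > z * se
```

With 100 samples, a change from 0.70 to 0.72 is well inside the noise under this approximation, which is why small `--limit` values suit smoke tests better than accuracy comparisons.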

For CI Integration

  • Set appropriate thresholds
  • Allow for small variations
  • Cache models/datasets
  • Fail on significant regressions
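
A threshold gate for CI might look like the following sketch. The default `tolerance` of 0.01 (one percentage point, with scores as fractions) is an assumed value that each project would tune.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    branch: Dict[str, float],
    tolerance: float = 0.01,
) -> List[str]:
    """Return tasks whose score dropped by more than `tolerance` relative to the baseline."""
    return [
        task
        for task, base_score in baseline.items()
        if base_score - branch.get(task, 0.0) > tolerance  # a missing task counts as zero
    ]


def ci_exit_code(regressions: List[str]) -> int:
    """Non-zero exit code makes the CI job fail when any task regressed."""
    return 1 if regressions else 0
```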

For Performance Testing

  • Run on consistent hardware
  • Avoid interference from other processes
  • Warm up before timing
  • Multiple runs for stability
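
Warm-up and repeated timing can be sketched as below. `time_median` and its defaults are illustrative, and the median is used here as one reasonable way to damp outliers.

```python
import statistics
import time
from typing import Callable


def time_median(fn: Callable[[], object], warmup: int = 1, repeats: int = 3) -> float:
    """Run `fn` untimed `warmup` times, then return the median of `repeats` timed runs."""
    for _ in range(warmup):
        fn()  # warm caches and lazy initialization before measuring
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```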

Error Handling

Branch Switching Failures

  • Detect failed checkouts
  • Return empty results for failed branches
  • Don't abort entire run
  • Report which branches failed

Evaluation Failures

  • Catch subprocess errors
  • Log failure details
  • Continue with remaining branches
  • Return zero for failed tasks

Result Parsing Failures

  • Handle missing JSON files
  • Gracefully handle malformed results
  • Report parsing errors
  • Use zero for missing metrics
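
The evaluation and parsing cases can be sketched with two small helpers that log the problem and return a neutral value instead of raising. The names `run_command` and `load_results` and the logger name are hypothetical.

```python
import json
import logging
import subprocess
from pathlib import Path
from typing import Any, Dict, List

log = logging.getLogger("regression")


def run_command(cmd: List[str]) -> bool:
    """Run a subprocess; report failure instead of aborting the whole run."""
    try:
        subprocess.run(cmd, check=True)
        return True
    except (subprocess.CalledProcessError, FileNotFoundError) as exc:
        log.error("command failed: %s (%s)", " ".join(cmd), exc)
        return False


def load_results(path: Path) -> Dict[str, Any]:
    """Load a results JSON file; return {} if it is missing or malformed."""
    try:
        with open(path, encoding="utf-8") as handle:
            data = json.load(handle)
    except FileNotFoundError:
        log.error("results file not found: %s", path)
        return {}
    except json.JSONDecodeError as exc:
        log.error("malformed results in %s: %s", path, exc)
        return {}
    # A valid JSON document that is not an object is also treated as unusable.
    return data if isinstance(data, dict) else {}
```

A caller can then skip a branch when `run_command` returns `False` and continue with the remaining branches.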

Integration Points

Git Integration

  • Query current branch
  • Checkout comparison branches
  • Restore original branch
  • Handle dirty working trees
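
These git operations fit naturally into a context manager. The sketch below uses standard git commands (`status --porcelain`, `rev-parse --abbrev-ref HEAD`, `checkout`); the names `run_git` and `on_branch` are hypothetical, and refusing to run on a dirty tree is one possible policy among several.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, List


def run_git(args: List[str]) -> str:
    """Run a git command and return its trimmed standard output."""
    result = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()


@contextmanager
def on_branch(branch: str, git: Callable[[List[str]], str] = run_git) -> Iterator[None]:
    """Check out `branch` for the duration of the block, then restore the original branch."""
    # Porcelain output is empty only for a clean tree (untracked files also count as dirty).
    if git(["status", "--porcelain"]):
        raise RuntimeError("working tree is dirty; commit or stash changes first")
    original = git(["rev-parse", "--abbrev-ref", "HEAD"])  # query the current branch
    git(["checkout", branch])
    try:
        yield
    finally:
        git(["checkout", original])  # restore even if the evaluation raised
```

Usage would be `with on_branch("feature-branch"): ...`, with the evaluation running inside the block.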

Accelerate Integration

  • Launch with correct parameters
  • Manage process lifecycle
  • Handle port allocation
  • Collect distributed results

Task Registry

  • Support "all_tasks" keyword
  • Pattern matching for task selection
  • Validate task existence
  • Map to evaluation commands

Limitations and Considerations

Model Alignment

  • Assumes model_types align with model list
  • Index-based mapping can be fragile
  • Consider more robust configuration
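
One more robust alternative to index-based alignment is to pair each model type with its checkpoint explicitly. The sketch below is a suggestion only: `ModelSpec`, `parse_model_specs`, and the `type=checkpoint` syntax are hypothetical and not part of the existing tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelSpec:
    model_type: str  # evaluator model type
    pretrained: str  # checkpoint identifier


def parse_model_specs(spec: str) -> List[ModelSpec]:
    """Parse 'type=checkpoint,type=checkpoint' into explicit pairs."""
    specs: List[ModelSpec] = []
    for item in spec.split(","):
        # Split on the first '=' only, so checkpoint names may contain '='.
        model_type, sep, pretrained = item.strip().partition("=")
        if not sep or not model_type or not pretrained:
            raise ValueError(f"expected 'type=checkpoint', got {item!r}")
        specs.append(ModelSpec(model_type, pretrained))
    return specs
```

Because each entry carries its own type, reordering or removing a model cannot silently misalign the two lists.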

Single Metric per Task

  • Only primary metric is compared
  • Secondary metrics ignored in summary
  • Full results available in JSON

Sequential Evaluation

  • Branches tested sequentially
  • Long runtime for many branches
  • Consider parallel branch testing

Working Directory State

  • Modifies git state
  • May conflict with local changes
  • Best run on clean working tree
